By Michele Goetz, Vice President of Product Marketing, Harte-Hanks Trillium Software
Data profiling is one of those activities that always seems to be part of a data integration, system consolidation or MDM project. The effort is conducted on disparate databases and data marts, and insight is supposed to be gleaned from the exercise. Yet, what is really learned?
Did you really think about and map a framework for what you wanted to understand? Do you find that you still have conflicts and implementation issues because there are things you didn’t discover during data profiling that now rear their ugly heads?
I find more and more that the biggest issue in data profiling exercises is that investigation and discovery fail to account for the three states of data:
- Creation
- Change
- Consumption
If you only point data profiling at a single source, you don’t have a full picture of how data flows through and is used by an organization. Data quality issues arise in the shadows and are difficult to address because there is no context for the data’s value and use within the business. So, you need to understand and compare the state of data in point-of-capture databases, in supporting staging areas, and in a consolidated or federated data warehouse.
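To make the comparison concrete, here is a minimal sketch in Python of profiling the same field at each of the three states. The state names, the `country` column, and the sample rows are all hypothetical, and the metrics (null rate, distinct count, most common values) are just one reasonable starting set, not a complete profiling method.

```python
from collections import Counter

def profile_column(rows, column):
    """Return basic profile metrics for one column in a list of dict rows."""
    values = [row.get(column) for row in rows]
    # Treat None and empty strings as missing values.
    non_null = [v for v in values if v not in (None, "")]
    total = len(values)
    return {
        "rows": total,
        "null_rate": (total - len(non_null)) / total if total else 0.0,
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

def compare_states(states, column):
    """Profile the same column in every data state so results sit side by side."""
    return {name: profile_column(rows, column) for name, rows in states.items()}

# Hypothetical sample rows: point of capture, staging area, data warehouse.
states = {
    "creation": [{"country": "US"}, {"country": "usa"}, {"country": ""}, {"country": "DE"}],
    "change": [{"country": "US"}, {"country": "US"}, {"country": None}, {"country": "DE"}],
    "consumption": [{"country": "US"}, {"country": "US"}, {"country": "DE"}],
}

report = compare_states(states, "country")
for name, metrics in report.items():
    print(name, metrics)
```

In this made-up data, the inconsistent "usa" value exists only at creation, and the record with a missing value disappears before consumption. Neither fact would surface if you profiled the warehouse alone.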
A simple way to test this is to see what happens when a call from a business unit comes in to address a data issue. Typically the triage goes something like this:
Step 1: You look at the source system to validate the data quality issue.
Step 2: You look at how the data comes into the system.
Step 3: You look at the business process that generates or changes the data as it comes into the system.
Quite the archeological dig!
Next time you undertake projects in data management or business process optimization, DO:
- Develop a data profiling practice that investigates data conditions across various data states,
- Start top-down from the business process rather than bottom-up from the data warehouse to ensure data context,
- Maintain the data profiling rules used in the investigation, add data quality business rules, and monitor the three states of data to stay ahead of issues.
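As a rough illustration of that last point, the sketch below keeps a small set of rules in one place and runs them against each data state, flagging any rule whose failure rate crosses a threshold. The rule names, the `email` field, the sample rows, and the 30% threshold are assumptions made for the example, not a recommended rule set.

```python
import re

# Hypothetical rules: each takes a row (dict) and returns True when the row passes.
RULES = {
    "email_present": lambda row: bool(row.get("email")),
    "email_format": lambda row: bool(
        re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", row.get("email") or "")
    ),
}

def failure_rates(rows, rules):
    """Return the share of rows that fail each rule."""
    total = len(rows)
    return {
        name: (sum(1 for row in rows if not check(row)) / total if total else 0.0)
        for name, check in rules.items()
    }

def monitor(states, rules, threshold):
    """List (state, rule, failure rate) wherever the rate exceeds the threshold."""
    alerts = []
    for state, rows in states.items():
        for rule, rate in failure_rates(rows, rules).items():
            if rate > threshold:
                alerts.append((state, rule, rate))
    return alerts

# Hypothetical sample rows for each of the three states.
states = {
    "creation": [{"email": "a@example.com"}, {"email": "bad"},
                 {"email": ""}, {"email": "b@example.com"}],
    "change": [{"email": "a@example.com"}, {"email": "b@example.com"}],
    "consumption": [{"email": "a@example.com"}, {"email": None}],
}

alerts = monitor(states, RULES, threshold=0.3)
for state, rule, rate in alerts:
    print(f"{state}: {rule} failing on {rate:.0%} of rows")
```

Running the same rules on a schedule against all three states, rather than only against the warehouse, is what lets you see where an issue first appears.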
What do you think? Good process? Did I miss anything?



