

By Ed Wrazen, Vice President, Product Marketing, Harte-Hanks Trillium Software
I often read that name and address cleansing is considered easy and a commodity compared to cleansing of other data domains such as product data, financial data, parts and materials. Boy, that annoys me, and I don’t get ruffled easily.
Certainly, name and address data is better understood and generally better structured than other types of data (though this is not always the case). As you know, companies already have generally accepted rules for names and addresses. There are fewer, less diverse attributes (first name, surname, address1, address2, city, state, etc.) than in other data domains, and these are more universally known.
We know first name abbreviations and synonyms (Peggy/Margaret, Mike/Michael), and titles such as Mr, Mrs, Miss, Doctor. We also understand how addresses are specified using rules, formats, postal codes and reference data for street names/city names. So there’s a lot of universal knowledge and data already available that describe how names and addresses should appear.
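To make that concrete, here is a minimal sketch of how such universal knowledge might be encoded as lookup rules. The tables and the `standardize_name` function are my own illustrative assumptions, not part of any particular product, and a real rule set would be far larger and context-sensitive.

```python
# Hypothetical, deliberately tiny reference tables (assumptions for illustration).
NICKNAMES = {"peggy": "Margaret", "mike": "Michael"}
TITLES = {"mr": "Mr", "mrs": "Mrs", "miss": "Miss", "dr": "Doctor", "doctor": "Doctor"}


def _fix_case(token):
    # Only re-case tokens that are all lower or all upper case,
    # so mixed-case surnames such as "McDonald" are left alone.
    if token.islower() or token.isupper():
        return token.capitalize()
    return token


def standardize_name(raw):
    """Split a free-text name into title, first name and surname."""
    tokens = raw.replace(".", " ").split()
    title = None
    if tokens and tokens[0].lower() in TITLES:
        title = TITLES[tokens.pop(0).lower()]
    if not tokens:
        return {"title": title, "first_name": None, "surname": None}
    first = NICKNAMES.get(tokens[0].lower(), _fix_case(tokens[0]))
    surname = " ".join(_fix_case(t) for t in tokens[1:]) or None
    return {"title": title, "first_name": first, "surname": surname}


print(standardize_name("mr. mike smith"))
```

Even this toy version shows where the easy part ends: it assumes the title comes first, the first name second, and everything else is the surname, which is exactly the kind of assumption that breaks on real customer data.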
Yet name and address cleansing still involves many variables and calls for varying degrees of sophistication. Large and mid-sized organizations and businesses, particularly B2C-focused companies, have more complex issues associated with customer data that extend beyond the capabilities of most name and address cleansing technologies.
I would say there’s a big distinction between what name and address cleansing is and what I mean when I talk about “customer data quality.” Customer data is often much more complex, comprehensive and broader than first thought, and the requirements for a customer data quality solution go beyond basic name and address scrubbing.
Contact information may not always be as well structured as you would like. Unstructured name and address data presents many challenges that basic cleansing solutions can’t cope with. The data may span several files or databases with different formats, structures, degrees of quality, standards and conventions. Multiple address elements may be held in a single attribute in one file but spread across several attributes in others, and data held in multiple silos with varying structures will require customizable transformation and merge logic before cleansing can commence.
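As a rough sketch of what that transformation and merge logic can look like, consider two silos: one that holds the whole address in a single attribute and one that spreads it over several. All field names here (`full_name`, `address`, `addr1`, `town`, `zip` and so on) are hypothetical, and the comma-splitting rule is an assumption that only works when the data is politely delimited.

```python
def from_crm(rec):
    # Hypothetical CRM silo: the whole address sits in one comma-separated attribute.
    parts = [p.strip() for p in rec["address"].split(",")]
    parts += [""] * (3 - len(parts))  # pad short addresses so indexing is safe
    # Note: anything beyond the third element is ignored in this sketch.
    return {
        "name": rec["full_name"].strip(),
        "street": parts[0],
        "city": parts[1],
        "postal_code": parts[2],
        "source": "crm",
    }


def from_billing(rec):
    # Hypothetical billing silo: address spread over several attributes.
    lines = (rec.get("addr1", ""), rec.get("addr2", ""))
    street = " ".join(x.strip() for x in lines if x.strip())
    return {
        "name": f'{rec["first"]} {rec["last"]}'.strip(),
        "street": street,
        "city": rec["town"].strip(),
        "postal_code": rec["zip"].strip(),
        "source": "billing",
    }


def merge_silos(crm_records, billing_records):
    """Map both silos onto one common layout, ready for cleansing."""
    return [from_crm(r) for r in crm_records] + [from_billing(r) for r in billing_records]


crm = [{"full_name": " Margaret Smith ", "address": "12 High St, Leeds, LS1 4AB"}]
billing = [{"first": "Mike", "last": "Jones", "addr1": "Flat 2",
            "addr2": "5 Mill Lane", "town": "York", "zip": "YO1 7HH"}]
for row in merge_silos(crm, billing):
    print(row)
```

Each new silo needs its own mapping function, which is why this step has to be customizable rather than hard-wired into the cleansing tool.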
Also, you’ll need broader international data cleansing as your business and technology deployments expand into new geographies. Often a tick-in-the-box solution may be able to handle batch name and address cleansing for one or a few countries and only when that data is well-structured. But once more countries are introduced into the mix or more complex data issues such as the ones I’ve talked about are presented, these solutions fail to deliver, and require a lot of manual effort.
This is largely because it’s not just about the technology but also the vendor's knowledge and capability to develop country-specific rules. This demands an understanding of country-specific name and address data and the ability to create context-sensitive rules for each country. The data quality provider has to have access to relevant country data (street names, types, towns, cities, districts, regions, postal codes) to build the rules into the software rule sets for parsing, standardization, matching and enrichment.
All this available information can be exploited by data quality software solutions, but the investment needed to create a truly global solution, fit for the most complex and demanding requirements, can be enormous, even for name and address cleansing.
A simple cleansing tool may not be able to merge disparate data from multiple silos and may require users to perform extensive formatting, standardization and even cleansing to get it into a usable shape. Not really what you want or expect from a data quality solution, is it?
So, it’s clear that you can’t simply assume your data quality plan or solution will handle your customer data cleansing projects, nor accept that data quality “bundled in” as part of a suite will do the job. Look deeper into your plan and requirements and evaluate your chosen vendors carefully. Ask for references and customer contacts that you can talk to for advice and guidance. And never assume that customer data is EASY!!
I totally agree with everything above.
We have been developing an address cleansing solution for Eastern Europe, where there is no approved gazetteer. The local Post Office file of 6-digit postcodes contains approximately 24,000 street names, whereas we have encountered over 100,000 unique street names in customer databases.
Since the 1989 upheavals (revolutions) around Eastern Europe, there have been numerous name changes (we know of one street which has had 5 different names during the past 20 years!), none of which are centrally documented.
However, none of this can be compared with the complications resulting from 'flowery' addresses (e.g. Aleea Vladimir Mladinovici doctor. agos, stl. post mortem - yes, 8 components in the street name alone!), plus all the many abbreviations, mis-spellings etc.
Finally, the widespread habit of putting all address components into a single field with casual disregard for separators can lead to nightmares!
In summary, such situations call on extensive local knowledge which is virtually impossible to factor into a global solution.
We are nearly 'there' after a two year struggle!
Posted by: Bill | 02/01/2010 at 12:35 PM