by Keith Kohl, VP Product Management, Trillium Software
It’s been estimated that analysts spend 40%-60% of their time accessing and preparing data before they can do anything with it. An industry analyst told me last week, “it’s more like 75-80%!” But what about ensuring the quality and completeness of the data so you can trust the results of your analytics? How can you ensure you are performing analytics on complete, accurate, and trusted data?
Today we announced a new product in our Big Data portfolio: Trillium Refine™. Trillium Refine brings self-service data preparation to a new level that now includes world-class global data quality capabilities, all working on Big Data. I’ll use this blog post, and a series of follow-ups, to explain what we are launching to the market.
Organizations have data scattered across many different sources and file types. Consider an analyst tasked with running a successful digital marketing campaign or identifying credit card fraud: how do they know how to retrieve data from an Oracle database, a JSON or Parquet file from a third party, an S3 bucket in AWS, Big Data sources such as HDFS, a MongoDB NoSQL database, and so on? And once they get access to the data, what next?
Trillium Refine walks the analyst through a simple six-step process to prepare the data:
- Selecting which data sets you need to work with
- Joining all of these data sources
- Enriching data such as adding a product name & size from a SKU
- Choosing the columns to work with
- Filtering out data
- Aggregating data such as min, max, average, count, and count distinct
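The six steps above can be sketched in a few lines of pandas. This is purely illustrative (Trillium Refine drives these steps through its own interface, not through code), and the tables and column names here are made up:

```python
import pandas as pd

# 1. Select the data sets to work with (hypothetical sample data)
orders = pd.DataFrame({
    "sku": ["A100", "A100", "B200", "B200", "B200"],
    "customer_id": [1, 2, 2, 3, 3],
    "amount": [10.0, 12.5, 8.0, 20.0, 15.0],
})
products = pd.DataFrame({
    "sku": ["A100", "B200"],
    "product_name": ["Widget", "Gadget"],
    "size": ["small", "large"],
})

# 2. Join the data sources on a shared key, which also
# 3. enriches each order with the product name & size for its SKU
joined = orders.merge(products, on="sku", how="left")

# 4. Choose the columns to work with
subset = joined[["product_name", "customer_id", "amount"]]

# 5. Filter out data you don't need
filtered = subset[subset["amount"] > 9.0]

# 6. Aggregate: min, max, average, count, and count distinct
summary = filtered.groupby("product_name").agg(
    min_amount=("amount", "min"),
    max_amount=("amount", "max"),
    avg_amount=("amount", "mean"),
    orders=("amount", "count"),
    distinct_customers=("customer_id", "nunique"),
)
print(summary)
```

The point of a self-service tool is that the analyst gets this result without writing any of the code above.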
While these steps save the analyst significant time, the next part is what makes Trillium Refine unique.
A series of data quality steps are used to parse, standardize, match, and enrich the data. To create a single view of a customer or product, for instance, we need everything in a standard format to get the best match.
Let me use a simple example of parsing and standardization.
100 St. Mary St.
As humans we know that this is an address, and that it means 100 Saint Mary Street, because we understand what each token means from its position. Think about all of the different formats for names, addresses, product names (books, toys, automobiles, computers, manufacturing parts, etc.), company names, and so on.
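A toy sketch of that position-aware standardization in Python (this is only an illustration of the idea, not how Trillium’s parser actually works; real data quality engines use rich dictionaries and grammar rules):

```python
def standardize_address(address: str) -> str:
    """Expand 'St' based on where it appears in the address."""
    tokens = address.replace(".", "").split()
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() == "st":
            # Position decides the meaning: "St" at the end of the
            # address is "Street"; before a name it is "Saint".
            out.append("Street" if i == len(tokens) - 1 else "Saint")
        else:
            out.append(tok)
    return " ".join(out)

print(standardize_address("100 St. Mary St."))  # → 100 Saint Mary Street
```

Even this tiny example shows why standardization is hard: the same abbreviation means two different things in one address.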
Once we have all of this data in a common, standard format, we can match it. But even this can be complex. Think about a name, Phil Galati (I’ll pick on our CEO). The same name could appear in many different formats or even be misspelled.
As a marketing analyst, I have a new product to promote and I must make sure I’m targeting the right customer/prospect. If Phil lives in a small town in zip code 60451 (New Lenox, IL – my home town!), he’s probably the only one on that street. But if his zip code is 10023 (upper west side of NYC), there might be more than one person with that name at that address (think about the name Bob Smith!).
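A crude way to see fuzzy name matching in action is Python’s standard-library difflib (purely illustrative; a production matching engine uses far more sophisticated, tunable rules, plus context like the zip code example above). The candidate spellings and the 0.85 threshold here are my own assumptions:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

target = "Phil Galati"
candidates = ["Phil Galati", "Philip Galati", "Phil Gallati", "Bob Smith"]

# Keep only candidates that score above a chosen threshold
matches = [c for c in candidates if similarity(target, c) >= 0.85]
print(matches)  # → ['Phil Galati', 'Philip Galati', 'Phil Gallati']
```

Note that a score alone isn’t enough: as the zip code example shows, whether two similar records are really the same person also depends on how likely duplicates are in that context.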
The next step is to enrich that data with global coverage data, which Trillium Refine provides out of the box.
Finally, I need to get the data into a format the analyst can actually use. Trillium Refine can output the data in a format that can be served up in your favorite analytics or visualization tool (Tableau, Qlik, etc.).
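The simplest version of that hand-off is a flat CSV file, which virtually every visualization tool can ingest. A generic sketch using pandas (the data and file name are made up; Trillium Refine handles this export itself):

```python
import pandas as pd

# A hypothetical prepared result, ready for a BI tool
prepared = pd.DataFrame({
    "product_name": ["Widget", "Gadget"],
    "total_sales": [1250.0, 980.5],
})

# Write a header row plus data rows, no pandas index column,
# so tools like Tableau or Qlik can read it directly
prepared.to_csv("prepared_for_analytics.csv", index=False)
```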
Now imagine you have to do this every day with tens or hundreds of millions of records of data…or even more. Check out Trillium Refine at www.trilliumsoftware.com/products/big-data.com