Over the past few weeks, I have been reviewing the big data startup space, looking for companies with key ideas in exploiting big data. During this exercise, I noticed that a number of start-ups focus on “data cleanup” tools and services. Though this is a useful category of tools (similar to ETL tools in the world of data warehousing), it raised the following questions. Given the theme underpinning big data efforts – the need to obtain new “generalizable” insights from rows upon rows of “instance level” data – is it really that important to “clean up” your data? Why or why not? What types of data “cleaning” tasks are worthwhile? Continuing my series of big data related posts (I, II, III), here are my current thoughts on this topic.
For the purposes of this discussion, let us assume that we have collated our “data sources” into a canonical table: each row is an “instance” of the data element at hand, and each column is an attribute (property) of that instance. For a customer (a row), the properties might be address, age, gender, payment info, history (things bought, returned), and so on. In such a conceptual model, what are the issues that may require “data cleaning”? (A short sketch of such checks follows the list below.)
- fixing column types (should hex IDs of customers be stored as strings or as longs?)
- fixing each entry in a cell (row/column) to conform to its type (removing errors introduced while copying from the source, such as age columns with text in them, or artifacts of data modeling, such as ASCII text in varchar columns that should be converted to Unicode)
- fixing each entry in a cell to correlate “properly” with entries in other cells of the “same” row (for example, checking for age entries of less than 10 years, transaction dates before the company was set up, or that the individual was alive when the transaction happened)
- reconciling with events in the “real world”, i.e., that the recorded event actually happened (when an event is recorded, make sure that “corroboratory” events also happened, by bringing the additional data in or linking to it virtually)
- proposing/adding new columns if warranted (if the customer bought a high-end item, did they also buy insurance? Add a “bought insurance” (yes/no) column; or, if data exists as unstructured blob types (images, text, audio, video), add columns that encode derived properties)
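To make the list concrete, here is a minimal sketch, in Python with pandas, of what such checks might look like. The column names (customer_id, transaction_date, age, item_price, insurance_sku) and the founding-date constant are hypothetical, chosen only to illustrate the kinds of fixes listed above.

```python
import pandas as pd

COMPANY_FOUNDED = pd.Timestamp("2005-01-01")  # hypothetical founding date

def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # 1. Fix column types: keep hex customer IDs as strings, parse dates.
    df["customer_id"] = df["customer_id"].astype(str)
    df["transaction_date"] = pd.to_datetime(df["transaction_date"], errors="coerce")

    # 2. Conform cells to type: non-numeric ages become NaN instead of stray text.
    df["age"] = pd.to_numeric(df["age"], errors="coerce")

    # 3. Cross-column sanity checks within a row.
    df["age"] = df["age"].mask(df["age"] < 10)  # implausible ages -> NaN
    df["transaction_date"] = df["transaction_date"].mask(
        df["transaction_date"] < COMPANY_FOUNDED, pd.NaT  # pre-founding dates -> NaT
    )

    # 4. Derived column: did a high-end purchase come with insurance?
    df["bought_insurance"] = (df["item_price"] > 1000) & df["insurance_sku"].notna()

    return df
```

The specific rules do not matter; the point is that each step maps to one of the items above: type fixing, per-cell conformance, cross-column consistency, and derived columns.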
Once the data is reasonably organized, two types of “insights” may potentially be obtained: row-specific insights and “inter-column” insights. An example of a “row-specific” or instance-specific insight is detecting fraudulent behavior by a customer: every time they buy something, they return it within two weeks. Here, for a single customer, we collect all the transaction dates and return dates and “flag” the customer if the frequency of occurrence is beyond a threshold or significantly beyond the norm. We fixate on a particular row identifier and characterize its behavior. An “inter-column” observation: for many customers whose credit card is used beyond a certain radius of their default geographical location, many such transactions are identified as fraudulent after the event. One of the generalizations from such event histories is: if a card is used overseas, block its use (because it may have been stolen!). In a column-type insight, we characterize values in a column and attempt to relate them to values in other columns. In every domain, we can potentially identify a number of such insights and develop approaches to detect them. However, establishing row-type insights is much less stringent than establishing column-type insights. In a row-type insight, we try to find instances that meet a criterion or satisfy a known piece of knowledge. In a column-type insight, we try to discover a new piece of knowledge.
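As a rough sketch (hypothetical column names and thresholds, pandas assumed), the first function below fixates on a row identifier and characterizes its behavior, while the second relates values in one column to values in another across all rows:

```python
import pandas as pd

def row_specific_flags(tx: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Flag customers who return most of what they buy within two weeks."""
    tx = tx.copy()
    days_to_return = (tx["return_date"] - tx["purchase_date"]).dt.days
    tx["quick_return"] = days_to_return.between(0, 14)
    per_customer = tx.groupby("customer_id")["quick_return"].mean()
    return per_customer[per_customer > threshold]

def inter_column_signal(tx: pd.DataFrame) -> pd.DataFrame:
    """Compare fraud rates for transactions near vs. far from the home location."""
    far_away = tx["distance_from_home_km"] > 500  # assumed radius
    return pd.crosstab(far_away, tx["is_fraud"], normalize="index")
```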
What happens to these two types of insights if data is missing or censored? Both types are quite robust to missing data, as long as we continue to capture new incoming data properly. Even though we may miss some instance-specific data, if the new data reflects the behavior, it will be detected in the near future. For example, in the example above, if a customer’s fraudulent behavior falls below the threshold (because of erroneous data), it may not be detected the first time around, but possibly a few months later. It may also be possible to flag a customer whose behavior is within a delta of the threshold and put them on a watchlist of sorts! Depending on the behavior, one can also “impute” data: for example, if such behavior has been observed in customers in a certain income range and the customer under review lies in that income range, we can still potentially flag fraudulent behavior even though the signal is below the threshold. Heuristics and other techniques are applicable here.
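A minimal sketch of the watchlist idea, assuming the per-customer quick-return rates from the earlier example and hypothetical threshold/delta values:

```python
def watchlist(per_customer_rate, threshold: float = 0.8, delta: float = 0.1):
    """Customers within a delta below the flagging threshold are tracked, not flagged."""
    near = per_customer_rate[
        (per_customer_rate > threshold - delta) & (per_customer_rate <= threshold)
    ]
    return near.index.tolist()
```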
Inter-column insights/relations/correlations are potentially even more robust, considering that these are more “general” results. Think about it: assuming the story were true, would it have mattered if, instead of an apple, a pear or a plum or a banana had fallen on Newton’s head (or any individual’s head), or if, instead of dropping from a tree, it had been thrown by somebody or dropped by a squirrel? If the phenomenon/event is going to occur again in the near future, it will be captured and the generalization detected. In the fraud example in the previous paragraph, the link between fraud and geography is established across a number of individuals (and is not row-constrained). Missing data matters only if fraud is a very “small-number” event. In such cases, capturing every such event matters: we need to build up enough positive events to detect the signal from the noise. The key is to understand whether “cleaning” data adds to the frequency of occurrence of such events. Cleaning data cannot add new events if the underlying phenomenon was not captured in the first place (if there is no fraud (yes/no) column, we cannot infer anything)! At web scale, if the phenomenon is bound to recur in the near future, it may not be worthwhile cleaning data.
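A back-of-the-envelope sketch (all rates hypothetical) of why rarity, rather than dirtiness, is usually the binding constraint: the number of rows needed to expect a given count of positive events grows with the inverse of the event rate, and losing a fraction of those events to missing data stretches it further.

```python
def rows_needed(p: float, k: int, missing_fraction: float = 0.0) -> int:
    """Rows needed to expect k positive events at rate p, given a lost fraction."""
    effective_rate = p * (1.0 - missing_fraction)
    return int(k / effective_rate)

print(rows_needed(p=0.01, k=100))      # common event: 10,000 rows suffice
print(rows_needed(p=0.00001, k=100))   # rare event: 10,000,000 rows
print(rows_needed(p=0.00001, k=100, missing_fraction=0.3))  # lost events push it higher still
```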
The more I ruminate, and based on the work done thus far, it seems cost-effective to start with whatever “clean” data you have (missing data, or leaving out erroneous data, functions as an implicit Occam’s razor) and then add incremental complexity to your big data model. You only pay for what is worthwhile in terms of analytics resources (compute, storage, data scientists), and you identify the proprietary insights that are actionable and benefit your business, rather than building something more cumbersome. Cleaning data for inferring relationships may not be worth the effort unless you are in the record-keeping business; the world is too noisy and fast-changing for it to be meaningful. Data warehousing of transactional events is important to keep a record of things your organization did or did not do. However, applying that same stringency to model building may not be necessary; the focus of data capture there is to enable model building in the presence of error. Furthermore, cleaning up past data may not be meaningful: the knowledge thus gleaned may not be relevant now (or in the future, though we may say “history repeats itself”), and it may introduce erroneous inferences and distort our understanding in a dynamic environment. Cleaned data may also fail to reflect reality (you have distorted your record of what may actually be happening in the real world) and can introduce ghost artifacts. Overall, cleaning data should be undertaken on a large scale only when one has some notion of the potential information gain/utility (at a row-specific or inter-column level), measurable in some manner, given that the world is inherently noisy (this is an interesting optimization problem). Anyway, I will have more on this topic in due course as I continue to work on some interesting projects in multiple domains.