Continuing the vein of thought outlined in my earlier post on Big Data, the key issue to have in mind before jumping into big data is: Are you collecting/analyzing data to build/improve a theory, or do you have a theory that guides what data to collect so as to validate the theory or its predictions? Does the data come first and theory later, or vice versa? IMHO, as a practitioner, you need to go back and forth between the two views, as they are two sides of the same coin – one guides the other. Either view can be the starting point. Without Tycho Brahe's observational data (which later yielded the Rudolphine Tables), Kepler could not have formulated his laws. However, Tycho Brahe was guided in his astronomical observations (what data to collect/tabulate?) by the goal to disprove/modify/improve the Ptolemaic/Copernican theories of the solar system. (As an anecdote, a recent data source in the field of medicine is named Tycho.) However, one has to be careful about the context of the claims being made – is the result of the big data analysis exercise an improvement to current (or extant) theory (science – new theory) or to practice (engineering – better calculation/accuracy/analysis in a specific context)? This depends on the "generality" of the claim. These issues have been well discussed in the following resources (and they are worth the time spent reading, including the comment streams!):
- The Chomsky/Norvig discussions a) and b). Though linguistics-centric, the discussions do address the big picture.
- The Fourth Paradigm – the book in memory of Jim Gray – discusses the interplay of big data and science.
- Beautiful Data – explores applications of big data.
The data-versus-theory issue has also been discussed in the context of science and causality – Feynman provides a viewpoint in his lectures on Gravitation (see Section 7.7 – What is Gravitation?). Sometimes we do not need the full machinery, as long as we can do more with the appropriate abstraction.
This question of why one is pursuing big data has to be clear a priori, especially given the huge investments required in terms of big data skills, tools and other resources. In a way, every organization is going to have to set up an in-house applied R&D group. In the context of businesses, it is worthwhile to ask: are we doing enough with the "small" data that we have? As outlined in this recent HBR article, businesses can go a long way with whatever tech investments they have made thus far.
Wearing the data scientist's hat, how do you evaluate whether a given problem is a worthwhile "big data" problem? Here are some key questions to guide the thought process:
- What is the basic hypothesis for which analyzing more data will help? (For example: customer "shopping" behaviors are similar, so we collect enough data to validate this hypothesis and then predict a new, unseen customer's shopping behavior.) Do you really need more data to make a decision? (Consider the classic newsvendor, or newsboy, problem from OR.) Maybe myopic rules work well enough in most dynamic situations!
- Is the underlying phenomenon that generates this data stable and repeatable? Or is the data the amalgamation of multiple "generative" processes? (For example, traffic patterns during Thanksgiving are a result of more vehicles, heterogeneous drivers and a mixed vehicle fleet (old cars, new cars, trucks).) Can you decompose the final observed pattern into its causal components (the behaviors of each constituent mentioned above)? What are the units of analysis for which you need to collect data – longitudinally/temporally, spatially or otherwise? Do you have the data in-house? What needs to be acquired from other parties and validated? What needs to be collected in situ?
- Is the theory that guided the "data collection" well developed? What was the objective behind the data collection effort that built the primary data source? (For example, census data is collected to know the number of people living in a country at a certain point in time. By the time the data is collected, people are born and die, so what percentage error does this process introduce into your actual estimate at a later date? We can use this census data to "derive" a number of other conclusions – such as growth rates in different age groups.) Furthermore, the objective introduces "biases" implicit in the data collection processes (issues such as over-generalization, loss of granularity, etc.).
- What is the effect of considering more data on the quality of the solution (possibly a prediction)? Does it improve or degrade the solution? How do we ensure that we are not doing all the hard work for no "real" benefit (though we may define metrics that show benefit!)? Censoring the data is critical, so that one analyzes phenomena in the appropriate "regimes".
- How do you capture the domain knowledge about the processes behind the “data”? Is it one person or a multi-disciplinary team? How do real processes in the domain affect the evolution of the data?
- What does it mean to integrate multiple sources of data from within the domain and across domains? If across domains, what "valid" generalizations can one make? These are your potential hypotheses. If conclusions are drawn, what are the implications in each source domain? What are the relationships between the different "elements" of the data? Are the relationships stable? Do they change with time? When we integrate multiple sources of data, what are the boundaries – in representation, in semantics? How do we deal with duplicates, handle substitutes, and reconcile inconsistencies in source data or inferences? Some of these issues have been studied in the statistical context of missing data – under the name of multiple imputation.
- What do summaries of the data tell you about the phenomena? Does the data trace collected cover all aspects of the potential data space, or is it restricted to one sub-space of the phenomena? How do you ensure you have collected (or touched) all aspects of this multi-dimensional data space? Are current resources enough? How do you know you have all the data required (even in terms of data fields)? What more will you need?
- Overall, during the datafication phase, what are the goals of the analysis? Are the goals aligned or in conflict? What kind of predictors are you looking for? How will one measure the quality of the results? Can we improve incrementally over time?
- How many records are there to start with? (A record, for our purposes, is a basic unit of "data", possibly with multiple attributes – a row in a database table.) What is changing about these records? How quickly? Is the change due to time? Space? A combination of the two? Other extraneous/domain-specific processes (such as a human user, a sensor or an active agent)?
- When combining data from multiple sources, how do you "normalize" the data? Its semantics? How do you reconcile values/attributes and test for consistency?
- How do you capture the data? What new tools do you need? Is the "sampling" of the data enough? (When you discretize the analog world, understanding the effects of doing so is essential.) For example, when you sample some time-varying phenomenon at 1-second intervals, you cannot say anything intelligent about phenomena that have a lifetime of less than 1 second and fall between two sample points (a second apart). Similarly, the census is collected every ten years. Are there population phenomena in the intervening decade that we miss? Do we even know what these are?
- The engineering questions – Hadoop/MapReduce or regular DB alternatives? Home-grown tools? Batch mode versus real time? Visualization (what to visualize and how?)? Rate of data growth? What data needs to be stored versus discarded? How do we merge/update new data and the dependent inferences? How do we integrate the resultant actions (in a feedback loop)? How do you determine whether the feedback is positive or negative?
- How do we make sure the overall analysis is moving in the right direction? We need to ensure that the phenomenon under study is not changing by the time we analyze the data. How do we incrementally validate/verify the results/predictions? How do we checkpoint periodically?
- What is the overall cost/benefit ratio of the whole analysis exercise? Is the benefit economically exploitable in a sustained manner?
- How do we hedge and use the intermediary results for other “monetization” activities?
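To make the newsvendor question above concrete, here is a minimal sketch of the classic critical-fractile solution. The prices and demand parameters are made up, and I assume normally distributed demand purely for illustration – the point is that the stocking decision needs only a demand distribution, not ever more data:

```python
from statistics import NormalDist

def newsvendor_quantity(price, cost, mu, sigma):
    """Optimal stocking quantity for the classic newsvendor model.

    The critical fractile (price - cost) / price is the service level
    at which expected underage and overage costs balance.
    Assumes demand ~ Normal(mu, sigma) -- an illustrative choice.
    """
    critical_fractile = (price - cost) / price
    return mu + sigma * NormalDist().inv_cdf(critical_fractile)

# Hypothetical numbers: papers sell for $1.00, cost $0.40, demand ~ N(100, 15).
q = newsvendor_quantity(price=1.00, cost=0.40, mu=100, sigma=15)
```

With a 60% critical fractile, the rule stocks a little above mean demand. More data sharpens the estimates of mu and sigma, but the decision rule itself does not change – which is exactly the "do you really need more data?" question.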
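The census example above can be sketched as back-of-the-envelope arithmetic. The birth/death rates and time gap below are hypothetical, purely to illustrate the kind of error estimate one should attempt before trusting a stale snapshot:

```python
def census_drift_error(count, birth_rate, death_rate, years):
    """Estimate the percentage error introduced by population change
    between a census snapshot and a later use of the figure.

    birth_rate and death_rate are annual rates per person, assumed
    constant -- itself an approximation.
    """
    net_rate = birth_rate - death_rate
    later = count * (1 + net_rate) ** years
    return 100.0 * (later - count) / count

# Hypothetical: 1.2% births, 0.8% deaths per year, a 5-year gap.
err = census_drift_error(1_000_000, 0.012, 0.008, 5)
```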
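The question of whether more data actually improves the solution can be probed with a small Monte Carlo sketch (a toy uniform population, not any real dataset): the error of the sample mean shrinks roughly as 1/sqrt(n), so each additional order of magnitude of data buys progressively less accuracy:

```python
import random
import statistics

def estimate_error(n, trials=200, seed=1):
    """Average absolute error of the sample mean of n draws from a
    uniform(0, 1) population (true mean 0.5), over several trials.
    Illustrates diminishing returns from more data.
    """
    rng = random.Random(seed)
    errs = []
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        errs.append(abs(statistics.mean(xs) - 0.5))
    return statistics.mean(errs)

small, large = estimate_error(10), estimate_error(1000)  # large < small
```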
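As a toy illustration of the multiple-imputation idea mentioned above – a deliberately simplified hot-deck variant, not a production method – fill each missing entry repeatedly with random draws from the observed values and pool the per-imputation estimates:

```python
import random
import statistics

def multiple_imputation_mean(values, n_imputations=20, seed=0):
    """Toy multiple-imputation sketch: estimate the mean of a column
    containing missing entries (None) by repeatedly filling each gap
    with a random draw from the observed values, then pooling
    (averaging) the per-imputation estimates.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    estimates = []
    for _ in range(n_imputations):
        filled = [v if v is not None else rng.choice(observed) for v in values]
        estimates.append(statistics.mean(filled))
    return statistics.mean(estimates)

data = [10, None, 12, 11, None, 13]  # made-up column with gaps
est = multiple_imputation_mean(data)
```

Real multiple imputation draws from a model of the missing values and also pools the variance across imputations; this sketch only conveys the "impute many times, then pool" shape of the idea.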
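The normalization question can be sketched as mapping source-specific field names and units onto a shared schema. The field names and conversion factors below are purely illustrative:

```python
def normalize_record(record, field_map, unit_scale):
    """Map source-specific field names onto a shared schema and convert
    units. field_map: source name -> canonical name; unit_scale:
    canonical name -> multiplier into the canonical unit.
    All names here are illustrative, not from any real system.
    """
    out = {}
    for src_key, value in record.items():
        key = field_map.get(src_key, src_key)
        if isinstance(value, (int, float)):
            out[key] = value * unit_scale.get(key, 1)
        else:
            out[key] = value
    return out

# Two hypothetical sources report weight in different units and names.
a = normalize_record({"wt_lb": 10.0}, {"wt_lb": "weight_kg"}, {"weight_kg": 0.4536})
b = normalize_record({"weightKg": 4.5}, {"weightKg": "weight_kg"}, {})
```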
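The sampling caveat can be demonstrated in a few lines – a hypothetical 0.3-second event is invisible at a 1-second sampling interval but visible at 0.1 seconds:

```python
def sample(signal, duration, interval):
    """Sample signal(t) at fixed intervals over [0, duration)."""
    t, out = 0.0, []
    while t < duration:
        out.append(signal(t))
        t += interval
    return out

def pulse(t):
    # A made-up 0.3-second event between t = 2.1 and t = 2.4.
    return 1.0 if 2.1 <= t <= 2.4 else 0.0

coarse = sample(pulse, duration=10.0, interval=1.0)  # misses the event entirely
fine = sample(pulse, duration=10.0, interval=0.1)    # catches it
```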
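And for the engineering bullet: the map-reduce programming model that Hadoop popularized reduces, in miniature, to a pair of functions. This is a single-process sketch of word counting, with the "shuffle" grouping folded into the reducer:

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    """Mapper: emit (word, 1) pairs for one chunk of text."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reducer: sum counts per key; in a real cluster the framework
    groups pairs by key between the two phases."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

chunks = ["big data big", "data theory"]  # stand-ins for file splits
result = reduce_phase(chain.from_iterable(map_phase(c) for c in chunks))
```

The engineering decision is then whether your data volume and growth rate justify distributing exactly this pattern across a cluster, or whether a regular database aggregate does the job.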
In the ensuing posts, I will discuss the overall process in the context of a project I am currently working on. Overall, investing in big data requires a good bit of upfront thought – or else every organization will end up with large R&D expenses without any real ROI!