Big Data – Ponderings of a Practitioner – IV

Over the past few weeks, I have been reviewing the big data startup space, looking for companies with key ideas for exploiting big data. During this exercise, I noticed that a number of start-ups focus on “data cleanup” tools and services. Though these are a useful category of tools (similar to ETL tools in the world of data warehousing), they raised the following questions. Given the theme underpinning big data efforts – the need to obtain new “generalizable” insights from rows upon rows of “instance level” data – is it really that important to “clean up” your data? Why or why not? What types of data “cleaning” tasks are worthwhile? Continuing my series of big data related posts (I, II, III), here are my current thoughts on this topic.

For purposes of this discussion, let us assume that we have collated our “data sources” into a canonical table: each row is an “instance” of the data element at hand, and each column is an attribute (property) of that instance. For a customer (a row), the properties might be address, age, gender, payment info, history (things bought, returned), etc. In such a conceptual model, what are the issues that may require “data cleaning”?

  • fixing column types (should hex ids of customers be stored as strings or as longs?)
  • fixing each entry in a cell (row/column) to conform to its type (removing errors introduced while copying from the source – age columns with text in them – and artifacts of data modeling, such as converting ASCII text in varchar columns to unicode)
  • fixing each entry in a cell to correlate “properly” with entries in other cells of the “same” row (for example, checking that the age column has no entries under 10 years old, that transaction dates do not precede the date the company was set up, and that the individual was alive when the transaction happened)
  • reconciling with events of the “real world” – that the recorded event actually did happen (when an event is recorded, make sure that “corroboratory” events did happen, by bringing the additional data in or linking to it virtually)
  • proposing/adding new column types if warranted (if the customer bought a high-end item, did they also buy insurance? – add a column for bought insurance (yes/no); or, if data exists as unstructured blob types (images, text, audio, video), add columns to encode derived properties)
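
As a rough illustration of the first few tasks in the list, here is a minimal sketch in Python/pandas. The column names (customer_id, age, txn_date, item_price, bought_insurance) and the founding-date constant are hypothetical, not from any particular system.

```python
import pandas as pd

COMPANY_FOUNDED = pd.Timestamp("2005-01-01")  # assumed founding date

def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of basic per-column and cross-column cleanup on a customer table."""
    df = df.copy()

    # 1. Fix column types: hex customer ids kept as strings, ages coerced to numbers.
    df["customer_id"] = df["customer_id"].astype("string")
    df["age"] = pd.to_numeric(df["age"], errors="coerce")            # text in age column -> NaN
    df["txn_date"] = pd.to_datetime(df["txn_date"], errors="coerce")

    # 2. Cross-column consistency: flag rather than silently drop suspicious rows.
    df["suspect_age"] = df["age"].lt(10) | df["age"].gt(120)
    df["suspect_date"] = df["txn_date"].lt(COMPANY_FOUNDED)

    # 3. Derived column (e.g., did a high-end purchase come without insurance?).
    if {"item_price", "bought_insurance"}.issubset(df.columns):
        df["high_end_no_insurance"] = df["item_price"].gt(1000) & ~df["bought_insurance"]

    return df
```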

Once the data is reasonably organized, two types of “insights” may potentially be obtained: row-specific insights and “inter-column” insights. An example of a “row-specific” or instance-specific insight is detecting fraudulent behavior by a customer – every time they buy something, they return it within two weeks. Here, for a single customer, we collect all the transaction dates and return dates and “flag” the customer if the frequency of occurrence is beyond a threshold, or significantly beyond the norm. We fixate on a particular row identifier and characterize its behavior. An “inter-column” observation: for many customers whose credit card is used beyond a certain radius of their default geographical location, many such transactions are identified as fraudulent after the event. One of the generalizations from such event histories is: if a card is used overseas, block its use (because it may have been stolen!). In a column-type insight, we characterize values in a column and attempt to relate them to values in other columns. In every domain, we can potentially identify a number of such insights and develop approaches to detect them. However, establishing row-type insights is much less stringent than establishing column-type insights. In a row-type insight, we try to find instances that meet a criterion or satisfy a known piece of knowledge. In a column-type insight, we try to discover a new piece of knowledge.
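
A minimal sketch of the row-specific check described above: flag customers who return most purchases within two weeks. The column names and the 0.5 threshold are assumptions for illustration, not taken from any real system.

```python
import pandas as pd

RETURN_RATE_THRESHOLD = 0.5   # assumed cutoff; in practice set from the norm for the population
RETURN_WINDOW_DAYS = 14       # "returned within two weeks"

def flag_frequent_returners(txns: pd.DataFrame) -> pd.DataFrame:
    """txns has one row per purchase: customer_id, purchase_date, return_date (NaT if kept)."""
    txns = txns.copy()
    days_to_return = (txns["return_date"] - txns["purchase_date"]).dt.days
    txns["quick_return"] = days_to_return.le(RETURN_WINDOW_DAYS)   # NaT -> not a quick return

    per_customer = txns.groupby("customer_id").agg(
        purchases=("purchase_date", "count"),
        quick_returns=("quick_return", "sum"),
    )
    per_customer["return_rate"] = per_customer["quick_returns"] / per_customer["purchases"]
    per_customer["flagged"] = per_customer["return_rate"].gt(RETURN_RATE_THRESHOLD)
    return per_customer
```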

What happens to these two types of insights if data is missing or censored? Both types are quite robust to missing data, as long as we continue to capture new incoming data properly. Even though we may miss some instance-specific data, if the new data reflects the behavior, it will be detected in the near future. For example, in the example above, if a customer’s fraudulent behavior is below the threshold (because of erroneous data), it may not be detected the first time around, but possibly a few months later. It may also be possible to flag a customer whose behavior is within a delta of the threshold and put them on a watchlist of sorts! Depending on the behavior, one can also “impute” data: for example, if such behavior has been observed in customers in a certain income range, and the customer under review lies in that income range, we can still potentially flag fraudulent behavior even though the signal is below the threshold. Heuristics and other techniques are applicable here.
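
Continuing the previous sketch, one hedged way to implement the watchlist and the crude income-band “imputation” mentioned above. The delta, the peer-group labels and the join keys are all invented for illustration.

```python
RETURN_RATE_THRESHOLD = 0.5           # same cutoff as in the previous sketch
WATCHLIST_DELTA = 0.1                 # assumed: within 0.1 of the threshold -> watchlist
HIGH_RISK_INCOME_BANDS = {"50-75k"}   # hypothetical peer groups with observed fraud patterns

def triage(per_customer, income_band):
    """per_customer: output of flag_frequent_returners; income_band: Series indexed by customer_id."""
    out = per_customer.join(income_band.rename("income_band"))

    near = out["return_rate"].between(
        RETURN_RATE_THRESHOLD - WATCHLIST_DELTA, RETURN_RATE_THRESHOLD
    )
    out["watchlist"] = near & ~out["flagged"]

    # Crude "imputation": a below-threshold signal in a high-risk peer group is still flagged.
    peer_risk = out["income_band"].isin(HIGH_RISK_INCOME_BANDS)
    out["flagged"] = out["flagged"] | (near & peer_risk)
    return out
```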

Inter-column insights/relations/correlations are potentially more robust, considering that these are more “general” results. Think about it: assuming the story were true, would it have mattered if, instead of an apple, a pear or a plum or a banana fell on Newton’s head (or, for that matter, on any individual’s head, or if, instead of dropping from a tree, it was thrown by somebody or dropped by a squirrel)? If the phenomenon/event were to occur in the near future, it will be captured and the generalization detected. In the fraud example in the previous paragraph, the link between fraud and geography is established across a number of individuals (and is not row-constrained). Missing data matters only if fraud is a very “small-number” event. In such cases, capturing every such event matters – build up enough positive events to detect signals from noise. The key is to understand whether “cleaning” data adds to the frequency of occurrence of such events. Cleaning data cannot add new events if the underlying phenomenon was not captured (if there is no fraud (yes/no) column, we cannot infer anything)! At web scale, if the phenomenon is bound to occur in the near future, it may not be worthwhile cleaning data.
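
A rough sketch of how the inter-column observation might be checked: compare the after-the-fact fraud rate for transactions inside versus outside a radius around the customer’s default location. The haversine helper, the 500 km radius and the column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

RADIUS_KM = 500.0        # assumed cutoff around the customer's default location
EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def fraud_rate_by_distance(txns: pd.DataFrame) -> pd.Series:
    """txns: one row per transaction with home_lat/home_lon, txn_lat/txn_lon, is_fraud (bool)."""
    dist = haversine_km(txns["home_lat"], txns["home_lon"],
                        txns["txn_lat"], txns["txn_lon"])
    far = dist.gt(RADIUS_KM)
    # Fraud rate within vs. beyond the radius; a large gap suggests a usable generalization.
    return txns.groupby(far)["is_fraud"].mean().rename_axis("beyond_radius")
```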

The more I ruminate, and based on work done thus far, it seems cost-effective and a good approach to start with whatever “clean data” you have (missing data, or leaving out erroneous data, functions as an implicit Occam’s razor) and then add incremental complexity to your big data model. You only pay for what is worthwhile (in terms of analytics resources – compute/storage/data scientists) and also identify the proprietary insights that are beneficial for your business and actionable, rather than something more cumbersome. Cleaning data for inferring relationships may not be worth the effort unless you are in the record-keeping business – the world is too noisy and fast-changing to make it meaningful. Data warehousing transactional events is important to keep a record of things that your organization did or did not do. However, that same stringency may not be necessary for model building – the focus of data capture there is to enable model building with error. Furthermore, cleaning up past data may not be meaningful, because the knowledge thus gleaned may not be relevant now (or in the future, though we may say “history repeats itself”); it may introduce erroneous inferences and distort our understanding in a dynamic environment. Also, cleaned data may not reflect reality (you have distorted your record of what may actually be happening in the real world) and may introduce ghost artifacts. Overall, cleaning data should be undertaken on a large scale provided one has some notion of the potential information gain/utility (at a row-specific level or inter-column level), measurable in some manner (consider that the world is inherently noisy) – this is an interesting optimization problem. Anyway, I will have more on this topic in due course as I continue to work on some interesting projects in multiple domains.

 

Big Data – Ponderings of a Practitioner – III

My earlier posts on this topic have taken a top down view – from model building to implementation. In this post, I attempt to understand the bottom up view of this world – the vendor landscape and tools in this space. Good summaries of this view are provided in:

a) Stonebraker’s post on Big Data – this takes the view that everything is data: it is large in volume (volume), coming at great rates in real time (velocity), and it is of different “types”/formats/semantics (variety). How do we support the conventional CRUD operations, and more, on this ever-increasing dataset?
b) The vendor landscape as provided in a), b), and a nuanced view of b) in c). In a), the big data world is seen as a continuation of the conventional database evolution of the past three decades, extended to include unstructured and streaming data, video, images and audio. b) and c) view it from the positioning of different “tech” buckets, each focused on “improving” some aspect of an implementation.

c) The analytics services view: every worthwhile real-world application has some or all of the pieces of this architecture. One can pick a variety of tools for each component (open source or proprietary), jig them together in different ways, and use off-the-shelf tools such as R and SAS and more to analyze the data.

As I review the tools in this space, it is important to understand that these vendors’ value proposition is not to solve your “big data” problem – the one relevant to your business – but to sell tools. Only after resolving the issues from a top-down perspective can one constrain the technology choices and evolve the final solution incrementally. Vendors do not know your domain or your final application, so they cannot be held responsible. There are startups evolving in this space adopting either the horizontal tech view (db tool, visualization tool, selling data) or the vertical view – solving a specific problem in a vertical, say marketing/advertising/wellness etc. For example, Google is a big data company dealing with advertising at scale (vertically oriented) – they built the big data toolkit to solve a vertical problem. Amazon’s recommender system is another application of big data at scale – for books (and later extended to other products).

Betting on a vertically oriented view has better odds, since the key to getting value out of big data is “model building”. Model-free approaches to big data – free-ranging analyses of data, tech investments without a well-bounded/specific purpose – are more or less bound to fail. Worthwhile/reliable models do not emerge out of the blue in any domain – they require work. The business advantage is that you get to exploit the benefits of the model until someone else figures it out, which is how all science/tech works. So the key question is: how does a tech leader do an “evaluative” project that provides some guidance on big data investments given limited resources? I will have some thoughts on this in future posts on this topic.

Big Data – Ponderings of a Practitioner – II

Continuing the vein of thought outlined in my earlier post on Big Data, the key issue to have in mind before jumping into big data is: are you collecting/analyzing data to build/improve a theory, or do you have a theory that is guiding what data to collect so as to validate the theory or its predictions? Does the data come first and the theory later, or vice versa? IMHO, as a practitioner, you need to go back and forth between the two views, as they are two sides of the same coin – one guides the other. Either view can be the starting point. Without Tycho Brahe’s data (the Rudolphine Tables), Kepler could not have formulated his laws. However, Tycho Brahe was guided in his astronomical observations (what data to collect/tabulate?) by the goal to disprove/modify/improve the Ptolemaic/Copernican theory of the solar system. (As an anecdote, a recent data source in the field of medicine is named Tycho.) However, one has to be careful about the context of the claims being made: is the result of the big data analysis exercise an improvement to current (or extant) theory (science – new theory) or to practice (engineering – better calculation/accuracy/analysis in a specific context)? This depends on the “generality” of the claim. These issues have been well discussed in the following resources (and they are worth the time spent reading, including the comment streams!):

  • The Chomsky/Norvig discussions a) and b). Though linguistics-centric, the discussions do address the big picture.
  • The Fourth Paradigm – the book in memory of Jim Gray – discusses the interplay of big data and science
  • Beautiful Data – explores applications of big data

The data-versus-theory issues have also been discussed in the context of science and causality – Feynman provides a viewpoint in his lectures on Gravitation (see Section 7.7 – What is Gravitation?). Sometimes we do not need the underlying machinery, as long as we can do more with the appropriate abstraction.

The question of why one is pursuing big data has to be clear a priori, especially given the huge investments required in terms of big data skills, tools and other resources. In a way, every organization is going to have to set up an in-house applied R&D group. In the context of businesses, it is worthwhile to ask: are we doing enough with the “small” data that we have? As outlined in this recent HBR article, businesses can go a long way with whatever tech investments they have made thus far.

Wearing the data scientist’s hat, how do you evaluate whether a given problem is a worthwhile “big data” problem? Here are some key questions to ask to guide the thought process:

Model/Data Centric:

  • What is the basic hypothesis for which analyzing more data will help? (For example: customer “shopping” behaviors are similar, so we collect enough data to validate this hypothesis and then predict a new, unseen customer’s shopping behavior.) Do you really need more data to make a decision? (Consider the classic newsboy problem from OR – a sketch follows this list.) Maybe myopic rules work well enough in most dynamic situations!
  • Is the underlying phenomenon that generates this data stable and repeatable? Or is the data the amalgamation of multiple “generative” processes? (For example, traffic patterns during Thanksgiving are a result of more vehicles, heterogeneous drivers and the vehicle mix (old cars, new cars, trucks).) Can you break the final observed pattern into its causal components (the behaviors of each constituent mentioned above)? What are the units of analysis for which you need to collect data – longitudinally/temporally, spatially or otherwise? Do you have the data in-house? What needs to be acquired from other parties and validated? What needs to be collected in situ?
  • Is the theory that guided the “data collection” well developed? What was the objective behind the data collection effort that built the primary data source? (For example, census data is collected to know the number of people living in a country at a certain point in time. By the time the data is collected, people have been born and have died, so what percentage error does this process introduce into your actual estimate at a later date? We can use this census data to “derive” a number of other conclusions, such as growth rates in different age groups.) Furthermore, the objective introduces “biases” implicit in the data collection processes (issues such as over-generalization, loss of granularity etc.)
  • What is the effect of considering more data on the quality of the solution (possibly a prediction)? Does it improve or degrade the solution? How do we ensure that we are not doing all the hard work for no “real” benefit (though we may define metrics that show benefit!)? Censoring the data is critical so that one analyzes phenomena in the appropriate “regimes”.
  • How do you capture the domain knowledge about the processes behind the “data”? Is it one person or a multi-disciplinary team? How do real processes in the domain affect the evolution of the data?
  • What does it mean to integrate multiple sources of data from within the domain and across domains? If across domains, what “valid” generalizations can one make? These are your potential hypotheses. If conclusions are drawn, what are the implications in each source domain? What are the relationships between the different “elements” of the data? Are the relationships stable? Do they change with time? When we integrate multiple sources of data, what are the boundaries – in representation, in semantics? How do we deal with duplicates, handle substitutes, and reconcile inconsistencies in source data or inferences? These issues have been studied in the statistical context, under the heading of multiple imputation.
  • What do summaries of the data tell you about the phenomena? Does the data trace collected cover all aspects of the potential data space, or is it restricted to one sub-space of the phenomena? How do you ensure you have collected (or touched) all aspects of this multi-dimensional data space? Are current resources enough? How do you know you have all the data required (even in terms of data fields)? What more will you need?
  • Overall, during the datafication phase, what are the goal(s) of the analysis? Are the goals aligned or in conflict? What kind of predictors are you looking for? How will one measure the quality of results? Can we improve incrementally over time?
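
As an aside on the newsboy (newsvendor) problem mentioned in the first bullet above: the classical result needs only a demand distribution and two unit costs, not more data. A minimal sketch, assuming normally distributed demand and invented costs (scipy is assumed to be available):

```python
from scipy.stats import norm

# Assumed, illustrative numbers.
unit_cost, unit_price, salvage = 2.0, 5.0, 0.5
demand_mean, demand_std = 1000.0, 200.0

# Critical ratio: cost of understocking relative to total under- plus over-stocking cost.
c_under = unit_price - unit_cost          # profit lost per unit of unmet demand
c_over = unit_cost - salvage              # loss per unsold unit
critical_ratio = c_under / (c_under + c_over)

# Optimal order quantity is the demand quantile at the critical ratio.
q_star = norm.ppf(critical_ratio, loc=demand_mean, scale=demand_std)
print(f"critical ratio = {critical_ratio:.2f}, order about {q_star:.0f} units")
```

The point being: a closed-form rule can make the ordering decision without any “big data”.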

Computation/Resource Centric:

  • How many records are there to start with? (A record, for our purposes, is a basic unit of “data” – possibly with multiple attributes – i.e., a row in a database table.) What is changing about these records? How quickly? Is the change due to time? Space? A combination of the two? Other extraneous/domain-specific processes (such as a human user, a sensor or an active agent)?
  • When combining data from multiple sources, how do you “normalize” data? Semantics? Reconcile values/attributes? Test for consistency?
  • How do you capture the data? What new tools do you need? Is the “sampling” of the data sufficient? (When you discretize the analog world, understanding the effects of this is essential; see the sketch after this list.) For example, when you sample a temporal phenomenon at 1-second intervals, you cannot say anything intelligent about phenomena that have a lifetime of less than 1 second and fall between two sample points (a second apart). Similarly, the census is collected every ten years. Are there population phenomena in the intervening decade that we miss? Do we even know what these are?
  • The engineering questions – Hadoop/map-reduce or regular DB alternatives, home-grown tools, batch mode versus real time, visualization (what to visualize and how?), rate of data growth, what data needs to be stored versus discarded, merging/updating new data and dependent inferences. How do we integrate the resultant actions (in a feedback loop)? How do we figure out if the feedback is positive or negative?
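
On the sampling question above: a toy simulation, with invented numbers, showing that events shorter than the sampling interval can be missed entirely.

```python
import random

random.seed(0)
SAMPLE_INTERVAL = 1.0    # we observe the system once per second
HORIZON = 1000.0         # seconds

# Generate short-lived "events": (start_time, duration), many lasting well under a second.
events = [(random.uniform(0, HORIZON), random.uniform(0.05, 2.0)) for _ in range(200)]

sample_times = [i * SAMPLE_INTERVAL for i in range(int(HORIZON) + 1)]

def observed(start, duration):
    """An event is seen only if some sample time falls inside its lifetime."""
    return any(start <= t <= start + duration for t in sample_times)

missed = [e for e in events if not observed(*e)]
short_missed = [e for e in missed if e[1] < SAMPLE_INTERVAL]
print(f"{len(missed)} of {len(events)} events never observed; "
      f"{len(short_missed)} of those lasted less than the sampling interval")
```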

Process/Validation Centric:

  • How do we make sure the overall analysis is moving in the right direction? We need to ensure that the phenomenon under study is not changing by the time we analyse the data. How do we incrementally validate/verify the results/predictions? How do we checkpoint periodically?
  • What is the overall cost/benefit ratio of the whole analysis exercise? Is the benefit economically exploitable in a sustained manner?
  • How do we hedge and use the intermediary results for other “monetization” activities?

In the ensuing posts, I will discuss the overall process in the context of a project I am currently working on. Overall, investing in big data requires a good bit of upfront thought, or else every organization will have large R&D expenses without any real ROI!

Social Media related R&D problems

Over the past two and a half years, I have been working closely with social media data from Twitter, FB, Google+ and others. After building and using real-world systems that need to scale, reading a number of academic articles, and noodling with prototype systems, it is clear to me that much remains to be done. As I went through the process, I gathered a working list of worthwhile R&D problems, which I’d like to share below.

Computational Linguistics/Semantics problems

How does the “social media” language evolve? Twitter’s 140-character restriction drives a certain type of conversation. FB’s setup allows a different type of register. Tumblr is longer in nature. SMS and WhatsApp conversations are different again. How do these conversations look in personal versus professional contexts? Comment chatter on blogs and topical sites is different (with an implicit theme/context). How do these affect language (evolution of acronyms, social conventions, the back and forth, cues for follow-up etc.)? How do we detect these post facto or in real time? What is the nature of one-to-many conversations and of “conversational threads” in these different registers? How does entity recognition evolve to process social media data? How can relationships, actions and events be detected from social media data? How can sentiment/tone analysis be improved? How does each of these look in different languages? How can you generate “social media” dialogues (moving beyond chatbots and the like)? How do you do supervised learning at scale? What can be learnt via labeling through crowdsourcing, and what concepts cannot be? How do you understand, i.e. infer, “context” – what does “context” really mean? Are the current geo-labeling approaches enough? What kinds of sharing and response behaviors does one see, and how are they cued (in text)? Is behavior similar across multiple languages? Cultures? (That is, do folks “chatter” the same way in English versus Spanish, or across cultures?) How does social media chatter leverage or enable “search” behavior? What if Twitter/FB had come up before Google for search? What behavioral/opinion changes can we detect from social chatter in space and time? Public opinion versus individual opinion?

Media Type problems

How do images/video/audio interplay with text? What is the role of image recognition/object id/scene id in the context of social conversations (e.g., Instagram photos or Pinterest)? How would we organize audio comments on Pandora? Stitch images into an animation? How would we cluster “images”? What kind of fingerprinting technologies can be used to “bundle” or group things together?

User-related problems

What kind of “profile”-based inferences can one make from public social profiles and chatter? What kind of “persona” discovery is possible? How can we link identities across social channels (the entity linking problem)? How can one link offline/online “profiles”? What kind of “accounts” are spam or spurious accounts? What kind of “chatter” is spam? What kind of chatter is generated by a bot? How can we use “geo” info to know more about a user, infer geo info about a user from their chatter, or provide content to a user based on their geo info? How does one generate “communities” – by similarity amongst users on what dimension(s)?

Content-related problems

How do we organize user-generated free-form content (UGC)? By topics (clustering and the like – how good are the approaches)? How do we link UGC with professionally created content? How do we use social chatter to guide “recommendations”, and in what contexts? How do we categorize content? Generate taxonomies auto-magically, or update manually curated taxonomies at the leaves of a core structure?

Advertising Eco-system problems

What does it mean to advertise socially? Paid media attribution problems – which channel worked for which kind of content at what time for what kind of user? What should the ad copy/message look like? How does social earned media interplay with paid media advertising, say display advertising and PPC advertising? How should publishers promote content in social media? In which order? With what ad copy? Which snippet? How should they link different online/offline properties?

Others

How do we identify, track, secure and protect shared content – paid/free text, images, video (digital rights, watermarking, analytics at scale)? How does “social” interplay with TV (e.g., using Twitter for large-sample feedback on a show or ad)? Are the statistical inferences really valid (is Twitter really representative of the larger population)? How would you test for the same? What kind of surveys/experiments/probes can you run in real time to drive/guide an inference – automated design of experiments? How do we “simulate” Twitter/FB behavior (structurally and “dynamically”)? What “dynamics” can we model in such systems? What kind of “actions” can one suggest from such simulations (power-law behaviors)? And the standard problems – spam detection, de-duplication, link duplication, content segmentation, summary generation – in the context of social media chatter.

 

Big Data – Ponderings of a Practitioner – I

Having worn multiple hats related to dealing with data on a near-daily basis throughout my professional career – data modeling, schema/representation development, redesign for indexing and query performance, information extraction, feature extraction for machine learning etc. – I have been following the evolution of data modeling/analytics from just data to “big data” over the past two decades. Curious about the recent crescendo, I just finished reading two books addressed to laymen, business practitioners/CXOs, investors, TED attendees and folks looking to ride the next buzz (now that Twitter’s IPO is done!), and, just for sanity, an old paper which I had read quite a while ago.

These books are:

a) Nate Silver’s The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t. (For those interested, a quick review from an eminent statistician.)

b) Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schönberger and Kenneth Cukier

and the paper (still relevant after a decade) by Prof. Leo Breiman.

These books raised a number of related thoughts which I would like to share and follow up on in greater detail in future posts. Before getting to those, I thought I’d share a summary of the above material, which frames some of the issues in the “Big Data” movement in terms of my own paradigm:

Big data is “old wine” in a new bottle. So what’s changed? The belief is: we are producing/gathering data faster, hence we can drive the cycle of empiricism (aka the scientific method) faster. The idea is to reduce the time span to a) formulate theory, b) collect data, c) develop a model or models, d) predict/infer based on a model and test, and finally, e) revise the beliefs in a), thus leading to the next cascade of activities. Why do we want to do this? Schonberger et al. posit that we want answers/benefits now; that we believe, since there is more data, we should do better (but at what?); and that we really do not need answers to the question “why” but just “what is” (in which case, what does it mean to predict something when I can look it up in a table?). To build such a table (along multiple dimensions), I need to digitize/quantify/represent the world (aka datafication). Given the data (on different aspects of the world), we can combine, extend, reuse and repurpose it to support theories other than the original intent the data was collected for. However, it is user-beware when one does this, because the models or inferences are not tested under all possible usage contexts. Furthermore, there is a range of new “risks” – personal privacy violations (in the context of consumer data) and others (organizational intelligence, cyber-espionage) – depending on how many touch points there are to the data, in how many contexts it is utilized, and who controls the data and its uses. Additionally, another key viewpoint raised is that we do not need “general theories” that apply to a large number of data points; a number of “small theories” that address small samples are good enough (where the small samples are defined contextually). With the availability of computing technology, we can work with a number of small-sample theories to solve “small problems” which are large in number (hence big data!). When you really think about it, in the best (or is it worst?) case scenario, each “real-world” individual is a small sample – the grail of “personalization”.

Silver, in his book, focuses on the process of using data – promoting a Bayesian worldview of updating beliefs to distinguish the “signal” from the “noise”. However, his discussions are framed in a frequentist perspective. To perform updates on beliefs, or even to a priori discriminate signal from noise, one needs a model – the baseline. He discusses the limitations of models in different domains; see the table below. The rows indicate whether the primary domain models are deterministic (Newton’s laws, rules of chess) or inherently probabilistic. The columns indicate the completeness of our current knowledge – do we have all the models for phenomena in those domains?

                            Complete Domain Knowledge             Incomplete Domain Knowledge
Deterministic               Analog - Weather                      Analog - Earthquakes
                            Digital - Chess                       Digital - World War II events, terrorism events
Probabilistic/Statistical   Digital - Finance, Baseball, Poker    Digital - Economy, Basketball
                            Analog - Politics                     Analog - Epidemics, Global Warming

(The term "Digital" is used to mean "quantized"; "Analog" refers to the physical world.)

Improving the models, or making a prediction, requires analysis of data in the context of the model. The book highlights the notion of “prediction” – saying something about something in the future. However, it sheds little light on “inference” – saying something new about something at present, thus “completing one’s incomplete knowledge”. The sources of uncertainty also differ across models: a deterministic game like chess has uncertainty introduced by player behavior and the size of the search tree, whereas earthquake prediction is uncertain because we just do not know enough (our model is incomplete). The book makes no statements on the role of “big data” per se in terms of tools (all the analysis in the book can be done with spreadsheets or in “R”). Furthermore, the book highlights the different types of analyst “bias” that may be introduced in the inferences and predictions.
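
As a toy illustration of the Bayesian updating Silver advocates (all numbers are invented): start with a prior belief that a transaction is fraudulent, observe a signal with known hit and false-alarm rates, and update.

```python
def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Posterior P(H | E) from Bayes' rule."""
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1.0 - prior)
    return numerator / evidence

# Invented numbers: 1% base rate of fraud; the "card used overseas" signal fires for
# 30% of fraudulent transactions but also for 4% of legitimate ones.
prior = 0.01
posterior = bayes_update(prior, p_evidence_given_h=0.30, p_evidence_given_not_h=0.04)
print(f"belief in fraud moves from {prior:.1%} to {posterior:.1%}")
```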

In contrast to the books, the paper makes a case for a change in the approach to professional model building by statisticians: instead of picking the model and fitting the data, let the data pick the model for you. A professional statistician would search through the space of potential models (via a generate-and-test approach) and finally pick the appropriate one based on some criteria. Considering the premise of this paper, one can see a potential use case for big data in the model-building lifecycle, reconciling the different discussions in the books.
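
A hedged sketch of “letting the data pick the model”: generate a few candidate model families and keep the one with the best cross-validated score. scikit-learn is assumed to be available; the candidates and the synthetic data are arbitrary.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=500)   # synthetic data

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Generate-and-test: score each candidate on held-out folds, keep the best.
scores = {name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> picked:", best)
```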

Will have more on this topic in future posts.

Addendum: See the latest article on this topic in CACM.