Ruminations on AI, ML, Data Science – where are we?

This is the first post after a long hiatus from active blogging. Having spent the past three-plus years running a tech team – all of product, data and engineering – out of Bangalore, I have learnt much about how things are evolving in India and hopefully will get the opportunity to share that on this thread. In the meantime, over the past couple of weeks, I have spent some time looking at the evolution of ML, AI and their sister disciplines over this time frame, just to baseline myself. Shown below is a proxy indicator for “popularity” of these terms over this period (please click on the image for a magnified view).

Overall, these trends suggest lots of activity in the “enabling with AI” space across many application domains. One key thing to note is the downward trend in these graphs over the past quarter. Is this a leading indicator of things to come? It is too early to call. As things stand, everyone in tech is doing some sort of ML or data science, or wants to do something with data. We have also seen lots of venture funding and a great deal of innovation, entrepreneurial and tech activity over this time frame.

Firstly, what good things have come out during this period (perhaps my biased view)?

  1. Deep Learning as a viable tool in the AI toolkit – lots of libraries; custom-built hardware – chips and GPUs; a better understanding of where it works – image/video processing, speech, text – and where it does not; and a much clearer focus on the really hard problems, the need for explainability, the compute costs of data-intensive AI, etc. The key outcome of this journey is a better understanding of the complexities of building AI systems – even for R&D purposes.
  2. Improvements in Basic Modeling around Causality – For those who have been following the research around Uncertainty and AI, the basic structuring and formalization of Causality is a key first step in building complex reasoning systems. We finally have some kind of scaffolding to reason with data and build more reliable “chunks” of knowledge. Deterministic and probabilistic/data-driven knowledge may be combined effectively.
  3. Automated Driving and other applications of AI – Folks in a variety of domains are adopting AI/machine intelligence. More experimentation with the tools of AI builds awareness and experience of the complexities of AI. Domains of application range from agriculture and computational in-silico biology to health care and many more.
  4. Improvements in hardware – The availability of advanced chips, custom compute/data-acquisition devices, sensors, chips for signal/image/speech processing, drone technologies, and cheaper/smaller/energy-efficient data and power storage tech has made many real-world problems tractable in an engineering sense. The investments required to build heterogeneous platforms to solve real-world problems are reasonable. AI systems can utilize these platforms for data/knowledge acquisition and interaction with other systems and users. Improvements in components, communication devices, mechanisms, new materials etc. are driving advances in different kinds of “robotic” – aka mechatronic – systems. What is a robot versus a “mechatronic” system? The boundary between the two categories is the amount of autonomy embedded in the system.
  5. AI software toolkit “commoditization” – Advances in AI toolkits, development libraries and cloud infra – AWS, GCP, Azure – have helped teams focus on building the intelligence. SW systems engineering is a minor issue in the basic proof-of-concept phase of the product development lifecycle. Teams can focus on “capturing” and modeling the intelligence, as the SW has matured to a reliable point. AI APIs let one at least evaluate the viability of an idea.
  6. Latent Commercial Potential – Embedding intelligence in any process promises a variety of economic and related benefits, either for a consumer or an enterprise. Under the rubric of AI for the Enterprise, a number of initiatives are being pursued, such as: 1) the advent of “chatbot” technology/platforms for intra- and inter-organizational activities, 2) advanced data exploration and visualization technologies – as enterprises focus on exploiting data for better decision-making, and 3) investments in big data infrastructure and teams – as companies seek a better understanding of how to use AI technologies internal and external to an organization. Much of this has been driven by expanding consumer expectations (driven by advanced voice/speech/video technologies) and by the availability of large amounts of proprietary data inside and outside organizations. Also, large-scale AI R&D has moved to quasi-commercial and commercial entities rather than being purely academic – such as the Allen Institute and others, and the Chinese (private + public) initiatives.

However, there are a few key issues that are yet to be addressed in a concrete manner and are only slowly being understood. These are:

  1. Misunderstanding the Role of Data versus the Role of Knowledge in building Intelligent Systems – This is a very important problem. Current ML/deep-learning-driven approaches implicitly believe that, given enough data in any domain, “black-box” systems can be learnt. The primary motivation for the rise of ML-based systems was that “knowledge” need not be painfully encoded into a system but can be seamlessly learnt from data; since the system learns from data, it should be less brittle. Unfortunately, as folks are realizing, even data-driven systems are highly brittle – they are “knowledge weak” – and one does not even know where. So where do we go from here? How do we build systems that accommodate both, and evolve as each type of “knowledge” evolves? It is also important to remember that when we identify something as “data”, there is an implicit model behind it. For example, even a “concept” such as an “address” has a mental model – aka knowledge. Additionally, intelligence can be encoded in the “active agent”, in the environment, or in a combination of both. For example, laying guidance strips to help AGVs navigate the shop floor is structuring the environment; building a more general-purpose AGV that uses LIDAR and sensors is encoding intelligence in the active agent. It is a priori unclear what should be done with respect to these choices in most domains.
  2. Complexity of Collecting Data and Ground Truth for different phenomena – Most supervised learning approaches need data collection at scale. Also required is labelled data at scale, of reasonably high quality. Developing a viable approach to do this is a highly iterative process. Crowd-sourcing approaches only help so much – maybe they are good for “common-sense” tasks, but they may be totally infeasible for “specialized” tasks where background knowledge is required. How does one know a priori what kind of approach to take – purely data-driven versus knowledge-driven? What kind of “knowledge gathering” activities should one do?
  3. Engineering without the Science – the growing divide – Capturing intelligence from data – as models/rules etc. – and validating it for stability/robustness is an extremely difficult problem. Engineering, as I mentioned, is the easy part. In most organizations, engineering progress can be shown with minimal effort, whereas data science/AI work takes time. The organizational divide that this engenders is a major issue. It is important to realize that until the intelligence in a domain is codified, driving your product evolution based on engineering metrics is counterproductive.
  4. Are we setting ourselves up for an AI winter or an AI ice age? – With the amount of buzz in popular media around AI technologies and the huge investments being made, it is unclear whether the time horizons to solve these problems are being estimated properly. We have started seeing signs of major failures in many domains, but these are early days. Both optimists and pessimists abound, but the reality is somewhere in between. There is much we still do not know and have yet to discover: What is intelligence? How does it arise in biological systems? How is it to be captured and engineered in artificial systems? How is intelligence defined/modeled for an individual versus for a group? How does intelligence evolve? And many more. As society embarks on this quest, it is important that we do not throw out the baby with the bathwater – in many of these situations we do not even recognize the baby. However, one thing is certainly clear: as things stand, we understand very little about “intelligence”, human or otherwise. Our understanding of intelligence will evolve considerably over the coming years, and in the process we will build systems – some successful, some not so much.

The aforementioned issues have been around since the early days of AI (the 1950s), and the core questions – understanding the brain, modeling cognition (does it even take place in the brain?), knowledge representation, whether one kind of knowledge is enough for a task, how we measure/evaluate intelligence, etc. – require further basic research. Progress in basic research has been slow, as the problems are extremely difficult. Applied AI may be on the right path, but much engineering needs to be done – many systems need to be built in many domains and lessons learnt. However, this collective quest to build AI systems may finally drive us to focus on problems of real value and real impact in the long run. It is not yet too late to join the journey. Every problem worth solving is still being looked at! More on these problems in upcoming blog entries.

Meteor.js – A Server Side Developer’s Experience

Over the past few months, I have been working with Meteor.js (henceforth called Meteor), building out some complex UX workflows. Being a platform architect and systems developer, I had shied away from working on product UIs. However, after sitting in innumerable meetings discussing the urgency of UI development and the difficulty of hiring good UI developers, I decided to jump in and contribute to our team’s UI effort. I was emboldened by my recent research into Meteor and its potential capabilities in comparison to Angular, React, Ember, Derby and other (sub-)frameworks. Herein, I summarize my experience and some key lessons as my Meteor journey continues.

As I started getting into the amorphous world of UI development jargon – template languages, HTML5, different JavaScript libraries, AJAX calls, asynchrony hell etc. – there were a number of daunting subjects on top of my mind as a server-side developer:

  1. HTML, HTML5, CSS related – What are the variations? What can I do purely with HTML out of the box? What are the CSS issues? What are the clean abstractions? What should be in HTML, what in JavaScript and what in CSS? How do we deal with mobile and responsive design issues? How do we deal with browser-specific issues?
  2. JavaScript for the Java developer – Writing event-driven functional programs that run in a single thread is a bit different from writing conventional multi-threaded programs. Organizing the code base, handling asynchrony via events, and controlling the flow of events so that the appropriate UI behavior is exhibited have to be done quite differently than on the server side.
  3. MVC babel – Every UI framework talks about model-view-controller, with agreement only on the “model” but varying notions of where the view and controller code execute and how they are triggered. For a server-side developer, the pattern of observers and observables is well known, and one wonders why all this confusion. Terminology such as data binding, controllers living in different bits of code, and server-side patterns being adopted for UI development all lead to confusion.
  4. Disquiet with the loose, hacky nature of it all – Much of the doubt comes from i) JavaScript – interpreted functional code with no typing, the complexity of the event models, prototypal objects (rather than classical Java- or C++-style OO), scoping – lexical versus dynamic, global versus local vars; ii) the build process – how many JavaScript files should I include, and where? How do I package files, and in what order? How do I minify?; iii) what goes on the client and what goes on the server side? Much of this is exacerbated by implementations of REST APIs that differ in quality and abstractions.
  5. UI development cycle – How do I work with my design and HTML team? Who does what on which piece? How do we pull the different pieces together in every change cycle and feature iteration? Folks with different backgrounds and deliverables need to work on a tightly coupled artifact on tight deadlines with little slack for redo efforts. What are the dependencies on the code base?
  6. Building mobile front ends – Considering the amount of effort sunk into developing a good browser-centric UI, how does one transition the relevant bits of that effort to a mobile experience? Which elements change and which do not? What is better done as a “native UI” experience? This is key for startups building out products – we need good-enough front ends, in the shortest time frame, that encapsulate the product vision and experience to enable product testing and UX feedback (see the recent article on Google Inbox development).
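For a server-side Java developer, the single-threaded, event-driven model in item 2 above is often the first surprise. A minimal sketch (plain Node-style JavaScript, nothing framework-specific) of how queued callbacks run only after the current synchronous code completes:

```javascript
// In a single-threaded event loop, all synchronous code runs first;
// queued callbacks run on a later turn, even with a 0 ms delay.
const order = [];

order.push("start");

setTimeout(() => {
  order.push("timeout"); // runs only after the synchronous code below
}, 0);

order.push("end");

// At this point the callback has NOT run yet:
console.log(order.join(" -> ")); // start -> end
```

There is no second thread to preempt the running code, which is why long synchronous work freezes a UI, and why control flow has to be organized around events rather than locks.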

So how did adopting Meteor help ameliorate some of the above issues?

  1. Meteor provides a clean way to decouple HTML, CSS and JS, and handles all the file dependencies. Further, with Blaze (the templating language) and the notion of events and helpers, it helps a backend developer get a user interface up and running quickly. The JavaScript dependencies are resolved when everything is compiled (though the depth-first priority order of files may be a bit confusing). A key aspect is the ability to store “state” inside Mongo and manage it effectively without worrying about how to access user state on the client.
  2. Meteor also tames the JavaScript event model with a clean abstraction for linking synchronous and asynchronous calls on both the server side and the client side. The AJAX model can be simplified by constraining all outgoing calls to be made from the Meteor server side rather than directly from the browser. This allows for better development, as the event model can be debugged properly by constraining the state evolution to occur in only one place – the underlying Mongo collections.
  3. The MVC model can be mapped in a clean manner – the Model in the Mongo state, the View in the Blaze templates, and the “Controller” managed explicitly via state updates on collections, followed by the implicit reactive updates. The state updates can happen synchronously or asynchronously and be performed either by the user or by backend calls. This permits a cleaner conceptual analysis as UI flows get more complex. However, understanding reactivity is quite important (see more below).
  4. Meteor handles all the packaging and allows development of Single-Page Applications (SPAs) in a clean manner. Furthermore, it allows easy inclusion of third-party JS libraries, which provide a lot of functionality out of the box. All the major libraries – jQuery and others – are supported out of the box. It also handles variable scoping and provides some basic design patterns (for the first-time JS developer) to get going and build a functional UI without the major pitfalls of generic JS development.
  5. By architecting the subsystems of a Meteor application appropriately, it is possible for the a) app designer – who works on the HTML templates and CSS, b) UI JavaScript developer, c) server-side JS/API developer and d) Mongo/DB developer to work in “concert” to build out a feature all the way from the UI through the backend – on the same code base in an iterative, incremental manner. The different subsystems may run in different places on the network, but every person works on the feature in a coordinated manner. This has enabled us to be more productive in a short time with minimal hand-offs between team members. Furthermore, given the low learning curve (assuming you have not picked up other mental models of JavaScript development), team members learnt quickly, as they could experiment and iterate quickly.
  6. Finally, as Meteor continues to evolve, and with the recent integration of the Cordova toolset for mobile app development, there is a clean path to deploy and test your product on mobile devices. So from a product management perspective, if you architect the product properly, you can build for all form factors and get into the test cycle early. UI quality out of the box is pretty good and the code base is clean for the team to manage. Further, by supporting both Android and iOS, it is a big time-saver from a product management perspective.
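The MVC mapping in item 3 can be illustrated outside Meteor itself. The sketch below is plain JavaScript, with a hypothetical `ReactiveCollection` standing in for a Mongo collection plus Blaze's implicit re-rendering; it is not Meteor's API, just the shape of the loop: all state changes funnel through the collection, and registered views re-run automatically.

```javascript
// Toy stand-in for "Mongo collection + implicit reactive re-render".
class ReactiveCollection {
  constructor() {
    this.docs = [];
    this.views = []; // the "Blaze templates" watching this collection
  }
  onChange(render) {
    this.views.push(render);
    render(this.docs); // initial render
  }
  insert(doc) {
    this.docs.push(doc);
    this.views.forEach((render) => render(this.docs)); // reactive update
  }
}

const tasks = new ReactiveCollection();
let rendered = "";

// "View": re-renders whenever the collection changes.
tasks.onChange((docs) => {
  rendered = docs.map((d) => d.title).join(", ");
});

// "Controller": state updates on the collection; nothing touches the view.
tasks.insert({ title: "write post" });
tasks.insert({ title: "review PR" });

console.log(rendered); // write post, review PR
```

The point of the design is that the view code never needs to be told to refresh; the single funnel for state updates is what makes complex UI flows debuggable.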

However, even with these benefits, there are a number of common pitfalls – some of which I have run into myself, and some gleaned from conversations with developers who explored Meteor after cutting their teeth in other frameworks.

a) Handling reactivity in Meteor – Understanding reactivity is key. When and how does it get activated? What are the basic “reactive” variables out of the box (user, Session, Mongo collections), and how do they impact your app design?
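A minimal sketch of what "reactivity" means mechanically, with nothing Meteor-specific assumed: a reactive variable records which computations read it, and re-runs them on write. Session variables and collection cursors play the role of `ReactiveVar` here, and `autorun` plays the role of Meteor's `Tracker.autorun`.

```javascript
// Dependency tracking in miniature: reads register the currently running
// computation; writes re-run every computation that read the value.
let currentComputation = null;

class ReactiveVar {
  constructor(value) {
    this.value = value;
    this.dependents = new Set();
  }
  get() {
    if (currentComputation) this.dependents.add(currentComputation);
    return this.value;
  }
  set(value) {
    this.value = value;
    this.dependents.forEach((fn) => autorun(fn)); // invalidate + re-run
  }
}

// Equivalent role to Tracker.autorun(fn): run fn while recording its reads.
function autorun(fn) {
  currentComputation = fn;
  fn();
  currentComputation = null;
}

const counter = new ReactiveVar(0);
const log = [];

autorun(() => log.push(`count is ${counter.get()}`)); // runs once now...
counter.set(1); // ...and automatically again on each write
counter.set(2);

console.log(log);
```

Knowing that a computation re-runs on *every* write to *any* reactive value it read is what explains most surprising re-render behavior in practice.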

b) Moving large amounts of data between the client and server side – Should you pull it into Mongo first and then subscribe on the client, or vice versa? Should one even use Minimongo? What are its limitations?

c) Using Subscriptions – How does one use subscriptions effectively? Subscriptions allow you to pull data into the client on demand (thus minimizing how much data you hold on the client). However, you also need to understand when to keep collections separate and when to share them in Minimongo. A single collection can power different UI views, so when routes change, subscriptions are also reset. Much of this knowledge is evolving actively as developers use Meteor in different contexts.
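The publish/subscribe idea in (c) can be sketched without DDP or Meteor's actual API (the names below are stand-ins): the server publishes a filtered view of a collection, the client's cache ("minimongo") holds only the subscribed documents, and changing the subscription – as a route change would – resets what is cached.

```javascript
// Server-side "collection".
const serverTasks = [
  { _id: 1, owner: "alice", title: "ship feature" },
  { _id: 2, owner: "bob", title: "fix bug" },
  { _id: 3, owner: "alice", title: "write docs" },
];

// Stand-in for Meteor.publish: a named, parameterized filter.
const publications = {
  myTasks: (owner) => serverTasks.filter((t) => t.owner === owner),
};

// Stand-in for the client's minimongo cache.
let minimongo = [];

// Stand-in for Meteor.subscribe: replaces the cached documents, as
// happens when a route change tears down and re-creates a subscription.
function subscribe(name, ...args) {
  minimongo = publications[name](...args);
}

subscribe("myTasks", "alice");
console.log(minimongo.length); // only alice's documents reach the client

subscribe("myTasks", "bob"); // "route change": subscription reset
console.log(minimongo.map((t) => t.title)); // [ 'fix bug' ]
```

The practical lesson is the last two lines: the same client-side collection can hold entirely different documents depending on the active subscription, which is why views must not assume the cache is the whole data set.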

d) Routing – IronRouter is the default router but behaves a bit differently from other off-the-shelf routers. It is a reactive router, and it affects subscriptions and other reactive parameters. Utilizing it along with session variables is key to building effective navigation in your app. Errors here can be costly for the app workflows.
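A reactive router couples three things: the current route, the subscriptions it requires, and the template it renders. A toy sketch of that coupling (hypothetical names and structure, not IronRouter's actual API):

```javascript
// Each route declares the template to render and the data it needs.
const routes = {
  "/tasks": { template: "taskList", subscription: "allTasks" },
  "/tasks/done": { template: "taskList", subscription: "doneTasks" },
};

const state = { template: null, subscription: null, renders: 0 };

// On navigation the old subscription is torn down, the new one set up,
// and the template reactively re-rendered.
function navigate(path) {
  const route = routes[path];
  state.subscription = route.subscription;
  state.template = route.template;
  state.renders += 1;
}

navigate("/tasks");
navigate("/tasks/done"); // same template, different subscription

console.log(state);
```

Note the second navigation: the template does not change, but the subscription does, which is exactly the case described above where a route change silently resets what data the shared template sees.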

e) Using jQuery and the associated event model – Using jQuery patterns to locate elements and update element states has to be carefully thought through. Understanding of the interplay between jQuery and rendering updates is still evolving in the context of template.rendered callbacks.

f) Organizing Blaze templates – Blaze templates can be organized granularly to make effective use of reactivity. Isolating the different levels of UI updates can considerably speed up your app. This requires a bit of iterative template refactoring and understanding your data to get it right.

Though Meteor considerably speeds up the UI implementation process, there are some key design choices that the UI architect has to make. These include:

  • The collections shared between client and server side, and the subscriptions/publications that coordinate data flow (over the DDP protocol).
  • Handling session management in the app (with or without cookies).
  • Template layouts
  • Router layouts and the use of Session variables to activate template renderings
  • Event models and standardizing on event behaviors

Having Meteor in the toolkit of a product manager (at a startup especially, but not only there) enables you to think differently about how to build, deploy and test your product to get early feedback. Many functionalities that may be considered back-end can be pushed into the Meteor layer and managed there. Features can be added incrementally as user feedback is gathered and the product experience is fleshed out before scaling out. Meteor also enables incremental API development as you understand the units of “granularity” one may want to support. Basic out-of-the-box Meteor can support testing most of the core functionality of a product on small audiences. Furthermore, much of this can be built out with small teams. From an overall perspective, either strategy – from product to platform or from platform to product – can be effectively executed as you manage the “functional” footprint of the product, building the MVP to get traction. Furthermore, using Meteor does not preclude the use of Angular or React (two popular frameworks) – they can still be used within the Meteor framework in an incremental manner. So you start developing your product with the least technological commitment and can evolve as your product-market fit evolves. In future blogs, I will share some insights into using Meteor to power data-centric versus workflow versus realtime apps.

Building Product – Transitioning from a Service mindset

Many “real-world” ideas that initiate the launch of a startup have their roots in the world of “services”. Somewhere, someone trying to solve a problem for a customer realizes that there is a major need for a viable, repeatable solution; their idea or approach to the solution method looks structured enough to be “productized”, and hey presto, it may be worth launching a startup. Most of us have seen this in verticals ranging from media and ad tech to analytics and more. This path – from relevant idea, to performing “service-centric” projects, to building a worthwhile product – is fraught with the numerous pitfalls confronted by startups in general. But there is something unique to startups where the core founding team had their basic upbringing and learnt the ropes in a service context such as consulting, running operations (sales, marketing, customer management), or project/program management in some vertical. Over the past decade, I have worked in and with teams/peers from a range of such backgrounds and have learnt a lot from them. Here, I would like to briefly highlight some of the key potential blind spots that may confront such service-mindset-driven startup teams in defining the product, building the product, selling the product and supporting the product.

Defining what the “product” is: Outlining the core features of a product is a major exercise. What is the minimal set of features, with some “tech” barrier, that provides a baseline for future development and is still a viable MVP for customer engagement? Given the service mindset, feature creep and the inability to draw a line between product features and “services” is a major pitfall. The resolution is not to hire a product manager to manage the product; the core team has to first internalize what makes up the “product” and its value proposition. If the boundary is not clear to the core team, a product manager cannot enforce it without proper conceptual constraints from the founding team. Though one may profess to follow lean startup ideas and such, practising them is quite difficult (reading a book versus actually doing it). The key to a successful product is getting rid of unwarranted features as early as possible (even before development!). Though you want to please the customer, you should decide on pleasing them with just one core set of themes/features/value that others in the competition do not yet have, and make it extremely easy to use that feature. This is your beachhead to build upon. Establish the “magic sauce” and its boundaries as early as possible with the current toolkit you have.

Building the product: Most real-world problems require solutions that are complex, and products have a number of features driven by different objectives – easy to engineer and maintain, fits in with the current wave of tech buzz, the core beliefs of the founding team as to what to have or not have, what the competition has or does not have, and finally, what the customer wants, really needs and wishes to have. Considering that coding/tech is the easy part once things are defined, the most important aspect is picking the right features/functionality at each layer of the “architecture” to support the initial core functionality. Making choices with many future needs in mind overwhelms engineering, as the dependencies mushroom, making delivery of a decent working MVP difficult. An important question to settle is whether you want customer feedback before or after the first core version of the product is built. Leaning towards the customer too early adds to the complexity and may misinform the team as to what to focus on. All customer input is not equal. Furthermore, though folks may have built tech earlier in their careers, the tech toolkit is constantly changing, and old rules may not apply. Additionally, you need to be clear whether the development efforts are throwaway prototypes used to define the product or core aspects of the future product. Tech teams need to be focused on core aspects rather than short-term needs, depending on resource availability.

Selling/Marketing the product: Once the first revision of the product is beyond the demoable stage, selling the current revision for customer engagement, POCs and initial adoption has to be done with discipline. A clear picture of what the product can do with current features, and what the benefits are, has to be defined. This has to be communicated to customers, and the aim should be to find customers who want to work with what is on offer. Finding such customers is the difficult part. During this phase, it is easy to get misled by non-product-centric customer requests (which do not improve the core product) and, under revenue pressure, take on service-centric activities that are peripheral to the product development activity. The whole focus (even while attracting initial customers) is to refine product-market fit and to identify the high-value features/functionality in the product and the nature of the customers who want those features. In a noisy market with many competitors and lookalikes, one is easily distracted into adding me-too features and providing non-core service functionality. Generating revenue or building a customer base becomes a focus only after a stable beachhead is established. In a service business the quick cash flow is attractive, but in a product business it takes immense discipline to build the right product to get sustainable revenue, else you falter. Jumping the gun without doing so usually does not help and burns resources.

Serving the customers – solutions around the product: Once a core product is defined, built and deployed, peripheral derivations and solutions activity can consume resources in efforts to retain customers. It is important to be disciplined about which features become core to the product and which are delivered in partnership with third parties. Maybe the growth curve is muted because of such a choice, but it may be worthwhile in the near term in order to develop the product to the next level of core functionality, ahead of other incumbent products. One should also be comfortable letting customers go if they do not fit in with the core mission of the startup. Avoiding dilution of focus is key. Every customer/partner is not equally important.

Overall, transitioning from a service mindset is very much possible, and many good folks have done it. I see many of my peers trying to do so and running into the same aforementioned issues as they navigate the world of startups. Having a sense of what is important from a product perspective, taking the time to work it out before resources are frittered away, and having the discipline to stick to it are essential for success. Many interim strategies are possible, such as: a) having a service arm in the organization to maintain incoming revenue, b) outsourcing/offshoring the service org, or c) partnering for services, but each of these has costly overheads (though you generate revenue to minimize funding risk). Such overheads, which drain a startup’s resources, include: a) distraction between product and service focus, b) the cultures of a service organization and a product organization differing remarkably in people, processes, expectations and incentives, and c) as a startup, your mission looking diffuse and your paths of growth looking shallow/bushy and unnecessarily complex. Building a successful product company is an all-or-nothing proposition, and accepting that early enough in a startup’s lifecycle can be highly beneficial. Many of the aforementioned thoughts have emerged from my discussions with folks in big data analytics startups, media/ad-tech startups (producing content or running mini-agencies), and offshore product development shops in mobile/web as they attempt to transition and redefine themselves. Hopefully, these musings help some entrepreneurs refine their thinking/strategy as they pursue their vision.


Big Data – Ponderings of a Practitioner – IV

Over the past few weeks, I have been reviewing the big data startup space, looking for companies with key ideas in exploiting big data. During this exercise, I noticed that a number of start-ups focus on “data cleanup” tools and services. Though this is a useful category of tools (similar to ETL tools in the world of data warehousing), it raised the following question. Given the theme underpinning big data efforts – the need to obtain new “generalizable” insights from rows upon rows of “instance-level” data – is it really that important to “clean up” your data? Why or why not? What types of data-“cleaning” tasks are worthwhile? Continuing my series of big data related posts (I, II, III), here are my current thoughts on this topic.

For the purposes of this discussion, let us assume that we have collated our “data sources” into a canonical table: each row is an “instance” of the data element at hand, and each column is an attribute (property) of that instance – e.g., for a customer (a row), the properties are address, age, gender, payment info, history (things bought, returned) etc. In such a conceptual model, what are the issues that may require “data cleaning”?

  • fixing column types (should hex ids of customers be stored as strings or as longs?)
  • fixing each entry in a cell (row/column) to conform to its type (removing errors in copying from the source – age columns with text in them; artifacts of data modeling – converting ASCII text in varchar to unicode)
  • fixing each entry in a cell to correlate “properly” with entries in other cells of the “same” row (for example, the age column having entries less than 10 years old, transaction dates before the company was set up, whether the individual was alive when the transaction happened)
  • reconciling with events in the “real world” – that the actual event did happen (when an event is recorded, make sure that “corroboratory” events did happen, by bringing the additional data in or linking virtually to it)
  • proposing/adding new column types if warranted (if the customer bought a high-end item, did they also buy insurance? – add a column for bought insurance (yes/no); or, if data exists as unstructured blob types (images, text, audio, video), add columns to encode derived properties)
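Several of the cleaning steps above reduce to simple per-row coercions and consistency checks. A sketch over the canonical-table model (rows as objects, columns as properties; the field names and limits are made up for illustration):

```javascript
// Canonical table: one object per row, one property per column.
const rows = [
  { id: "a1f", age: "34", txnDate: "2012-05-01" },
  { id: "b2e", age: "unknown", txnDate: "2012-06-15" }, // text in an age column
  { id: "c3d", age: "7", txnDate: "1989-01-01" },       // predates the company
];

const FOUNDED = "2001-01-01"; // ISO dates compare correctly as strings
const MIN_AGE = 10;

function cleanRow(row) {
  const age = Number(row.age); // fix the cell to conform to the column type
  const problems = [];
  if (Number.isNaN(age)) problems.push("age not numeric");
  else if (age < MIN_AGE) problems.push("age below plausible minimum");
  if (row.txnDate < FOUNDED) problems.push("transaction predates company");
  return { ...row, age, problems };
}

const cleaned = rows.map(cleanRow);
const flagged = cleaned.filter((r) => r.problems.length > 0);

console.log(flagged.map((r) => r.id)); // [ 'b2e', 'c3d' ]
```

Type fixes are column-wide and mechanical; the intra-row checks (age versus transaction date, and so on) need domain knowledge to write, which is where the "cleaning" effort actually goes.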

Once the data is reasonably organized, two types of “insights” may potentially be obtained – row-specific and “inter-column”. An example of a “row-specific” or instance-specific insight is detecting fraudulent behavior by a customer: every time they buy something, they return it within two weeks. Here, for a single customer, we collect all the transaction dates and return dates, and “flag” the customer if the frequency of occurrence is beyond a threshold or significantly beyond the norm. We fixate on a particular row identifier and characterize its behavior. An “inter-column” observation: for many customers whose credit card is used beyond a certain radius of their default geographical location, many such transactions are identified as fraudulent after the event. One of the generalizations from such event histories is: if a card is used overseas, block its use (because it may have been stolen!). In a column-type insight, we characterize values in a column and attempt to relate them to values in other columns. In every domain, we can potentially identify a number of such insights and develop approaches to detect them. However, establishing row-type insights is much less stringent than establishing column-type insights. In a row-type insight, we try to find instances that meet a criterion or satisfy a piece of knowledge. In a column-type insight, we try to discover a new piece of knowledge.
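The two kinds of insight can be sketched directly over the canonical table. The row-specific check fixates on one customer and tests a criterion over their transactions; the inter-column check relates values in two columns across all rows (field names, thresholds and data are illustrative):

```javascript
const txns = [
  { customer: "c1", returned: true,  overseas: false, fraud: false },
  { customer: "c1", returned: true,  overseas: false, fraud: false },
  { customer: "c1", returned: true,  overseas: false, fraud: false },
  { customer: "c2", returned: false, overseas: true,  fraud: true  },
  { customer: "c2", returned: false, overseas: true,  fraud: true  },
  { customer: "c2", returned: false, overseas: false, fraud: false },
];

// Row-specific insight: fixate on one row identifier, test a criterion.
function chronicReturner(customer, threshold) {
  const mine = txns.filter((t) => t.customer === customer);
  const rate = mine.filter((t) => t.returned).length / mine.length;
  return rate >= threshold;
}

// Inter-column insight: relate the overseas column to the fraud column
// across ALL rows, regardless of which customer they belong to.
function fraudRateWhenOverseas() {
  const overseas = txns.filter((t) => t.overseas);
  return overseas.filter((t) => t.fraud).length / overseas.length;
}

console.log(chronicReturner("c1", 0.8)); // flags this one customer
console.log(fraudRateWhenOverseas());    // a generalization across rows
```

The asymmetry the text describes is visible in the code: the first function applies a known criterion to one instance, while the second measures a relationship that, if it holds up, becomes a new piece of knowledge about the whole population.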

What happens to these two types of insights if data is missing or censored? Both types are quite robust to missing data – as long as we continue to capture new incoming data properly. Even if we miss some instance-specific data, if new data reflects the behavior, it will be detected in the near future. For example, in the case above, if a customer’s fraudulent behavior falls below the threshold (because of erroneous data), it may not be detected the first time around, but possibly a few months later. It may also be possible to flag a customer whose behavior is within a delta of the threshold and put them on a watchlist of sorts! Depending on the behavior, one can also “impute” data (for example, if such behavior has been observed in customers in a certain income range and the customer under review lies in that income range) and still potentially flag fraudulent behavior even though the signal is below the threshold. Heuristics and other techniques are applicable here.
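The near-threshold watchlist idea is simple enough to sketch directly; the threshold and delta values here are illustrative, not recommendations.

```python
def triage(score, threshold=0.5, delta=0.1):
    """Flag a behavior score, watchlist it if it is within delta below the
    threshold (possibly due to missing/erroneous data), else ignore it."""
    if score >= threshold:
        return "flag"
    if score >= threshold - delta:
        return "watchlist"
    return "ok"
```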

Inter-column insights/relations/correlations are potentially more robust, considering that these are more “general” results. Think about it – assuming the story were true, would it have mattered if, instead of an apple, a pear or a plum or a banana fell on Newton’s head (or, for that matter, on any individual’s head, or if instead of dropping from a tree it was thrown by somebody or dropped by a squirrel)? If the phenomenon were to occur in the near future, it would be captured and the generalization detected. In the fraud example in the previous paragraph, the link between fraud and geography is established across a number of individuals (and is not row-constrained). Missing data matters only if fraud is a very “small”-number event. In such cases, capturing every such event matters – accumulate enough positive events to detect signals from noise. The key is to understand whether “cleaning” data adds to the frequency of occurrence of such events. Cleaning data cannot add new events if the underlying capture process did not record them (if there is no fraud (yes/no) column, we cannot infer anything)! At web-scale, if the phenomenon is bound to recur in the near future, it may not be worthwhile cleaning data.
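An inter-column insight of the geography/fraud kind can be sketched as an aggregation across all rows rather than per customer. The bucket width and the record layout are assumptions for the example.

```python
from collections import defaultdict

def fraud_rate_by_bucket(rows, bucket_km=100):
    """Inter-column insight sketch: relate the distance-from-home column to the
    fraud label column across all rows. rows: (distance_km, is_fraud) pairs."""
    agg = defaultdict(lambda: [0, 0])  # distance bucket -> [fraud_count, total]
    for dist, fraud in rows:
        bucket = int(dist // bucket_km)
        agg[bucket][0] += int(fraud)
        agg[bucket][1] += 1
    return {b: fraud / total for b, (fraud, total) in agg.items()}
```

If the fraud rate rises sharply in the far-distance buckets, that is the kind of column-level generalization (“block overseas use”) the text describes.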

The more I ruminate, and based on work done thus far, it seems cost-effective to start with whatever “clean” data you have (missing data, or leaving out erroneous data, functions as an implicit Occam’s razor) and then add incremental complexity to your big data model. You only pay for what is worthwhile (in terms of analytics resources – compute/storage/data scientists) and also identify the proprietary insights – those actionable and beneficial for your business per se – rather than something more cumbersome. Cleaning data for inferring relationships may not be worth the effort unless you are in the record-keeping business – the world is too noisy and fast-changing to make it meaningful. Data warehousing transactional events is important to keep a record of things that your organization did or did not do. However, using that same stringency for model-building may not be necessary – the focus of data capture is to enable model building with error. Furthermore, cleaning up past data may not be meaningful: the knowledge thus gleaned may not be relevant now (or in the future – though we may say “history repeats itself”), and it may introduce erroneous inferences and distort our understanding in a dynamic environment. Cleaned data may also fail to reflect reality (since you have distorted your record of what may actually be happening in the real world) and introduce ghost artifacts. Overall, cleaning data should be undertaken on a large scale only if one has some notion of the potential information gain/utility (at a row-specific or inter-column level), measurable in some manner (considering that the world is inherently noisy) – an interesting optimization problem. Anyway, I will have more on this topic in due course as I continue to work on some interesting projects in multiple domains.


Text analytics – a practitioner’s view – I

After nearly a four-month hiatus from blogging, I finally got around to writing the first in a series of posts on the state of the art in text analytics. I have been interacting with various practitioners in both industry and academia on this topic over the past year. The general “notion” (at least in industry) seems to be that text analytics is a nearly “solved” problem, at least for most common commercial purposes (“Google does it” or “Netflix does it” or “FB does it” are some common refrains!). The belief seems to be that open-source NLP codebases are good enough to build production-ready systems. IMHO, the field of text analytics is still wide open, many algorithms could be improved considerably, and much remains to be done. As a practitioner in this space for more than a decade, I share below some key observations as text analytics becomes a core component in a developer’s toolkit.

The key tasks, traversing up the hierarchy from “character”-level analysis to higher-level abstractions, are:

  • Language encodings (UTF-8, ISO-8859, UTF-16, UTF-32 – knowing the encoding of the document is the first step!)
  • Language detection (different languages potentially use the same glyphs – Latin languages share glyphs, as do Cyrillic languages)
  • Spelling correction
  • Sentence boundary detection (When does a period signify end of sentence?)
  • Concordancing, Co-occurrence analyses
  • Tokenization
  • Lemmatization
  • Stemming
  • Part-of-Speech (POS) tagging
  • Word-sense disambiguation
  • Acronyms and other kinds of textual artifacts
  • Morphological analysis
  • Language modeling (n-gram analysis)
  • Corpus analyses, language evolution, empirical characterization
  • Chunking
  • Noun-phrase chunking, Verb-phrase chunking
  • Named Entity Recognition (NER)
  • Semantic Role Labeling
  • Relation extraction
  • Event extraction
  • Coreference Resolution (anaphora, cataphora, appositives)
  • Parsing – dependency, probabilistic CFG, clause detection, phrase analysis
  • Affect analysis – sentiment, sarcasm, metaphors and more, objective versus subjective/opinion classification
  • Intent analysis
  • Dialogue analysis
  • Semantic entailment
  • Categorization – text classification along many content/meta-data driven dimensions
  • Summarization (single document, paragraph, multi-document)
  • Author characterization based on text they create, different types of source analysis
  • Plagiarization, similarity detection, deception detection
  • Clustering – theme analysis and identification
  • Search, question answering, query expansion
  • Topic identification, topic tracking
  • Predicate analysis – document to graph (generating triples and other “semantic” structures)
  • Text synthesis – dialogue generation, response generation
  • Text translation – machine translation from one language to another
  • Text-image correlation
  • Text-to-speech and vice-versa (utterances introduce their own nuances)
  • Basic text data processing/data cleanup/normalization/standardization (generating data for mashups and more)
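To make one of the tasks above concrete, here is a deliberately naive sketch of sentence boundary detection (“when does a period signify end of sentence?”). The abbreviation set is a made-up stand-in; production systems learn such lists from a corpus or use a trained model.

```python
import re

# Illustrative abbreviation list only -- real systems derive this from data.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc", "vs", "e.g", "i.e"}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+\s+", text):
        words = text[start:match.start()].split()
        last = words[-1].lower() if words else ""
        if last in ABBREVIATIONS:
            continue  # the period likely ends an abbreviation, not a sentence
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences
```

Even this toy version shows why the task is not trivial: the same glyph (the period) serves several grammatical roles.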

The aforementioned list is not comprehensive but covers most of the core “NLP” tasks that form the basis of many commercial applications – recommender systems, social media analysis, document analysis, ontology development, the Semantic Web and more. Furthermore, in many real-world contexts, many of these tasks need to be interleaved (possibly in different sequences), depending on the domain. Additionally, approaches to these tasks also depend on the nature, aka “register”, of the text, such as:

a) written text versus spoken text
b) well-written editorials – with good grammar versus badly-written text, text with many pictures (sketches/diagrams and other visual aids)
c) Long documents (books/monographs/journals/legal cases/legal contracts/news reports) versus small documents (emails, blogs)
d) Short noisy text – Social commentary/conversations – social chatter, comment streams, command streams in emergency services, radio chatter
e) speech to text dialogs – voice complaints, voice mail, dialog transcriptions, closed caption text (in video), dictations (physician’s notes)

Furthermore, the text can be in different languages (or some documents can be in mixed languages). Each language has its own “grammatical” substrate – SVO (subject-verb-object) versus OVS and combinations therein, based on the underlying language family. Some languages have multiple “written” scripts (different sets of “glyphs”), such as Japanese kanji and kana. Moreover, for the same language, accents, lingo and common idioms of use vary geographically and culturally (imagine the differences in English spoken in England versus the US versus India versus Australia). This adds another layer of complexity to the task. “Real” natural language also has many special cases affecting the semantic and pragmatic analysis of the text.

For each of the aforementioned tasks, the most worthwhile approaches include the following:

  • building a domain/context-specific rule-base/pattern-base that identifies and tags the data appropriately. Additional “knowledge” may be used – such as lexicons, pre-codified taxonomies and more.
  • collecting labelled training data and then training “classifiers” (for both sequential and non-sequential data) – linear classifiers, SVMs, neural nets and more. Labelled training data may be collected for structures ranging from simple tokens to complex trees/graphs. Labelled data may also be used to identify patterns for encoding as rules.
  • hybrid combinations of the above – linked up in complex data flow pipelines performing different transformations in sequence.
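The first approach – a domain-specific rule-base/pattern-base plus a lexicon – can be sketched as follows. The lexicon entries, tag names and patterns here are purely illustrative, not drawn from any real taxonomy.

```python
import re

# Illustrative lexicon and patterns only; a real system would load these from
# curated domain resources.
LEXICON = {"bangalore": "CITY", "acme corp": "ORG"}
PATTERNS = [(re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "DATE"),
            (re.compile(r"\b[\w.]+@[\w.]+\.\w+\b"), "EMAIL")]

def tag(text):
    """Return (start, end, label) spans found by lexicon lookup and regexes."""
    found = []
    low = text.lower()
    for phrase, label in LEXICON.items():
        for m in re.finditer(re.escape(phrase), low):
            found.append((m.start(), m.end(), label))
    for pattern, label in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), label))
    return sorted(found)
```

In practice such rule-bases are interleaved with trained classifiers in the hybrid pipelines mentioned above.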

A large number of open-source libraries and API services provide implementations of algorithms that address a range of the tasks outlined above. Each of them uses one or more of the approaches listed above. I will reserve reviews of tasks/good libraries for future blog posts. Performance of these libraries is highly variable – depending on the tuning effort (knowledge or supervised labelled data) – and there is no one-size-fits-all. Performance is measured in terms of precision/recall and also throughput (in both batch and real-time settings). For example, a generic POS tagger trained on the Brown Corpus cannot be effective at tagging Twitter chatter. Further, the libraries also vary in their underlying feature definition/management and data pipelining infrastructure.

Getting value out of open-source libraries requires a sound strategy of how to select, use and maintain these libraries inside your environment.  The approach that I usually follow (after learning from mistakes due to over-optimism and built-in biases!) is as follows:

  • Build a collection of test cases with real data for your domain/problem of interest. These need to be reasonably large (covering at least the key “scenarios” of your application) and manually labelled – especially by the folks who will use the aforementioned libraries. Labeling may involve identifying both the “input” patterns and the “output” classes/categories, or whatever they may be.
  • Formulate a pipeline of the above core tasks from the libraries under consideration and run the test data through these pipelines. Also, define performance expectations in terms of requests per second, documents processed per second or, for interactive systems, round-trip times. Compare the resulting outputs with the manually labelled results. Get a sense of how close/how far off the results are while identifying both expected behaviors and errors at every stage of the pipeline. Also, evaluate recall/precision metrics on your data versus what has been published by the library developers. Academic and research projects only have to establish the occurrence of a phenomenon, not deal with all cases. Establishing generality empirically in system-building efforts is beyond their resources, so it behooves the commercial developer to really understand the boundaries of the libraries under consideration.
  • Analyze the experiment. Did we get things right because every stage of the pipeline worked? Do we need additional knowledge? Are there additional test cases that have not been covered? Does the library need additional resources? Do we need to tinker with the underlying code? Does the library fit in with the overall architectural requirements? Can we incrementally modify the processing pipelines (split, add, modify new/old pipes)? Can we extend the core library algorithms (new features, new memory manipulation techniques)? Does the underlying rule-base/pattern engine or the optimization solvers powering the machine learning algorithms really scale? Are they built on well-known math libraries or home-grown codebases? Can we tune elements such as thread/process/memory allocations easily?
  • In most real-world systems worth their salt, one needs to combine both “symbolic” and “statistical” knowledge (for background, please refer to the Chomsky/Norvig exchange). To support such a strategy, is one library enough or do we need multiple? What kinds of knowledge sources can we integrate reasonably effectively? For the application at hand, based on the initial experiments – is more labelled data going to improve performance (because you cover more of the “hypothesis” space), or more rules that generalize effectively? More labelled data wins in most use cases where the behaviors you are learning about fall, on average, on the bell curve. If your application focuses on the tails, even obtaining enough data to learn from may be difficult. The usual heuristic: i) lots of labelled data/easy to label – more data, simple algorithm; ii) less data – then you need more rules to generalize. Obtaining “quality” labelled data is not as easy as it is made out to be – by running crowdsourcing exercises on Amazon Mechanical Turk or other platforms, you may collect a lot of junk or redundant data. If your application is niche or your domain requires nuanced knowledge, one also has to plan carefully how to get manually labelled “gold”-standard data. One should also consider how to incrementally obtain and integrate such data and track quality improvements.
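The comparison step above – predicted outputs versus the manually labelled gold standard – reduces to a small, reusable metric computation. A minimal sketch, assuming annotations can be represented as hashable items such as (doc_id, span, label) triples:

```python
def precision_recall(gold, predicted):
    """Score predicted annotations against a manually labelled gold set.

    gold, predicted: iterables of hashable items, e.g. (doc_id, span, label).
    Returns (precision, recall, f1).
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # true positives: predictions matching gold
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Running this per pipeline stage, not just end-to-end, is what localizes where the errors creep in.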

Suffice it to say that most anecdotal claims about the performance, usability and other aspects of these libraries may not meet your own specific requirements – which is what you must focus on satisfying, or at least identify early enough in your development process to have a strategy in due course. Hopefully, this short piece outlined key aspects of how to go about building/incorporating text analytics into your own applications. In future posts, I will address various related issues, for example: a) generating test data for comparing performance, b) pointers to useful resources for building text-analysis-centric systems, and c) a summary of recent research literature on the core tasks discussed above.


“Design” – Some thoughts

I recently read “The Design of Design – Essays from a Computer Scientist” by Fred Brooks of “The Mythical Man-Month” fame. The collection of essays attempts to make a larger point about the role of “design” in engineering activity as everyday systems get more complex for the average user. With the focus on “design” – primarily motivated by the success of Apple products – sweeping across the tech landscape, and as I engage with my peers in the industry in various capacities (each of whom has a common-sense notion of “design”), I realize it is important to keep a few pointers in mind that can guide the “design” effort. The term “design” is both a verb (an action or process) and a noun (notionally the output of the process), making it quite an overloaded term.

Having studied “design” in all its nuances as an academic researcher in the days of systems design research in the 90s, and being an everyday “design” practitioner to date, the following are some key conceptual potholes to understand:

  • Is one designing a product/system, designing a process, or doing both? As a designer, one has to be clear about the intended end result. For example, in the food processing industry, folks who make, say, “doughnuts” or “potato chips” design large-scale processes with automation to make these items in large batches. Choosing the automation technology, figuring out the temperature of the vegetable oil to fry the doughnuts or potato wafers, and designing techniques to sort and pack are parts of the “process” design activity (while designing custom machinery is a product design activity). Similarly, the recent craze of 3-D printing (an innovation in process) promises to lead to a new generation of products. Designing a physical widget – such as your everyday kitchen blender – is a product design activity. Choosing the motor, designing the blades, testing the effect of speed (frappe, blend, grind etc.) on different types of food, designing the blender jars (metal or glass), designing the housing, choosing the material for it, even possibly designing/selecting the manufacturing process are all different aspects of the product design activity. Good examples are the Dyson vacuum cleaners – new materials, new suction technology – or the Roomba. In UX design (say, in a mobile app), the designer defines both the process – how you want a consumer to engage/interact – and the product – the widgets/“entities”/information units the consumer interacts with. In most innovative/creative activities, new processes lead to new product designs, new product manufacturing requires new processes, and so there is a major interplay between these two aspects. However, as a designer, one tries to fix one aspect and focus on the other, as the whole exercise is too complex. For example, in UX design, we assume the set of UI widgets is fixed (the UI library provides this) – the focus is on how you use these widgets to build the consumer/user’s journey.
  • Designing versus Planning: Another source of confusion (since design is a process) is the conflation of designing with planning. Both activities have “goals” and constraints. However, design is an open-ended activity (the nature of the solution is not known). Design is a many-sided problem with many degrees of freedom – one can change inputs, outputs or any other relevant aspect, and a priori you do not know which one to start with. A final workable solution may or may not exist. Planning is a more well-specified activity, and a solution can be found (however sub-optimal). Usually, planning deals with resource management (time, people, tools, space, material, money) and the aim is to achieve a goal (wherein the component tasks are pretty well specified). The terms “plan” and “design” are used interchangeably in some contexts, such as floor plans and layouts in architecture and landscaping.
  • Viewing the design problem as search, constraint satisfaction, multi-objective optimization, problem-solving, model-building: Numerous studies (including observational ones) have attempted to rationalize the design process (as transforming some inputs to outputs). The problem has been modelled as search (based on Simon’s Sciences of the Artificial), constraint satisfaction (as in chess or the zebra puzzles), optimization (wherein a mathematical function is minimized/maximized), generic problem solving involving a generate-and-test approach, and even model-building (where the result of the design process is a model of how things should be). No one paradigm captures the full range of a design problem a priori. As the design problem evolves and new requirements are discovered, the approach to solving the problem evolves.
  • Understanding the role of Analysis in Design: A key aspect of design activity is the sense of “assembling” piecemeal solution snippets (components, ideas, concepts, repurposed previous solutions etc.) into a functional whole that addresses the specified context. During this phase, one may have to delve deeper into the “analysis” of a component. This analysis is underpinned by the scientific/technological knowledge amassed by the different sciences – applied and fundamental. Physics, chemistry, biology, mathematical/simulation models and more guide this analysis. It is important to note that, by definition, each analysis focuses on only one aspect of the final design. For example, in chip design, thermal analysis is separate from logic analysis; in UX design, analyzing the performance of an API call to the backend and its components is one such focused analysis.
  • Designing in relation to creativity, innovation, invention and discovery: During the process of design, the designer may have to “create” something new (from scratch – such as a new material, a new component or a new type of visual widget that did not exist before). The result of the creative activity is an “invention” – something that exists now for the first time. Alternatively, the designer may creatively repurpose an already existing invention (such as Velcro for shoe straps, or the Post-it) to meet the requirements of the existing context (with minor modifications to the existing invention). This “innovation” helps solve the problem. In a different scenario, for example while creating the new material (as in the foregoing example), one may “discover” something new about the reality of quantum mechanical laws – new states, new behaviors. The key thing about “discovery” is that the phenomenon already exists (implicitly or explicitly, in nature or in our man-made world); we uncover it and then exploit that principle to invent. There are many examples of serendipitous innovations and discoveries (such as penicillin). However, the task of inventing is far more rigorous and time-consuming, as exhibited by Edison and his filament bulb (which is slowly fading away). Furthermore, “innovation” is usually in the eye of the beholder (what you did not know before, you consider new!). In terms of increasing complexity – discovery, invention, innovation – all are aspects of “creativity” in design.
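The generate-and-test paradigm mentioned above can be shown in a toy form: propose candidate designs, test each against an objective, and keep the best so far. The objective function and the parameter space here are invented for illustration only.

```python
import random

def generate_and_test(score, generate, iterations=1000, seed=42):
    """Toy generate-and-test loop: generate candidates, test them, keep the best."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    best, best_score = None, float("-inf")
    for _ in range(iterations):
        candidate = generate(rng)  # "generate": propose a candidate design
        s = score(candidate)       # "test": evaluate it against the objective
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score
```

Real design problems rarely have such a clean scalar objective, which is exactly why no single paradigm captures the whole activity.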

Wherever man has “engineered” the world around him, a big element of design is required – though one may not recognize it as such. For example, the layout of a golf course has a major element of landscape design along with “gamification” – creating flow (as defined by Csikszentmihalyi) to challenge and engage the players and enthrall the viewers. The course designer is constrained by the laws/rules of golf, the available terrain, the “ability” of current golfers (physical limits and the technology behind golf equipment), the available budget, and finally his own creativity in laying out a course. The design variables include the length of each hole, the ordering of the holes, the hazards on each hole and more.

Just as a thought experiment, I thought I’d list some of the different areas of design which we run into on a daily basis in our modern lives:

  • Industrial design/Furniture design/Fashion/Kitchenware/Interior Design
  • Architecture/Urban design/Landscape design, Ergonomics design
  • Design in computer systems – UX design, UI design, Mobile design, Game design, Schema/data design, Language/compiler design, DB design, SW design, OS design, visualization
  • Publishing Layouts, New Mag formats, Font design, Ad formats/layouts
  • Design in Engineering – Civil engineering (buildings, bridges, traffic infrastructure), Power engineering, electronics (HW design, analog design, chip design), Chemical engineering (materials, metals and more), mechanical engineering (your auto propulsion systems, HVAC and more)
  • Healthcare/Pharma – Drug design, hospital design, designing repeatable surgical processes, diagnostics, treatment plans, physiotherapy regimens, nutrition plans
  • Social Engineering/design – As new modes of man-machine interaction appear, how do we engineer those – interleaving human cognitive and psychological attributes.
  • Retail design – Showrooms, shopping systems
  • Financial Engineering – new mortgage and investment products, payment systems
  • Materials design – nanomaterials, materials that work at extremes of temperature, materials that are compatible biologically etc.
  • Food design – modern foods with GMOs, processing foods for preservation and long distance delivery, new flavors, new packaging, new “dining” experiences, new recipes.
  • Designing Organizations – the large-scale corporations with hierarchies to nimble teams (lean startups), hoteling in everyday work, Work from home movements, social organizations and more
  • Design in Humanities/Creative Writing/Content creation – Even the process of content creation has basic templates (tropes and styles) that generate engagement and shape the way you go about delivering the experience

Every aspect of our life interacts with some designed/engineered artifact or process. As our modern lives get more complex, good design is essential to alleviate the travails of everyday living. One can imagine the amount of time wasted collectively because things are not designed the way they should have been (incomplete, or rather erroneous, on some dimension or other). We as consumers put up with it, hoping to harvest any benefit we can (even with Apple – imagine the number of adapters and connectors I have bought over the past few years!).

Designing for the modern consumer requires the “designer” to pay attention to so many aspects that it is quite daunting. Developing a sensibility for what the consumer would like, walking in their shoes, visualizing the product/process, and thinking holistically are hallmarks of a good designer. However, I believe it is doable, and the benefits are palpable for anyone who engages with the end product of the design activity. In future posts, I will talk about what it takes to become a good designer.



Big Data – Ponderings of a Practitioner – III

My earlier posts on this topic have taken a top down view – from model building to implementation. In this post, I attempt to understand the bottom up view of this world – the vendor landscape and tools in this space. Good summaries of this view are provided in:

a) Stonebraker’s post on Big Data – This takes the view – everything is data- It is large in volume, coming at great rates in realtime (velocity), and it is of different “types”/formats/semantics – variety. How do we support the conventional CRUD operations and more on this ever increasing dataset?
b) The vendor landscape as provided in a), b) and a nuanced view of b) in c). In a), the big data world is seen as a continuation of the conventional database evolution of the past three decades, extended to include unstructured and streaming data, video, images and audio. b) and c) view it from the positioning of different “tech” buckets – each focused on “improving” some aspect of an implementation.

c) The analytics services view: Every worthwhile real-world application has some or all of the pieces of this architecture. One can pick a variety of tools for each component (open source or proprietary), jig them in different ways, and use off-the-shelf tools such as R and SAS to analyze the data.

As I review the tools in this space, it is important to understand that these vendors’ value proposition is not to solve your “big data” problem – the one relevant to your business – but to sell tools. Only after resolving the issues from a top-down perspective can one even constrain the technology choices and evolve the final solution incrementally. Vendors do not know your domain or your final application – so they cannot be held responsible. Startups are evolving in this space adopting either the horizontal tech view (db tool, visualization tool, selling data) or the vertical view – solving a specific problem in a vertical, say marketing, advertising or wellness. For example, Google is a big data company dealing with advertising at scale (vertically oriented) – they built the big data toolkit to solve a vertical problem. Amazon’s recommender system is another application of big data at scale – for books (and later extended to other products).

Betting on a vertically oriented view has better odds, since the key to getting value out of big data is “model building”. Model-free approaches to big data – free-ranging analyses of data, tech investments without a well-bounded/specific purpose – are more or less bound to fail. Worthwhile/reliable models do not emerge out of the blue in any domain – they require work. The business advantage is that you get to exploit the benefits of the model till someone else figures it out – which is how all science/tech works. So the key question is: how does a tech leader run an “evaluative” project that provides some guidance on big data investments given limited resources? I will have some thoughts on this in future posts on this topic.

Big Data – Ponderings of a Practitioner – II

Continuing the vein of thought outlined in my earlier post on Big Data, the key issue to have in mind before jumping into big data is: are you collecting/analyzing data to build/improve a theory, or do you have a theory that guides what data to collect so as to validate the theory or its predictions? Does the data come first and the theory later, or vice versa? IMHO, as a practitioner, you need to go back and forth between the two views, as they are two sides of the same coin – one guides the other. Either view can be the starting point. Without Tycho Brahe’s data (the basis of the Rudolphine Tables), Kepler could not have formulated his laws. However, Tycho Brahe was guided in his astronomical observations (what data to collect/tabulate?) by the goal to disprove/modify/improve the Ptolemaic/Copernican theories of the solar system. (As an anecdote, a recent data source in the field of medicine is named Tycho.) However, one has to be careful about the context of the claims being made – is the result of the big data analysis exercise an improvement to current (or extant) theory (science – new theory), or to practice (engineering – better calculation/accuracy/analysis in a specific context)? This depends on the “generality” of the claim. These issues have been well discussed in the following resources (and they are worth the time spent reading, including the comment streams!):

  • The Chomsky/Norvig discussions a) and b). Though linguistics-centric, the discussions do address the big picture.
  • The Fourth Paradigm  – the book in memory of Jim Gray – discusses the interplay of big data and science
  • Beautiful Data – explores applications of big data

The data versus theory issues also have been discussed in the context of science and causality – Feynman provides a viewpoint in his lectures on Gravitation (See Section 7.7 – What is Gravitation?). Sometimes we do not need a machinery – as long as we can do more with the appropriate abstraction.

This issue of why one is pursuing big data has to be clear a priori, especially given the huge investments required in terms of big data skills, tools and other resources. In a way, every organization is going to have to set up an in-house applied R&D group. In the context of businesses, it is worthwhile to ask: are we doing enough with the “small” data that we have? As outlined in this recent HBR article, businesses can go a long way with whatever tech investments they have made thus far.

Wearing the data scientist’s hat, how do you evaluate whether a given problem is a worthwhile “big data” problem? Here are some key questions to guide the thought process:

Model/Data Centric:

  • What is the basic hypothesis for which analyzing more data will help? (For example: customer “shopping” behaviors are similar, so we collect enough data to validate this hypothesis and then predict a new, unseen customer’s shopping behavior.) Do you really need more data to make a decision? (Consider the classic newsboy problem from OR.) Maybe myopic rules work well enough in most dynamic situations!
  • Is the underlying phenomenon that generates this data stable and repeatable? Or is the data the amalgamation of multiple "generative" processes? (For example, traffic patterns during Thanksgiving are the result of more vehicles, heterogeneous drivers and a mixed vehicle fleet – old cars, new cars, trucks.) Can you break the final observed pattern into its causal components (the behaviors of each constituent mentioned above)? What are the units of analysis for which you need to collect data – longitudinally/temporally, spatially or otherwise? Do you have the data in-house? What needs to be acquired from other parties and validated? What needs to be collected in situ?
  • Is the theory that guided the "data collection" well developed? What was the objective behind the data collection effort that built the primary data source? (For example, census data is collected to know the number of people living in a country at a certain point in time. By the time the data is collected, people have been born and have died – so what percentage error does this process introduce into your actual estimate at a later date? We can use this census data to "derive" a number of other conclusions, such as growth rates in different age groups.) Furthermore, the objective introduces "biases" implicit in the data collection processes (issues such as over-generalization, loss of granularity etc.).
  • What is the effect of considering more data on the quality of the solution (possibly a prediction)? Does it improve or degrade the solution? How do we ensure that we are not doing all the hard work for no "real" benefit (though we may define metrics that show benefit!)? Censoring the data is critical so that one analyzes phenomena in the appropriate "regimes".
  • How do you capture the domain knowledge about the processes behind the “data”? Is it one person or a multi-disciplinary team? How do real processes in the domain affect the evolution of the data?
  • What does it mean to integrate multiple sources of data from within the domain and across domains? If across domains, what "valid" generalizations can one make? These are your potential hypotheses. If conclusions are drawn, what are the implications in each source domain? What are the relationships between the different "elements" of the data? Are the relationships stable, or do they change with time? When we integrate multiple sources of data, what are the boundaries – in representation, in semantics? How do we deal with duplicates, handle substitutes, and reconcile inconsistencies in source data or inferences? Related issues have been studied in the statistical context under multiple imputation.
  • What do summaries of the data tell you about the phenomena? Does the data trace collected cover all aspects of the potential data space, or is it restricted to one sub-space of the phenomena? How do you ensure you have collected (or touched) all aspects of this multi-dimensional data space? Are current resources enough? How do you know you have all the data required (even in terms of data fields)? What more will you need?
  • Overall, during the datafication phase, what are the goal(s) of the analysis? Are the goals aligned or in conflict? What kind of predictors are you looking for? How will one measure the quality of results? Can we improve incrementally over time?
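On the "do you really need more data?" question above: the classic newsboy (newsvendor) problem shows that a modest demand distribution, not big data, is often enough to make the decision. A minimal sketch – all prices, costs and probabilities below are hypothetical:

```python
# Newsvendor: choose order quantity q to maximize expected profit.
# Classic result: the optimal q is the smallest q with F(q) >= cu / (cu + co),
# where cu = underage cost (lost margin) and co = overage cost per unsold unit.

def newsvendor_quantity(demand_pmf, price, cost, salvage=0.0):
    """demand_pmf: dict {demand: probability}. Returns the optimal order quantity."""
    cu = price - cost          # profit lost per unit of unmet demand
    co = cost - salvage        # loss per unsold unit
    critical_ratio = cu / (cu + co)
    cumulative = 0.0
    for q in sorted(demand_pmf):
        cumulative += demand_pmf[q]
        if cumulative >= critical_ratio:
            return q
    return max(demand_pmf)

# Hypothetical demand distribution for a day's newspaper sales
pmf = {10: 0.2, 20: 0.3, 30: 0.3, 40: 0.2}
q = newsvendor_quantity(pmf, price=5.0, cost=2.0)   # critical ratio = 0.6 -> q = 30
```

The point is that a histogram of a few weeks of demand already pins down the decision; collecting vastly more data changes the answer very little.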

Computation/Resource Centric:

  • How many records are there to start with? (A record, for our purposes, is a basic unit of "data" – possibly with multiple attributes – e.g., a row in a database table.) What is changing about these records? How quickly? Is the change due to time? Space? A combination of the two? Other extraneous/domain-specific processes (such as a human user, a sensor or an active agent)?
  • When combining data from multiple sources, how do you "normalize" the data? Reconcile semantics and values/attributes? Test for consistency?
  • How do you capture the data? What new tools do you need? Is the "sampling" of the data adequate? (When you discretize the analog world, understanding the effects of doing so is essential.) For example, when you sample some time-varying phenomenon at 1-second intervals, you cannot say anything intelligent about phenomena that have a lifetime of less than 1 second and fall between two sample points. Similarly, the census is collected every ten years – are there population phenomena in the intervening decade that we miss? Do we even know what these are?
  • The engineering questions – Hadoop/map-reduce or regular DB alternatives? Home-grown tools? Batch mode versus real-time? Visualization (what to visualize and how)? Rate of data growth? What data needs to be stored versus discarded? How to merge/update new data and its dependent inferences? How to integrate the resultant actions (in a feedback loop)? How do you figure out whether the feedback is positive or negative?
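The sampling caveat above is the classic aliasing problem from signal processing: sampled at 1-second intervals, a 1.2 Hz phenomenon is indistinguishable from a 0.2 Hz one, so you cannot even tell they are different processes. A small stdlib-only illustration:

```python
import math

def sample(freq_hz, n_samples, interval_s=1.0):
    """Sample sin(2*pi*f*t) at fixed intervals."""
    return [math.sin(2 * math.pi * freq_hz * k * interval_s)
            for k in range(n_samples)]

fast = sample(1.2, 10)   # true frequency above the 0.5 Hz Nyquist limit
slow = sample(0.2, 10)   # its alias: 1.2 Hz - 1.0 Hz sampling rate = 0.2 Hz

# At this sampling rate the two signals produce identical samples.
aliased = all(abs(a - b) < 1e-9 for a, b in zip(fast, slow))
```

The same logic applies to a decennial census: any population dynamic faster than the ten-year sampling interval is simply invisible in the data.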

Process/Validation Centric:

  • How do we make sure the overall analysis is moving in the right direction? We need to ensure that the phenomenon under study is not changing by the time we analyze the data. How do we incrementally validate/verify the results/predictions? How do we checkpoint periodically?
  • What is the overall cost/benefit ratio of the whole analysis exercise? Is the benefit economically exploitable in a sustained manner?
  • How do we hedge and use the intermediary results for other “monetization” activities?

In the ensuing posts, I will discuss the overall process in the context of a project I am currently working on. Overall, investing in big data requires a good bit of upfront thought – or else every organization will end up with large R&D expenses without any real ROI!

Social Media related R&D problems

Over the past two and a half years, I have been working closely with social media data from Twitter, FB, Google+ and others. After building and using real-world systems that need to scale, reading a number of academic articles, and noodling with prototype systems, I am convinced that much remains to be done. Along the way, I gathered a working list of worthwhile R&D problems, which I'd like to share below.

Computational Linguistics/Semantics problems

How does "social media" language evolve? Twitter's 140-character restriction drives a certain type of conversation; FB's setup allows a different register; Tumblr is longer-form in nature; SMS and WhatsApp conversations are different again. How do these conversations look in personal versus professional contexts? Comment chatter on blogs and topical sites is different still (with an implicit theme/context). How do these affect language (the evolution of acronyms, social conventions – the back and forth, cues for follow-up etc.)? How do we detect these post facto or in real time? What is the nature of one-to-many conversations, and of "conversational threads", in these different registers? How does entity recognition evolve to process social media data? How can relationships, actions and events be detected from social media data? How can sentiment/tone analysis be improved? How does each of these look in different languages? How can you generate "social media" dialogues (moving beyond chatbots and the like)? How do you do supervised learning at scale? Which concepts can be learnt via labeling through crowdsourcing, and which cannot? How do you understand, aka infer, "context" – and what does "context" really mean? Are the current geo-labeling approaches enough? What kinds of sharing and response behaviors does one see, and how are they cued (in text)? Is behavior similar across languages and cultures (that is, do folks "chatter" the same way in English versus Spanish or in other cultures)? How does social media chatter leverage or enable "search" behavior? What if Twitter/FB had come up before Google for search? What behavioral/opinion changes can we detect from social chatter in space and time? Public opinion versus individual opinion?

Media Type problems

How do images/video/audio interplay with text? What is the role of image recognition/object ID/scene ID in the context of social conversations (aka Instagram photos or Pinterest)? How would we organize audio comments on Pandora? Stitch images into an animation? How would we cluster "images"? What kind of fingerprinting technologies can be used to "bundle" or group things together?

User-related problems

What kind of "profile"-based inferences can one make from public social profiles and chatter? What kind of "persona" discovery? How can we link identities across social channels (the entity-linking problem)? How can one link offline/online "profiles"? What kind of "accounts" are spam or spurious accounts? What kind of "chatter" is spam? What kind of chatter is generated by a bot? How can we use "geo" info to know more about a user, infer geo info about a user from their chatter, or provide content to a user based on their geo info? How does one generate "communities" – by similarity amongst users, and on what dimension(s)?

Content-related problems

How do we organize user-generated free-form content (UGC)? By topics (clustering and the like – how good are the approaches)? How do we link UGC with professionally created content? How do we use social chatter to guide "recommendations", and in what contexts? How do we categorize content? Generate taxonomies auto-magically, or update manually curated taxonomies at the leaves of a core structure?

Advertising Eco-system problems

What does it mean to advertise socially? Paid-media attribution problems – which channel worked for which kind of content, at what time, for what kind of user? What should the ad copy/message look like? How does social earned media interplay with paid media advertising – say, display advertising and PPC advertising? How should publishers promote content in social media? In which order? With what ad copy? Which snippet? How should they link different online/offline properties?


How do we identify, track, secure and protect shared content – paid/free text, images, video (digital rights, watermarking, analytics at scale)? How does "social" interplay with TV (aka Twitter for large-sample feedback on a show or ad)? Are the statistical inferences really valid (is Twitter really representative of the larger population)? How would you test for the same? What kind of surveys/experiments/probes can you run in real time to drive/guide an inference – automated design of experiments? How do we "simulate" Twitter/FB behavior (structurally and "dynamically")? What "dynamics" can we model in such systems? What kind of "actions" can one suggest from such simulations (power-law behaviors)? And there are the standard problems – spam detection, de-duplication, link duplication, content segmentation, summary generation – in the context of social media chatter.
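One way to probe the "is Twitter representative?" question above is a chi-square goodness-of-fit test comparing the demographic mix of a Twitter sample against census proportions. A stdlib-only sketch – all counts and proportions below are made up, and the critical value assumes 2 degrees of freedom at the 5% level:

```python
# Chi-square goodness-of-fit: does an observed sample match expected
# population proportions? All numbers here are hypothetical.

def chi_square_statistic(observed_counts, expected_proportions):
    total = sum(observed_counts)
    return sum((obs - total * p) ** 2 / (total * p)
               for obs, p in zip(observed_counts, expected_proportions))

# Hypothetical age buckets: 18-29, 30-49, 50+
census_proportions = [0.25, 0.40, 0.35]   # assumed population mix
twitter_sample     = [450, 400, 150]      # observed counts, n = 1000

stat = chi_square_statistic(twitter_sample, census_proportions)
# Critical value for 2 degrees of freedom at alpha = 0.05 is 5.991;
# a larger statistic means the sample is unlikely to be representative.
not_representative = stat > 5.991
```

With the (made-up) skew toward younger users above, the statistic is far past the critical value, which is the typical finding for social media panels.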


Big Data – Ponderings of a Practitioner – I

Having worn multiple hats related to dealing with data on a near-daily basis throughout my professional career – data modeling, schema/representation development, redesign for indexing and query performance, information extraction, feature extraction for machine learning etc. – I have been following the evolution of data modeling/analytics from just "data" to "big data" over the past two decades. Curious about the recent crescendo, I just finished reading two books addressed to laymen, business practitioners/CXOs, investors, TED attendees and folks looking to ride the next buzz (now that Twitter's IPO is done!) – and, just for sanity, re-read an old paper from quite a while ago.

These books are:

a) Nate Silver's The Signal and the Noise: Why So Many Predictions Fail – but Some Don't. (For those interested, a quick review from an eminent statistician.)

b) Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schonberger and Kenneth Cukier,

and the paper (still relevant after a decade) by Prof. Leo Breiman.

These books raised a number of related thoughts which I would like to share and follow up on in greater detail in future posts. Before getting to those, I thought I'd share a summary of the above material, framing some of the issues in the "Big Data" movement in terms of my own paradigm:

Big data is "old wine" in a new bottle. So what's changed? The belief is: we are producing/gathering data faster, hence we can drive the cycle of empiricism (aka the scientific method) faster. The idea is to reduce the time span to a) formulate theory, b) collect data, c) develop a model or models, d) predict/infer based on a model and test, and finally e) revise the beliefs in a), leading to the next cascade of activities. Why do we want to do this? Schonberger et al. posit that we want answers/benefits now; that since there is more data, we should do better (but on what?); and that we really do not need answers to the question "why" but just to "what is" (in which case, what does it mean to predict something when I can look it up in a table?). To build such a table (along multiple dimensions), I need to digitize/quantify/represent the world (aka datafication). Given data about different aspects of the world, we can combine, extend, reuse and repurpose it to support theories other than the one the data was originally collected for. However, it is user beware when one does this, because the models or inferences have not been tested under all possible usage contexts. Furthermore, there is a range of new "risks" – personal privacy violations (in the context of consumer data) and others (organizational intelligence, cyber-espionage) – that grow with how many touch points there are to the data, in how many contexts it is utilized, and who controls the data and its uses. Additionally, another key viewpoint raised is that we do not need "general theories" that apply to a large number of data points; a number of "small theories" that address small samples (defined contextually) is good enough. With the availability of computing technology, we can work with a number of small-sample theories to solve "small problems" which are large in number (hence big data!). When you really think about it, in the best (or is it worst?) case scenario, each "real-world" individual is a small sample – the grail of "personalization".

Silver, in his book, focuses on the process of using data – promoting a Bayesian worldview of updates on beliefs to distinguish the "signal" from the "noise". However, his discussions are framed in a frequentist perspective. To perform updates on beliefs, or even to discriminate signal from noise a priori, one needs a model – the baseline. He discusses the limitations of models in different domains, classified below. One axis indicates whether the primary domain models are deterministic (Newton's laws, the rules of chess) or inherently probabilistic; the other indicates the completeness of our current knowledge – do we have all the models for the phenomena in those domains?

  • Deterministic, complete domain knowledge: Analog – Weather; Digital – Chess
  • Deterministic, incomplete domain knowledge: Analog – Earthquakes; Digital – World War II events, terrorism events
  • Probabilistic/Statistical, complete domain knowledge: Digital – Finance, Baseball, Poker; Analog – Politics
  • Probabilistic/Statistical, incomplete domain knowledge: Digital – Economy, Basketball; Analog – Epidemics, Global Warming

(Here "Digital" is used to mean "quantized"; "Analog" refers to the physical world.)

Improving the models or making a prediction requires analysis of data in the context of the model. The book highlights the notion of "prediction" – saying something about the future. However, it sheds little light on "inference" – saying something new about the present, thus "completing one's incomplete knowledge". The sources of uncertainty also differ across models: a deterministic game like chess has uncertainty introduced by player behavior and the size of the search tree, whereas earthquake prediction is uncertain because we just do not know enough (our model is incomplete). The book makes no statements on the role of "Big data" per se in terms of tools (all the analysis in the book can be done with spreadsheets or in "R"). Furthermore, the book highlights the different types of analyst "bias" that may be introduced into inferences and predictions.
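The Bayesian worldview Silver promotes boils down to one mechanical step – posterior from prior and likelihood – applied repeatedly as evidence arrives. A minimal stdlib-only sketch, with made-up numbers, of updating the belief that an observed pattern is signal rather than noise:

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(hypothesis | evidence) via Bayes' rule."""
    numerator = prior * likelihood_if_true
    return numerator / (numerator + (1 - prior) * likelihood_if_false)

# Hypothetical: is a detected pattern signal or noise?
# Prior belief that it is signal: 5%.
# P(observing this pattern | signal) = 0.9; P(pattern | noise) = 0.1.
belief = 0.05
for _ in range(3):                      # three independent observations
    belief = bayes_update(belief, 0.9, 0.1)
# After three consistent observations, belief rises from 0.05 to ~0.97.
```

The baseline model enters through the two likelihoods: without a model of what signal and noise each look like, there is nothing to update with.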

In contrast to the books, the paper makes a case for a change in the approach to professional model building by statisticians: instead of picking the model and fitting the data, let the data pick the model for you. In practice, a professional statistician would search through the space of potential models (via a generate-and-test approach) and finally pick the appropriate one based on some criteria. Considering the premise of this paper, one can see a potential use-case for big data in the model-building lifecycle, reconciling the different discussions in the books.
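The "let the data pick the model" stance can be sketched as a generate-and-test loop over candidate model families, scored on held-out data. A minimal stdlib-only sketch – the two model families and the synthetic data are purely illustrative:

```python
import random

def fit_constant(xs, ys):
    """Model family 1: predict the training mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Model family 2: ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return lambda x: a * x + b

def holdout_mse(fit, xs, ys, split=0.7):
    """Fit on the first 70% of the data, score on the remaining 30%."""
    k = int(len(xs) * split)
    model = fit(xs[:k], ys[:k])
    held_x, held_y = xs[k:], ys[k:]
    return sum((model(x) - y) ** 2 for x, y in zip(held_x, held_y)) / len(held_x)

# Synthetic data: a linear trend plus Gaussian noise
random.seed(0)
xs = [float(i) for i in range(100)]
ys = [2.0 * x + 1.0 + random.gauss(0, 5) for x in xs]

candidates = {"constant": fit_constant, "linear": fit_linear}
scores = {name: holdout_mse(fit, xs, ys) for name, fit in candidates.items()}
best = min(scores, key=scores.get)   # here the data "picks" the linear model
```

Scaling this loop to many model families and much larger datasets is exactly where the big data tooling earns its keep in the model-building lifecycle.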

Will have more on this topic in future posts.

Addendum: See the latest article on this topic in CACM.