Ruminations on AI, ML, Data Science – where are we?

This is my first post after a long hiatus from active blogging. Having spent the past three-plus years running a tech team – all of product, data, and engineering – out of Bangalore, I have learnt much about how things are evolving in India, and I hope to share some of that on this thread. Over the past couple of weeks, I have spent some time looking at the evolution of ML, AI, and their sister disciplines over this time frame, just to baseline myself. A proxy indicator for the “popularity” of these terms over this period is shown in the chart below.

Overall, these trends suggest a lot of activity in the “enabling with AI” space across many application domains. One key thing to note is the downward trend in these graphs over the past quarter. Is this a leading indicator of things to come? It is too early to call. As things stand, everyone in tech is doing some sort of ML or data science, or wants to do something with data. We have also seen a great deal of venture funding and much innovation, entrepreneurial, and tech activity over this time frame.

First, what good things have come out during this period (perhaps a biased view on my part)?

  1. Deep Learning as a viable tool in the AI toolkit – lots of libraries; custom-built hardware, chips, and GPUs; a better understanding of where it works (image/video processing, speech, text) and where it does not; a much clearer focus on the really hard problems; the need for explainability; the compute costs of data-intensive AI; and so on. The key outcome of this journey is a better understanding of the complexities of building AI systems, even for R&D purposes.
  2. Improvements in Basic Modeling around Causality – For those who have been following the research on uncertainty and AI, the structuring and formalization of causality is a key first step in building complex reasoning systems. We finally have some scaffolding to reason with data and build more reliable “chunks” of knowledge; deterministic and probabilistic/data-driven knowledge may be combined effectively (a toy sketch of this idea follows this list).
  3. Automated Driving and other applications of AI – Folks in a variety of domains are adopting AI/machine intelligence, and more experimentation with the tools of AI builds awareness and experience of its complexities. Domains of application range from agriculture and computational (in-silico) biology to health care and many more.
  4. Improvements in hardware – The availability of advanced chips, custom compute/data-acquisition devices, sensors, chips for signal/image/speech processing, drone technologies, and cheaper/smaller/energy-efficient data and power storage has made many real-world problems tractable in an engineering sense. The investments required to build heterogeneous platforms to solve real-world problems are now reasonable, and AI systems can use these platforms for data/knowledge acquisition and for interaction with other systems and users. Improvements in components, communication devices, mechanisms, new materials, etc. are driving advances in different kinds of “robotic” – aka mechatronic – systems. What separates a robot from a “mechatronic” system? The boundary between the two categories is the amount of autonomy embedded in the system.
  5. AI software toolkit “commoditization” – Advances in AI toolkits, development libraries, and cloud infrastructure (AWS, GCP, Azure) have helped teams focus on building the intelligence. Software systems engineering is a minor issue in the basic proof-of-concept phase of the product development lifecycle; teams can focus on “capturing” and modeling the intelligence because the software has matured to a reliable point, and AI APIs let one at least evaluate the viability of an idea.
  6. Latent Commercial Potential – Embedding intelligence in any process promises a variety of economic and related benefits, for either a consumer or an enterprise. Under the rubric of AI for the Enterprise, a number of initiatives are being pursued, such as: 1) the advent of “chatbot” technology/platforms for intra- and inter-organizational activities; 2) advanced data exploration and visualization technologies, as enterprises focus on exploiting data for better decision-making; and 3) investments in big-data infrastructure and teams, as companies seek a better understanding of how to use AI technologies internal and external to the organization. Much of this has been driven by expanding consumer expectations (fueled by advanced voice/speech/video technologies) and by the availability of large amounts of proprietary data inside and outside organizations. Also, large-scale AI R&D has moved to quasi-commercial and commercial entities (such as the Allen Institute and others) and to Chinese private and public initiatives, rather than being purely academic.
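
To make item 2 concrete, here is a toy sketch of my own (all numbers and variable names are invented for illustration) of why the causal formalization matters: in a simple structural causal model with a confounder, the effect estimated from passive observation differs from the effect of an intervention do(X=x).

```python
# Illustrative sketch (mine, not from any specific system): a toy structural
# causal model. Observing X and intervening on X via do(X=x) give different
# answers whenever a confounder Z is present.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def simulate(do_x=None):
    # Z -> X, Z -> Y, X -> Y : Z confounds the X -> Y relationship.
    z = rng.normal(size=N)
    x = 0.8 * z + rng.normal(size=N) if do_x is None else np.full(N, do_x)
    y = 1.5 * x + 2.0 * z + rng.normal(size=N)
    return x, y

# Observational estimate of the X -> Y effect (biased by the confounder Z)
x_obs, y_obs = simulate()
slope_obs = np.cov(x_obs, y_obs)[0, 1] / np.var(x_obs)

# Interventional estimate: force X, then take the difference in mean outcomes
_, y0 = simulate(do_x=0.0)
_, y1 = simulate(do_x=1.0)
effect_do = y1.mean() - y0.mean()

print(f"observational slope   ~ {slope_obs:.2f}")  # ~2.5, inflated by Z
print(f"interventional effect ~ {effect_do:.2f}")  # ~1.5, the true causal effect
```

The observational slope absorbs the confounder’s influence, while the interventional contrast recovers the true effect; this gap is exactly what the causal scaffolding lets a system reason about explicitly.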

However, a few key issues are yet to be addressed in a concrete manner and are only slowly being understood. These are:

  1. Misunderstanding the Role of Data versus the Role of Knowledge in building Intelligent Systems – This is a very important problem. Current ML/deep-learning-driven approaches implicitly assume that, given enough data in any domain, “black-box” systems can be learnt. The primary motivation for the rise of ML-based systems was that “knowledge” need not be painfully encoded into a system but can be seamlessly learnt from data; since the system learns from data, it should be less brittle. Unfortunately, as folks are realizing, even data-driven systems are highly brittle – they are “knowledge weak” – and one does not even know where. So where do we go from here? How do we build systems that accommodate both, and evolve as each type of “knowledge” evolves? It is also important to remember that when we identify something as “data”, there is an implicit model behind it; even a “concept” such as an “address” carries a mental model – aka knowledge. Additionally, intelligence can be encoded in the “active agent”, in the environment, or in a combination of both. For example, laying guidance strips to help AGVs navigate the shop floor structures the environment, while building a more general-purpose AGV that uses LIDAR and sensors encodes the intelligence in the active agent. It is a priori unclear which of these choices is right in most domains.
  2. Complexity of Collecting Data and Ground Truth for different phenomena – Most supervised learning approaches need data collection at scale, along with labelled data at scale that is of reasonably high quality. Developing a viable approach to this is a highly iterative process. Crowd-sourcing approaches only help so much – they may be good for “common-sense” tasks but totally infeasible for “specialized” tasks where background knowledge is required. How does one know a priori which approach to take – purely data-driven versus knowledge-driven? What kind of “knowledge gathering” activities should one undertake?
  3. Engineering without the Science – the growing divide – Capturing intelligence from data, as models/rules etc., and validating it for stability, robustness, and so on is an extremely difficult problem. Engineering, as I mentioned, is the easy part. In most organizations, engineering progress can be shown with minimal effort, whereas data science/AI work takes time, and the organizational divide this engenders is a major issue. Without the intelligence of a domain codified, driving your product evolution on engineering metrics alone is counterproductive.
  4. Are we setting ourselves up for an AI winter, or an AI ice age? – With the amount of buzz in popular media around AI technologies and the huge investments being made, it is unclear whether the time horizons to solve these problems are being estimated properly. We have started seeing signs of major failures in some domains, but these are early days. Both optimists and pessimists abound, and the reality is somewhere in between. There is much we still do not know and have yet to discover: What is intelligence? How does it arise in biological systems? How is it to be captured and engineered in artificial systems? How is intelligence defined/modeled for an individual versus a group? How does intelligence evolve? And many more. As society embarks on this quest, it is important that we do not throw out the baby with the bathwater – in many of these situations we do not even recognize the baby. One thing is certainly clear: as things stand, we understand very little about “intelligence”, human or otherwise. Our understanding will evolve considerably over the coming years, and in the process we will build systems – some successful, some not so much.

The aforementioned issues have been around since the early days of AI (the 1950s), and the core questions – understanding the brain; modeling cognition (does it even take place only in the brain?); knowledge representation; whether one kind of knowledge is enough for a task; how to measure/evaluate intelligence – require further basic research. Progress in basic research has been slow because the problems are extremely difficult. Applied AI may be on the right path, but much engineering remains: many systems need to be built in many domains, and lessons learnt. This collective quest to build AI systems may finally drive us to focus on problems of real value and real impact in the long run. It is not yet too late to join the journey – every problem worth solving is still being looked at! More on these problems in upcoming blog entries.

Text analytics – a practitioner’s view – I

After nearly a four-month hiatus from blogging, I finally got around to writing the first in a series of posts on the state of the art in text analytics. I have been interacting with various practitioners in both industry and academia on this topic over the past year. The general “notion”, at least in industry, seems to be that text analytics is a nearly “solved” problem for most common commercial purposes (“Google does it” or “Netflix does it” or “FB does it” are common refrains!), and that open-source NLP codebases are good enough to build production-ready systems. IMHO, the field of text analytics is still wide open: many algorithms could be improved considerably, and much remains to be done. As a practitioner in this space for more than a decade, I share below some key observations as text analytics becomes a core component of a developer’s toolkit.

The key tasks, traversing up the hierarchy from “character”-level analysis to higher-level abstractions, are (a short sketch after the list illustrates a few of them):

  • Language encodings (UTF-8, ISO-8859, UTF-16, UTF-32 – knowing the encoding of the document is the first step!)
  • Language detection (different languages potentially use the same glyphs – Latin-script languages share glyphs, as do Cyrillic-script languages)
  • Spelling correction
  • Sentence boundary detection (When does a period signify end of sentence?)
  • Concordancing, Co-occurrence analyses
  • Tokenization
  • Lemmatization
  • Stemming
  • Part-of-Speech (POS) tagging
  • Word-sense disambiguation
  • Acronyms and other kinds of textual artifacts
  • Morphological analysis
  • Language modeling (n-gram analysis)
  • Corpus analyses, language evolution, empirical characterization
  • Chunking
  • Noun-phrase chunking, Verb-phrase chunking
  • Named Entity Recognition (NER)
  • Semantic Role Labeling
  • Relation extraction
  • Event extraction
  • Coreference Resolution (anaphora, cataphora, appositives)
  • Parsing – dependency, probabilistic CFG, clause detection, phrase analysis
  • Affect analysis – sentiment, sarcasm, metaphors and more, objective versus subjective/opinion classification
  • Intent analysis
  • Dialogue analysis
  • Semantic entailment
  • Categorization – text classification along many content/meta-data driven dimensions
  • Summarization (single document, paragraph, multi-document)
  • Author characterization based on text they create, different types of source analysis
  • Plagiarization, similarity detection, deception detection
  • Clustering – theme analysis and identification
  • Search, question answering, query expansion
  • Topic identification, topic tracking
  • Predicate analysis – document to graph (generating triples and other “semantic” structures)
  • Text synthesis – dialogue generation, response generation
  • Text translation – machine translation from one language to another
  • Text-image correlation
  • Text-to-speech and vice-versa (utterances introduce their own nuances)
  • Basic text data processing/data cleanup/normalization/standardization (generating data for mashups and more)
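
As a quick illustration of a few of the tasks above, here is a minimal sketch using spaCy – my choice for the example, not an endorsement; any comparable library would do. It assumes spaCy and its small English model are installed, and the sample sentence is made up:

```python
# Sketch covering sentence boundary detection, tokenization, lemmatization,
# POS tagging, NER, and noun-phrase chunking from the list above.
# Setup (assumption): pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Smith moved to Bangalore in 2015. He joined Acme Corp. as CTO.")

# Sentence boundary detection ("Dr." and "Corp." should not end a sentence)
for sent in doc.sents:
    print("SENT:", sent.text)

# Tokenization, lemmatization, and part-of-speech tagging
for token in doc:
    print(f"{token.text:10} lemma={token.lemma_:10} pos={token.pos_}")

# Named entity recognition
for ent in doc.ents:
    print("ENT:", ent.text, ent.label_)

# Noun-phrase chunking
print("NP chunks:", [chunk.text for chunk in doc.noun_chunks])
```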

The aforementioned list is not comprehensive, but it covers most of the core “NLP” tasks that form the basis of many commercial applications – recommender systems, social media analysis, document analysis, ontology development, the Semantic Web, and more. In many real-world contexts, these tasks need to be interleaved, possibly in different sequences, depending on the domain. Additionally, approaches to these tasks depend on the nature – aka “register” – of the text, such as:

a) written text versus spoken text
b) well-written editorials with good grammar versus badly written text, or text with many pictures (sketches/diagrams and other visual aids)
c) long documents (books, monographs, journals, legal cases, legal contracts, news reports) versus small documents (emails, blogs)
d) short noisy text – social commentary/conversations: social chatter, comment streams, command streams in emergency services, radio chatter
e) speech-to-text dialogs – voice complaints, voice mail, dialog transcriptions, closed-caption text (in video), dictations (physicians’ notes)

The text can also be in different languages, and some documents may mix languages. Each language has its own “grammatical” substrate – SVO (subject-verb-object) versus SOV word order and combinations therein, based on the underlying language family. Some languages have multiple written scripts (different sets of “glyphs”), such as Japanese kanji and kana. Even within one language, accents, lingo, and common idioms of use vary geographically and culturally (imagine the differences between English spoken in England, the US, India, and Australia). This adds another layer of complexity, and “real” natural language has many special cases affecting the semantic and pragmatic analysis of the text.
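
As a small illustration of how register changes even the lowest-level task, here is a hedged sketch comparing NLTK’s Treebank-style tokenizer with its tweet-aware tokenizer (it assumes NLTK with the “punkt” resource downloaded; the sample tweet is made up):

```python
# Sketch: the same tokenization task behaves differently per "register".
# Setup (assumption): pip install nltk; then nltk.download("punkt")
from nltk.tokenize import TweetTokenizer, word_tokenize

tweet = "@alice loooove this!! :-) #nlp"

# Treebank-style tokenizer: splits the handle, emoticon, and hashtag apart
print(word_tokenize(tweet))

# Tweet-aware tokenizer: keeps @handles, #hashtags, and emoticons intact
tt = TweetTokenizer(reduce_len=True)  # also squeezes "loooove" -> "looove"
print(tt.tokenize(tweet))
```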

For each of the aforementioned tasks, the most worthwhile approaches include the following:

  • building a domain/context-specific rule base/pattern base that identifies and tags the data appropriately; additional “knowledge” may be used, such as lexicons, pre-codified taxonomies, and more.
  • collecting labelled training data and then training “classifiers” (for both sequential and non-sequential data) – linear classifiers, SVMs, neural nets, and more. Labelled training data may range from simple structures such as tokens to complex tree/graph structures, and labelled data may also be used to identify patterns for encoding as rules.
  • hybrid combinations of the above, linked up in complex data-flow pipelines performing different transformations in sequence (a minimal sketch of such a hybrid follows).
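
Here is a minimal, hypothetical sketch of the hybrid approach – high-precision rules fire first, and a trained classifier handles everything the rules do not cover. The patterns, labels, and training snippets are all placeholders:

```python
# Hybrid "symbolic + statistical" pipeline: codified rules, then an ML fallback.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

RULES = [  # hypothetical high-precision patterns (the symbolic knowledge)
    (re.compile(r"\brefund\b|\bmoney back\b", re.I), "billing"),
    (re.compile(r"\bpassword\b|\blog ?in\b", re.I), "account"),
]

train_texts = ["shipment arrived broken", "invoice was wrong",
               "cannot update my profile", "package never came"]
train_labels = ["shipping", "billing", "account", "shipping"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)

def classify(text: str) -> str:
    for pattern, label in RULES:      # 1) symbolic pass: codified knowledge
        if pattern.search(text):
            return label
    return clf.predict([text])[0]     # 2) statistical fallback: learned model

print(classify("I want my money back"))  # rule fires -> "billing"
print(classify("my package is lost"))    # classifier -> likely "shipping"
```

In a production pipeline the same idea scales out: each stage transforms the text, and each stage can be rule-driven, model-driven, or both.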

A large number of open-source libraries and API services provide implementations of algorithms that address a range of the tasks outlined above; each uses one or more of the approaches listed above. I will reserve reviews of tasks and good libraries for future posts. The performance of these libraries is highly variable, depending on the tuning effort (knowledge, or supervised labelled data), and there is no one-size-fits-all. Performance is measured in terms of precision/recall and also throughput (in both batch and real-time settings). For example, a generic POS tagger trained on the Brown Corpus cannot be effective at tagging Twitter chatter. The libraries also vary in their underlying feature definition/management and data-pipelining infrastructure.

Getting value out of open-source libraries requires a sound strategy for how to select, use, and maintain these libraries inside your environment. The approach I usually follow (after learning from mistakes due to over-optimism and built-in biases!) is as follows:

  • Build a collection of test cases, with real data for your domain/problem of interest. These need to be reasonably large (covering at least the key “scenarios” of your application) and manually labelled, especially by the folks who will be using the libraries. Labeling may involve identifying both the “input” patterns and the “output” classes/categories, or whatever the target may be.
  • Formulate a pipeline of the above core tasks from the libraries under consideration and run the test data through these pipelines. Also define performance expectations: requests per second, documents processed per second, or, for interactive systems, round-trip times. Compare the resulting outputs with the manually labelled results to get a sense of how close or far off they are, identifying both expected behaviors and errors at every stage of the pipeline. Evaluate recall/precision metrics on your data versus what has been posted by the library developers (a minimal evaluation-harness sketch follows this list). Academic and research projects only have to establish the occurrence of a phenomenon, not deal with all cases; establishing generality empirically in system-building efforts is beyond their resources, so it behooves the commercial developer to really understand the boundaries of the libraries under consideration.
  • Analyze the experiment. Did we get things right because every stage of the pipeline worked? Do we need additional knowledge? Are there test cases that have not been covered? Does the library need additional resources? Do we need to tinker with the underlying code? Does the library fit the overall architectural requirements? Can we incrementally modify the processing pipelines (split, add, or modify new/old pipes)? Can we extend the core library algorithms (new features, new memory-manipulation techniques)? Does the underlying rule-base/pattern engine, or the optimization solver powering the machine learning algorithms, really scale? Are they built on well-known math libraries or home-grown codebases? Can we tune elements such as thread/process/memory allocations easily?
  • In most real-world systems worth their salt, one needs to combine both “symbolic” and “statistical” knowledge (for background, see the Chomsky/Norvig exchange). To support such a strategy, is one library enough, or do we need several? What kinds of knowledge sources can we integrate reasonably effectively? For the application at hand, based on the initial experiments, will more labelled data improve performance (because you cover more of the “hypothesis” space), or will more rules that generalize effectively? More “labelled” data wins in most use cases where the behaviors you are learning fall, on average, on the bell curve; if your application focuses on the tails, even obtaining enough data to learn from may be difficult. The heuristic I usually follow: i) lots of labelled data that is easy to label – use more data and a simple algorithm; ii) little data – you need more rules in order to generalize. Obtaining “quality” labelled data is not as easy as it is made out to be; by running crowdsourcing exercises on Amazon Mechanical Turk or other platforms, you may collect a lot of junk or redundant data. If your application is niche or your domain requires nuanced knowledge, you also have to plan carefully how to get manually labelled “gold standard” data, and consider how to incrementally obtain and integrate such data while tracking quality improvements.
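
To tie the above together, a skeletal version of the evaluation loop might look like the following. This is a sketch under assumptions: `my_pipeline` is a hypothetical stand-in for whichever library pipeline is under test, and the gold-labelled data is made up:

```python
# Run gold-labelled test cases through a candidate pipeline, then compare
# quality (precision/recall on YOUR data) and throughput against expectations.
import time
from sklearn.metrics import classification_report

def my_pipeline(text: str) -> str:   # hypothetical system under test
    return "positive" if "good" in text.lower() else "negative"

gold = [("the service was good", "positive"),
        ("terrible, never again", "negative"),
        ("good value for money", "positive"),
        ("it broke on day one", "negative")]

texts, y_true = zip(*gold)

start = time.perf_counter()
y_pred = [my_pipeline(t) for t in texts]
elapsed = time.perf_counter() - start

# Per-class precision/recall/F1 on your data, not the library's posted numbers
print(classification_report(y_true, y_pred))
print(f"throughput: {len(texts) / elapsed:.0f} docs/sec")
```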

Suffice it to say that most anecdotal claims about the performance, usability, and other aspects of these libraries may not hold for your own specific requirements – which is what you must focus on satisfying, or at least identify early enough in your development process to have a strategy in due course. Hopefully this short piece has outlined the key aspects of how to go about building/incorporating text analytics into your own applications. In future posts I will address related issues, for example: a) generating test data for comparing performance, b) pointers to useful resources for building text-analysis-centric systems, and c) a summary of recent research literature on the core tasks discussed above.