Text analytics – a practitioner’s view – I

After a nearly four-month hiatus from blogging, I finally got around to writing the first in a series of posts on the state of the art in text analytics. I have been interacting with various practitioners in both industry and academia on this topic over the past year. The general “notion” (at least in industry) seems to be that text analytics is a nearly “solved” problem, at least for most common commercial purposes (“Google does it” or “Netflix does it” or “FB does it” are some common refrains!). The belief seems to be that open-source NLP codebases are good enough to build production-ready systems. IMHO, the field of text analytics is still wide open: many algorithms could be improved considerably, and much remains to be done. As a practitioner in this space for more than a decade, I share below some key observations as text analytics becomes a core component of a developer’s toolkit.

The key tasks – traversing up the hierarchy from “character”-level analysis to higher-level abstractions – are (a minimal sketch stringing a few of them into a pipeline follows the list):

  • Language encodings (UTF-8, ISO-8859-1, UTF-16, UTF-32 – knowing the encoding of the document is the first step!)
  • Language detection (different languages potentially use the same glyphs – Latin-script languages share glyphs, as do Cyrillic-script languages)
  • Spelling correction
  • Sentence boundary detection (When does a period signify end of sentence?)
  • Concordancing, Co-occurrence analyses
  • Tokenization
  • Lemmatization
  • Stemming
  • Part-of-Speech (POS) tagging
  • Word-sense disambiguation
  • Acronyms and other kinds of textual artifacts
  • Morphological analysis
  • Language modeling (n-gram analysis)
  • Corpus analyses, language evolution, empirical characterization
  • Chunking
  • Noun-phrase chunking, Verb-phrase chunking
  • Named Entity Recognition (NER)
  • Semantic Role Labeling
  • Relation extraction
  • Event extraction
  • Coreference Resolution (anaphora, cataphora, appositives)
  • Parsing – dependency, probabilistic CFG, clause detection, phrase analysis
  • Affect analysis – sentiment, sarcasm, metaphors and more, objective versus subjective/opinion classification
  • Intent analysis
  • Dialogue analysis
  • Semantic entailment
  • Categorization – text classification along many content/meta-data driven dimensions
  • Summarization (single document, paragraph, multi-document)
  • Author characterization based on the text they create, and other types of source analysis
  • Plagiarism detection, similarity detection, deception detection
  • Clustering – theme analysis and identification
  • Search, question answering, query expansion
  • Topic identification, topic tracking
  • Predicate analysis – document to graph (generating triples and other “semantic” structures)
  • Text synthesis – dialogue generation, response generation
  • Text translation – machine translation from one language to another
  • Text-image correlation
  • Text-to-speech and vice-versa (utterances introduce their own nuances)
  • Basic text data processing/data cleanup/normalization/standardization (generating data for mashups and more)
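
To make a few of these tasks concrete, here is a minimal sketch that strings sentence boundary detection, tokenization, lemmatization, POS tagging and NER into one pass. It assumes spaCy and its small English model (en_core_web_sm) – an assumption for illustration, not an endorsement; any comparable library would serve.

```python
# Minimal sketch: a handful of the tasks above in one pass.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # tokenizer, tagger, lemmatizer, parser, NER

text = "Dr. Smith moved to London in 2019. She joined Acme Corp. as CTO."
doc = nlp(text)

# Sentence boundary detection (note the non-terminal periods in "Dr." and "Corp.")
for sent in doc.sents:
    print("SENTENCE:", sent.text)

# Tokenization, lemmatization and POS tagging
for token in doc:
    print(f"{token.text:12s} lemma={token.lemma_:12s} pos={token.pos_}")

# Named Entity Recognition
for ent in doc.ents:
    print("ENTITY:", ent.text, ent.label_)
```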

The aforementioned list is not comprehensive, but it addresses most of the core “NLP” tasks that form the basis of many commercial applications – recommender systems, social media analysis, document analysis, ontology development, the Semantic Web and more. For many real-world contexts, several of these tasks need to be interleaved (possibly in different sequences), depending on the domain. Additionally, approaches to these tasks depend on the nature, or “register,” of the text, such as:

a) written text versus spoken text
b) well-written editorials with good grammar versus badly written text, or text interleaved with many pictures (sketches/diagrams and other visual aids)
c) long documents (books/monographs/journals/legal cases/legal contracts/news reports) versus short documents (emails, blogs)
d) short, noisy text – social commentary/conversations: social chatter, comment streams, command streams in emergency services, radio chatter (see the tokenizer sketch after this list)
e) speech-to-text dialogs – voice complaints, voice mail, dialog transcriptions, closed-caption text (in video), dictations (physicians’ notes)
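
A small illustration of why register matters: the same noisy social-media string run through a general-purpose tokenizer and one tuned for social chatter. This sketch assumes NLTK (plus the punkt data needed by word_tokenize); the message itself is made up.

```python
# Sketch: one noisy string, two tokenizers with different assumptions about register.
# Assumes: pip install nltk, plus nltk.download("punkt") for word_tokenize.
from nltk.tokenize import TweetTokenizer, word_tokenize

noisy = "@ops_center plz send unit 2 Main St!! :-) #emergency http://t.co/xyz"

print(word_tokenize(noisy))              # tends to split the emoticon, handle and URL apart
print(TweetTokenizer().tokenize(noisy))  # keeps @handles, #hashtags and :-) as single tokens
```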

Furthermore, the text can be in different languages (or some documents can mix languages). Each language has its own “grammatical” substrate – SVO (subject-verb-object) versus SOV, OVS and other word orders – based on the underlying language family. Some languages have multiple written scripts (different sets of “glyphs”), such as Japanese kanji and kana. And within the same language, accents, lingo, common idioms of use and such vary geographically and culturally (imagine the differences between English as spoken in England, the US, India and Australia). This adds another layer of complexity to the task. “Real” natural language also has many special cases affecting the semantic and pragmatic analysis of the text.
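
On the script point, a rough way to spot mixed-script text is to bucket alphabetic characters by the script prefix of their Unicode names. This is only a sketch (proper script/language detection is more involved), and the sample string is invented.

```python
# Rough sketch: per-script character counts via Unicode character names.
import unicodedata
from collections import Counter

def script_histogram(text):
    """Count alphabetic characters by the first word of their Unicode name
    (e.g. LATIN, CYRILLIC, CJK, HIRAGANA)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            try:
                counts[unicodedata.name(ch).split()[0]] += 1
            except ValueError:  # character without a name
                counts["UNKNOWN"] += 1
    return counts

print(script_histogram("Tokyo 東京 とうきょう Москва"))
# -> counts for LATIN, CJK, HIRAGANA and CYRILLIC in one string
```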

For each of the aforementioned tasks, the most worthwhile approaches include the following (a minimal sketch of a hybrid setup follows the list):

  • Building a domain/context-specific rule base/pattern base that identifies and tags the data appropriately. Additional “knowledge” may be used, such as lexicons, pre-codified taxonomies and more.
  • Collecting labelled training data and then training “classifiers” (for both sequential and non-sequential data) – linear classifiers, SVMs, neural nets and more. Labelled training data may be collected for anything from simple structures such as tokens to more complex tree/graph structures. Labelled data may also be used to identify patterns for encoding as rules.
  • Hybrid combinations of the above, linked up in complex data-flow pipelines that perform different transformations in sequence.
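
Below is a minimal sketch of the hybrid flavour: a tiny hand-built lexicon/pattern base tags domain terms, while a statistical classifier trained on labelled examples handles categorization. It assumes scikit-learn; the lexicon, labels and training snippets are toy data, not a real domain model.

```python
# Hybrid sketch: rule/pattern tagging plus a trained text classifier.
# Assumes: pip install scikit-learn. All data below is illustrative only.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1) Rule/pattern base: a hand-built lexicon compiled into one regex.
LEXICON = {"outage": "INCIDENT", "refund": "BILLING", "invoice": "BILLING"}
PATTERN = re.compile(r"\b(" + "|".join(LEXICON) + r")\b", re.IGNORECASE)

def rule_tags(text):
    """Return (surface form, tag) pairs for lexicon hits."""
    return [(m.group(0), LEXICON[m.group(0).lower()]) for m in PATTERN.finditer(text)]

# 2) Statistical side: labelled examples -> category model.
train_texts = [
    "the service had an outage all night",
    "please refund my last invoice",
    "cannot log in to my account",
    "charged twice on my card",
]
train_labels = ["INCIDENT", "BILLING", "INCIDENT", "BILLING"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)

# 3) Combine in a pipeline: rules give precise, explainable tags; the model generalizes.
msg = "my invoice looks wrong and the portal seems down"
print("rule hits :", rule_tags(msg))
print("model says:", clf.predict([msg])[0])
```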

A large number of open-source libraries and API services provide implementations of algorithms that address a range of the tasks outlined above. Each of them uses one or more of the approaches listed above. I will reserve reviews of specific tasks and good libraries for future blog posts. Performance of these libraries is highly variable – it depends on the tuning effort (knowledge or supervised labelled data) – and there is no one-size-fits-all. Performance is measured in terms of precision/recall and also throughput (in both batch and real-time settings). For example, a generic POS tagger trained on the Brown Corpus will not be effective at tagging Twitter chatter. The libraries also vary in their underlying feature definition/management and data-pipelining infrastructure.
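
On the measurement side, the comparison boils down to scoring a library’s predictions against your own hand-labelled gold answers. A hedged sketch using standard precision/recall/F1 follows; it assumes scikit-learn, and the label sequences are invented for illustration.

```python
# Sketch: scoring a library's output against hand-labelled gold data.
# Assumes: pip install scikit-learn. Labels below are illustrative only.
from sklearn.metrics import classification_report

gold      = ["PER", "O", "ORG", "O", "LOC", "ORG", "O", "PER"]    # your annotations
predicted = ["PER", "O", "O",   "O", "LOC", "ORG", "PER", "PER"]  # library output

# Per-class precision/recall/F1 – compare these numbers with what the
# library's authors report on their own benchmark corpora.
print(classification_report(gold, predicted, zero_division=0))
```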

Getting value out of open-source libraries requires a sound strategy for selecting, using and maintaining these libraries inside your environment. The approach that I usually follow (after learning from mistakes due to over-optimism and built-in biases!) is as follows:

  • Build a collection of test cases with real data for your domain/problem of interest. The collection needs to be reasonably large (covering at least the key “scenarios” of your application) and manually labelled – ideally by the folks who will be using the aforementioned libraries. Labeling may involve identifying both the “input” patterns and the “output” classes/categories, or whatever the target structure may be.
  • Formulate a pipeline of the above core tasks from the libraries under consideration and run the test data through these pipelines. Also define performance expectations in terms of requests per second, documents processed per second or, for interactive systems, round-trip times. Compare the resulting outputs with the manually labelled results (a minimal evaluation-harness sketch follows this list). Get a sense of how close or how far off the results are, identifying both expected behaviors and errors at every stage of the pipeline. Also evaluate recall/precision metrics on your data versus what has been posted by the library developers. Academic and research projects only have to establish the occurrence of a phenomenon, not deal with all cases; establishing generality empirically in system-building efforts is beyond their resources, so it behooves the commercial developer to really understand the boundaries of the libraries under consideration.
  • Analyze the experiment. Did we get things right because every stage of the pipeline worked? Do we need additional knowledge? Are there additional test cases that have not been covered? Does the library need additional resources? Do we need to tinker with the underlying code? Does the library fit in with the overall architectural requirements? Can we incrementally modify the processing pipelines (split, add, modify new/old pipes)? Can we extend the core library algorithms (new features, new memory-manipulation techniques)? Does the underlying rule-base/pattern engine or the optimization solvers powering the machine learning algorithms really scale? Are they built on well-known math libraries or home-grown codebases? Can we tune elements such as thread/process/memory allocations easily?
  • Most real-world systems worth their salt need to combine both “symbolic” and “statistical” knowledge (for background, see the Chomsky/Norvig exchange). To support such a strategy, is one library enough, or do we need multiple? What kinds of knowledge sources can we integrate reasonably effectively? For the application at hand, based on the initial experiments, is more labelled data going to improve performance (because you cover more of the “hypothesis” space), or more rules that generalize effectively? More labelled data wins in most use cases where the behaviors you are learning about fall, on average, within the bell curve; if your application focuses on the tails, even obtaining enough data to learn from may be difficult. The usual heuristic: i) if labelled data is plentiful or easy to obtain, favor more data and a simpler algorithm; ii) if data is scarce, you need more rules that generalize. Obtaining “quality” labelled data is not as easy as it is made out to be – by running crowdsourcing exercises on Amazon Mechanical Turk or other platforms, you may collect a lot of junk or redundant data. If your application is niche or your domain requires nuanced knowledge, you also have to plan carefully for how you will get manually labelled “gold”-standard data. One should also consider how to incrementally obtain and integrate such data and track quality improvements.
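
To tie the checklist together, here is a rough sketch of the kind of gold-standard harness implied above: a handful of hand-labelled cases, a placeholder extract_entities() function (hypothetical – substitute whatever pipeline you are actually evaluating), and set-based precision/recall over the expected outputs.

```python
# Sketch of a gold-standard evaluation harness for an extraction pipeline.
# extract_entities() is a hypothetical stand-in for the pipeline under test.

GOLD_CASES = [  # in practice, load these from a versioned JSON/CSV file
    {"text": "Acme Corp. hired Jane Doe in Boston.",
     "expected": {"Acme Corp.", "Jane Doe", "Boston"}},
    {"text": "The outage affected London customers.",
     "expected": {"London"}},
]

def extract_entities(text):
    """Placeholder: call your actual library/pipeline here."""
    return set()  # worst case, for illustration

tp = fp = fn = 0
for case in GOLD_CASES:
    predicted = set(extract_entities(case["text"]))
    tp += len(predicted & case["expected"])
    fp += len(predicted - case["expected"])
    fn += len(case["expected"] - predicted)

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall    = tp / (tp + fn) if (tp + fn) else 0.0
print(f"precision={precision:.2f} recall={recall:.2f} over {len(GOLD_CASES)} cases")
```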

Suffice it to say that anecdotal claims about the performance, usability and other aspects of these libraries may not hold for your own specific requirements – which is what you must focus on satisfying, and identify early enough in your development process to have a strategy in place. Hopefully, this short piece has outlined key aspects of how to go about building/incorporating text analytics into your own applications. In future posts, I will address related issues such as: a) generating test data for comparing performance, b) pointers to useful resources for building text-analysis-centric systems, and c) a summary of recent research literature on the core tasks discussed above.