Social Media related R&D problems

Over the past two and a half years, I have been working closely on social media data from Twitter, FB, Google+ and others. Having built and used real-world systems that need to scale, read a number of academic articles, and noodled with prototype systems, I believe much remains to be done. As I went through the process, I gathered a working list of worthwhile R&D problems, which I’d like to share below.

Computational Linguistics/Semantics problems

How does the “social media” language evolve? Twitter’s 140-character restriction drives a certain type of conversation. FB’s setup allows a different type of register. Tumblr posts are longer in nature. SMS and WhatsApp conversations are different again. How do these conversations look in personal versus professional contexts? Comment chatter on blogs and topical sites is different too (with an implicit theme/context). How do these affect language (evolution of acronyms, social conventions such as the back and forth, cues for follow-up, etc.)? How do we detect these post facto or in real time? What is the nature of one-to-many conversations and of “conversational threads” in these different registers?

How does entity recognition evolve to process social media data? How can relationships, actions, and events be detected from social media data? How can sentiment/tone analysis be improved? How does each of these look in different languages? How can you generate “social media” dialogues (moving beyond chatbots and the like)? How do you do supervised learning at scale? What can be learnt from crowdsourced labeling, and which concepts cannot be? How do you understand (that is, infer) “context”, and what does “context” really mean? Are the current geo-labeling approaches enough?

What kinds of sharing and response behaviors does one see, and how are they cued in text? Is behavior similar across multiple languages and cultures (that is, do folks “chatter” the same way in English as in Spanish, or in other cultures)? How does social media chatter leverage or enable “search” behavior? What if Twitter/FB had come up before Google for search? What behavioral/opinion changes can we detect from social chatter in space and time? What about public opinion versus individual opinion?

Media Type problems

How do images/video/audio interplay with text? What is the role of image recognition, object identification, and scene identification in the context of social conversations (e.g., Instagram photos or Pinterest)? How would we organize audio comments on Pandora? Stitch images into an animation? How would we cluster “images”? What kind of fingerprinting technologies can be used to “bundle” or group things together?

User-related problems

What kind of “profile”-based inferences can one make from public social profiles and chatter? What kind of “persona” discovery is possible? How can we link identities across social channels (the entity linking problem)? How can one link offline and online “profiles”? Which “accounts” are spam or spurious? What kind of “chatter” is spam? What kind of chatter is generated by a bot? How can we use “geo” info to know more about a user, infer geo info about a user from their chatter, or provide content to a user based on their geo info? How does one generate “communities”: by similarity amongst users, and on what dimension(s)?

Content-related problems

How do we organize user-generated free-form content (UGC)? By topic (clustering and the like), and how good are these approaches? How do we link UGC with professionally created content? How do we use social chatter to guide “recommendations”, and in what contexts? How do we categorize content? Can we generate taxonomies automagically, or update manually curated taxonomies at the leaves of a core structure?

Advertising Eco-system problems

What does it mean to advertise socially? There are paid media attribution problems: which channel worked for which kind of content, at what time, for what kind of user? What should the ad copy/message look like? How does social earned media interplay with paid media advertising, say display advertising and PPC advertising? How should publishers promote content in social media? In which order? With what ad copy? Which snippet? How should they link different online/offline properties?


How do we identify, track, secure and protect content sharing, whether paid/free text, images, or video (digital rights, watermarking, analytics at scale)? How does “Social” interplay with TV (e.g., using Twitter for large-sample feedback on a show or ad)? Are the statistical inferences really valid (is Twitter really representative of the larger population)? How would you test for the same? What kind of surveys/experiments/probes can you run in real time to drive or guide an inference, that is, automated design of experiments?

How do we “simulate” Twitter/FB behavior (structurally and “dynamically”)? What “dynamics” can we model in such systems? What kind of “actions” can one suggest from such simulations (power-law behaviors)?

Then there are the standard problems, such as spam detection, de-duplication, link duplication, content segmentation, and summary generation, in the context of social media chatter.
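
To make one of these standard problems concrete, here is a minimal sketch of near-duplicate detection for short messages, assuming word shingles compared by Jaccard similarity. This is just one common approach, not the method used in any particular system; the function names, the shingle size `k`, and the 0.6 threshold are illustrative assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a normalized message."""
    # Lowercase and drop URLs, since retweets often differ only in the link.
    text = re.sub(r"https?://\S+", " ", text.lower())
    words = re.findall(r"[a-z0-9#@']+", text)
    if len(words) < k:
        # Very short messages become a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(messages, threshold=0.6, k=3):
    """Keep the first message of each group of near-duplicates."""
    kept = []  # (message, shingle set) pairs retained so far
    for msg in messages:
        s = shingles(msg, k)
        # Keep the message only if it is not too similar to anything kept.
        if all(jaccard(s, other) < threshold for _, other in kept):
            kept.append((msg, s))
    return [m for m, _ in kept]
```

The pairwise comparison here is quadratic in the number of messages, so it only suits small batches; at the scale of millions of tweets one would typically swap in an approximate scheme such as MinHash with locality-sensitive hashing.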


Exploiting Social Media – Some lessons

The past 12 months have been a major learning experience as I transition from my recent social media startup. After a hiatus of nearly a year from blogging, I want to summarize some key lessons I learned while riding the downslide of the first social media wave. These include:

  • Marketers are still trying to figure out whether social media works (reliably, compared to other online channels). Selling tools and me-too products is just not enough.
  • Social media advertising is still a variant of the good old genre, “display advertising”, whether you call it promoted tweets, sponsored stories, etc. From a marketer’s perspective, it is easier to write another ad copy for a channel than to focus on costly, prospect-specific engagement.
  • Running social media campaigns effectively is a costly affair. You need a big Ops team to do anything sustainable.
  • A big question: Is social media advertising even worthwhile for folks on the long tail (in their category)? Is there even enough worthwhile volume of chatter for leads or otherwise?
  • Social media analytics (NLP, big data crunching, etc.) are quite error-prone and noisy. The signal-to-noise ratio (read Nate Silver) is quite low. To get real value, one really needs to do a lot of work. The big question is: is this all worthwhile?
  • Finally, mobile and social are quite conflated, as everyday consumers try to figure out the modalities of what works best for them. Marketers are still grappling with the question of which channel to put their spend on.

Answers to many of the above questions are anecdotal, with the end justifying the means. After processing nearly 100 million tweets (Twitter) and statuses (FB) in specific verticals over the past 24 months, I found that less than 10% of them have real economic value. These numbers vary by vertical, but social media has a long way to go to compete with display or search as marketing channels in terms of funnel volumes. Having hand-labelled nearly 100,000 tweets for our engines, I believe there are pockets of worthwhile chatter. Considering the origins of Twitter (the short, staccato language used by cops, emergency services, and dispatch operators), one wonders how much “semantic” content one can embed in these tweets. These streams are so awash with brand/publisher-driven chatter that finding worthwhile nuggets remains a difficult problem.

On the upside, I hope that as folks learn how to use and exploit this medium as publishers, advertisers and “readers”, the quality of content (UGC or professionally done) will mature, leading to some real value creation. With the introduction of video, audio and other content forms, newer applications of this channel remain to be discovered. Newer forms of engagement, content types and inter-platform integrations portend richer consumer experiences in the near future.