Over the past two and a half years, I have been working closely with social media data from Twitter, FB, Google+ and others. Having built and used real-world systems that need to scale, read a number of academic articles, and noodled with prototype systems, I find that much remains to be done. Along the way, I gathered a working list of worthwhile R&D problems, which I’d like to share below.
Computational Linguistics/Semantics problems
How does the “social media” language evolve? Twitter’s 140-character restriction drives a certain type of conversation. FB’s setup allows a different register. Tumblr posts tend to be longer. SMS and WhatsApp conversations are different again. How do these conversations look in personal versus professional contexts? Comment chatter on blogs and topical sites is different too (with an implicit theme/context). How do these affect language (evolution of acronyms, social conventions such as the back & forth, cues for follow-up, etc.)? How do we detect these post facto or in real time? What is the nature of one-to-many conversations and of “conversational threads” in these different registers?
How does entity recognition evolve to process social media data? How can relationships, actions and events be detected from social media data? How can sentiment/tone analysis be improved? How does each of these look in different languages? How can you generate “social media” dialogues (moving beyond chatbots and the like)?
How do you do supervised learning at scale? What can be learnt from labels gathered via crowdsourcing, and which concepts cannot be? How do you understand, or rather infer, “context”, and what does “context” really mean? Are the current geo labeling approaches enough?
What kinds of sharing and response behaviors does one see, and how are they cued in text? Is behavior similar across languages and cultures (that is, do folks “chatter” the same way in English as in Spanish or in other cultures)? How does social media chatter leverage or enable “search” behavior? What if Twitter/FB had come up before Google for search? What behavioral/opinion changes can we detect from social chatter in space and time? Public opinion versus individual opinion?
Media Type problems
How do images/video/audio interplay with text? What is the role of image recognition/object identification/scene identification in the context of social conversations (e.g. Instagram photos or Pinterest)? How would we organize audio comments on Pandora? Stitch images into an animation? How would we cluster “images”? What kinds of fingerprinting technologies can be used to “bundle” or group things together?
What kinds of “profile”-based inferences can one make from public social profiles and chatter? What kind of “persona” discovery is possible? How can we link identities across social channels (the entity linking problem)? How can one link offline and online “profiles”? Which accounts are spam or spurious? What kind of “chatter” is spam? What kind of chatter is generated by a bot? How can we use “geo” info to know more about a user, infer geo info about a user from their chatter, or provide content to a user based on their geo info? How does one generate “communities”: by similarity amongst users, and on what dimension(s)?
How do we organize user-generated free-form content (UGC)? By topics (clustering and the like; how good are the approaches)? How do we link UGC with professionally created content? How do we use social chatter to guide “recommendations”, and in what contexts? How do we categorize content? Do we generate taxonomies auto-magically, or update manually curated taxonomies on the leaves of a core structure?
Advertising Eco-system problems
What does it mean to advertise socially? Paid media attribution problems: which channel worked for which kind of content, at what time, for what kind of user? What should the ad copy/message look like? How does social earned media interplay with paid media advertising, say display advertising and PPC advertising? How should publishers promote content in social media? In which order? With what ad copy? With which snippet? How should they link different online/offline properties?
How do we identify, track, secure and protect shared content, whether paid/free text, images or video (digital rights, watermarking, analytics at scale)? How does “social” interplay with TV (e.g. using Twitter for large-sample feedback on a show or ad)? Are the statistical inferences really valid (is Twitter really representative of the larger population)? How would you test for that? What kinds of surveys/experiments/probes can you run in real time to drive or guide an inference, i.e. automated design of experiments? How do we “simulate” Twitter/FB behavior (structurally and “dynamically”)? What “dynamics” can we model in such systems? What kinds of “actions” can one suggest from such simulations (power law behaviors)? And then there are the standard problems, such as spam detection, de-duplication, link duplication, content segmentation and summary generation, in the context of social media chatter.
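To make one of these standard problems concrete, here is a minimal sketch of near-duplicate detection for short posts. It uses Jaccard similarity over word shingles, which is one common baseline and not an approach this list prescribes. The function names, the 3-word shingle size, the 0.5 threshold and the sample posts are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a post (k=3 is an assumed default)."""
    # Normalize: lowercase and drop URLs, which often differ between reposts.
    text = re.sub(r"https?://\S+", " ", text.lower())
    tokens = re.findall(r"[a-z0-9#@']+", text)
    if len(tokens) < k:
        # Very short posts become a single shingle (or nothing at all).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(posts, threshold=0.5, k=3):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    sets = [shingles(p, k) for p in posts]
    pairs = []
    for i in range(len(posts)):
        for j in range(i + 1, len(posts)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical sample posts, made up for this sketch.
posts = [
    "Big sale today at the store http://a.co/1",
    "big sale today at the store http://b.co/2",
    "How do images interplay with text",
]
print(near_duplicates(posts))
```

The pairwise loop is quadratic in the number of posts, so it only works for small batches. At social media scale one would more likely approximate the same similarity with MinHash signatures and locality-sensitive hashing.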