Aggregating Social Data Requires a Firehose #SMAS12

SMAS kicks off with some insight into the massive scale of social media data, and the difficult task of aggregating it. Smart folks at Attensity and Gnip are among those attempting to sift through and determine how much can be done by machines and how much relies on human intuition. They are making progress, but is the vast ocean of data too far ahead of us?

There are methods in existence to suck in social media data, shoot it through massive control centers loaded with algorithms and huge screens and attempt to produce some kind of meaningful analysis. These are useful, if not foolproof. They are an effort to incorporate a lot of complex factors, resulting in something inevitably speculative and incomplete. But the tools are improving. The data is there. How much of it can we absorb, interpret and analyze with our meager, limited minds? The truth is, not nearly all of it.

Taking It All In

Companies like Attensity and Gnip (apparently pronounced "gah-nip") are using a firehose -- a nifty name for an expensive tool that lets them pull in all of the available data. With reps here at SMAS offering a glimpse inside their operations, we've been given a rare and fairly technical view of how the aggregation of social media data can actually work. If it can.

Massive amounts of information are purchased, scraped and discovered. They are funneled through complicated systems that include futuristic-sounding stuff like "Natural Language Processing" (NLP), which can read live posts, then interpret and analyze them. Even assuming you've got the resources to harness that power and harvest that information, you've still got plenty to worry about.

People Are Funny; Robots Aren't

The problem with technology like that is that we don't really have it yet. That's why there's sentiment scoring, which basically means assigning a numerical value to the positive and negative feedback received through social media outlets, then doing some complicated math to determine how well you're doing.
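To make the idea concrete, here is a bare-bones sketch of sentiment scoring in Python. It is not how Attensity, Gnip or anyone else actually does it. The word lists, the function names and the sample posts are all made up for illustration, and the "complicated math" is reduced to a plain average.

```python
import re

# Hypothetical word lists, invented for this sketch; real systems use
# far larger lexicons or trained models.
POSITIVE = {"love", "great", "awesome", "good", "happy"}
NEGATIVE = {"hate", "terrible", "awful", "bad", "broken"}


def score_post(text):
    """Return +1 for each positive word and -1 for each negative word."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def average_sentiment(posts):
    """Average the per-post scores; 0.0 if there are no posts."""
    if not posts:
        return 0.0
    return sum(score_post(p) for p in posts) / len(posts)


# Made-up sample posts.
posts = [
    "I love this phone, the camera is great",
    "Terrible support, I hate waiting on hold",
    "Oh great, my flight is delayed again",
]

for post in posts:
    print(score_post(post), post)
print("average:", average_sentiment(posts))
```

Notice that the third post gets a positive score because it contains the word "great," even though any human reader would call it a complaint. A word counter this simple has no way of knowing the difference.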

No program can accurately gauge human sentiment, particularly when it comes to concepts like irony, sarcasm and just plain false information. It's also difficult to determine the impact and influence of any given post or tweet, or to consider the reliability and viability of the source. These are things we know intuitively, and may never be able to sort into algorithms. Unless we can all own one of those really smart Watson robots, like the one that was on (and kicked ass on) Jeopardy!

And It Never Ends

The number of sites producing useful, relevant data is always rising, and the amount of data on each site is constantly increasing. It never stops. Staying on top of it is a massive ongoing project and an operational nightmare. But it is also essential. There's no way this data just sits out there without someone figuring out how to use it. Companies like Attensity and Gnip are getting closer, but from what I've heard, they may still be feeling their way through the dark at this point.