Natural language processing and sentiment analysis have been popular artificial intelligence (AI) research topics for decades now.
Early sentiment analysis efforts were typically applied to significant bodies of text, like movie or book reviews. In today's intelligent digital workplaces, however, we are becoming entranced by the potential of AI chatbots and the use of AI to actively participate in human-led conversations.
Assessing Conversational Threads
Short, sharp, Twitter-like exchanges, however, provide much less material for AI engines to work with. This can result in unintended negative consequences, as the Microsoft Twitter chatbot found recently.
Perhaps we're still quite far from being able to welcome a chatbot into our day-to-day conversations, but would a less ambitious goal of assessing the sentiment contained in a discussion now be within AI’s grasp?
Using existing sentiment analysis techniques to assess conversational threads introduces some significant challenges.
Firstly, there is usually much less text, and what there is tends to be terse. Secondly, there is more than one speaker, so there is likely to be a mix of sentiments being expressed. Finally, there is context between speakers as they interact, which also has to be considered.
Because my team and I at SWOOP deal in enterprise-level conversations, our clients often raise the topic of sentiment analysis. In fact, it is something we have been monitoring for some time.
In this article, I'll share some initial findings from testing sentiment analysis solutions whose vendors have been brave or confident enough to offer an online evaluation facility.
Sentiment Analysis Evaluation Set Up
From my early experiments, it was clear that all of the offerings could characterize positive conversations reasonably well.
It is nice to know that our Enterprise Social Networking (ESN) is facilitating positively reinforcing and polite online conversations, but for many of our clients, it is the early detection of negative sentiments that would provide the most value.
My quest for some negative conversation threads drew a blank on our own ESN — we are just too nice to each other! I had, however, seen some great examples online during one country’s recent national election.
It didn’t take long to find a short conversational thread to use for my testing:
Eric: “arrogant cock. Fish outta water at the BI.”
Jim: “You have spoken some true words El, few truer that those. X”
Mary: “I think I said worse to his face ... Oh Dear...”
Why did I choose this thread? Well, firstly, I sense that most humans would have little trouble assessing the sentiment as totally negative.
Secondly, it contains colloquialisms, shorthand and misspellings that are typically found in online conversations between familiar participants.
Thirdly, I could see that there was a nuance here where Jim’s response to Eric was reinforcing the initial negative statement made by Eric — yet on its own, it is a positive statement.
My testing procedure, therefore, was to feed the whole thread into the different products and then retest them one statement at a time.
This is not an exhaustive market study. I only assessed those products that offered an immediate test facility. Here are the raw results, which I will discuss afterward:
| Product | Whole thread | Eric | Jim | Mary |
|---|---|---|---|---|
| Lexalytics | Negative (-0.6 strength) | Negative (-0.6 strength) | Neutral (0.0 strength) | Neutral (0.0 strength) |
| Microsoft Cognitive Services | Neutral (50%) | Slightly Negative (45%) | Positive (84%) | Negative (21%) |
| ParallelDots | Positive (97%) | Positive (76%) | Positive (96%) | Negative (3%) |
| TextAnalysisOnline | Slightly Negative (-0.08) | Neutral (0.0) | Slightly Positive (0.07) | Negative (-0.4) |
| They Say | Negative (0.668) | Negative (0.82) | Positive (0.776) | Negative (0.948) |
| Twinword | Negative (-0.099) | Negative (-0.37) | Neutral (0.03) | Positive (0.105) |

* Non-vendor research institution
Each of the vendors had a different way of assessing and valuing the degree of sentiment. Most rated sentiment on a scale from -1.0 to +1.0, or from 0 percent to 100 percent.
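To compare scores across products, it helps to put them on one scale first. The following is a minimal sketch, not any vendor's method: the function names are my own, and it assumes that on the percentage scales 50 percent means neutral, which is consistent with Microsoft labeling 50% as "Neutral" but is not something the vendors confirmed to me.

```python
def from_signed(score: float) -> float:
    """Scores already on a -1.0..+1.0 scale pass through, clamped to range."""
    return max(-1.0, min(1.0, score))


def from_percent(percent: float) -> float:
    """Map 0..100 percent onto -1.0..+1.0.

    Assumption: 50 percent is the neutral point of the percentage scale.
    """
    return max(-1.0, min(1.0, (percent - 50.0) / 50.0))


# Microsoft Cognitive Services' per-statement percentages from the results above.
microsoft = {"Eric": 45.0, "Jim": 84.0, "Mary": 21.0}
for speaker, percent in microsoft.items():
    print(f"{speaker}: {from_percent(percent):+.2f}")
```

Products such as They Say, which report a label plus what looks like a confidence value, would need a different mapping and are left out of this sketch.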
In terms of negative sentiment, I would have to say that nearly all performed poorly compared to my human assessment. However, when the sentiment was positive, as evidenced by Jim’s statement, they did reasonably well. The main points from my assessment are:
- Sentiment analysis works far better for positive than for negative sentiments
- None of the vendors explicitly addressed conversational context, i.e., recognizing that Jim’s statement was actually reinforcing a negative prior statement and was therefore, in reality, negative
- Aggregating separate statements into an overall assessment appears to be mostly, but not always, a simple average across all statements, i.e., with no recognition of chat context (to be fair, this would be a big challenge in itself)
- ‘Neutral’ zero-weight scores appear to be given when the engine is effectively saying “I’ve got no idea how to assess this”; and finally
- Negative slang terms like Eric’s are not really recognized or rated: Mary’s statement, while also negative, was rated much more negatively than Eric’s
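The averaging observation can be checked with simple arithmetic. The sketch below compares each product's whole-thread score with the plain mean of its three per-statement scores. It assumes the results are ordered whole thread, Eric, Jim, Mary, and the helper name `averaging_gap` is my own. They Say is omitted because its figures accompany a label rather than sitting on a signed scale.

```python
from statistics import mean

# (whole-thread score, [Eric, Jim, Mary]) as read from the raw results.
# Assumption: this column order, inferred from the discussion in the article.
RESULTS = {
    "Lexalytics": (-0.6, [-0.6, 0.0, 0.0]),
    "Microsoft Cognitive Services": (50.0, [45.0, 84.0, 21.0]),  # percent
    "ParallelDots": (97.0, [76.0, 96.0, 3.0]),                   # percent
    "TextAnalysisOnline": (-0.08, [0.0, 0.07, -0.4]),
    "Twinword": (-0.099, [-0.37, 0.03, 0.105]),
}


def averaging_gap(whole: float, parts: list[float]) -> float:
    """Difference between the reported thread score and a plain average."""
    return whole - mean(parts)


for name, (whole, parts) in RESULTS.items():
    gap = averaging_gap(whole, parts)
    print(f"{name}: reported {whole}, mean {mean(parts):.3f}, gap {gap:+.3f}")
```

Microsoft's three statement scores average to exactly 50 percent, matching its whole-thread result, while Lexalytics reports -0.6 for the thread against a statement average of -0.2, so it is evidently doing something other than averaging.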
Is There a Standout Winner?
I would hesitate to call an out and out winner, given the limited scope of my test and the mixed results achieved.
Also, no doubt, many vendors could improve their scores by tuning their solutions for conversational text, something we intend to explore further.
That said, They Say does have some appeal. Not only did their assessment have the closest fit with my own (accepting the shortcomings mentioned above), they also had the most colorful way of communicating their results.
I have also included the results from AI research pioneer Stanford University, not because of the actual result, but because I feel its approach has the most potential to be extended to work better in the "Chat" context.
Stanford’s use of "Sentiment Trees" linking words and concepts could be reframed to also connect statements within a chat thread, providing the "missing link" for sentiment analysis effectiveness for conversations.
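To make the idea of linking statements concrete, here is a purely illustrative sketch. It is not Stanford's method or any vendor's: the `Statement` class, the `agrees` flag and the scores are all hypothetical placeholders, and deciding whether a reply endorses its parent would itself need a classifier that I have not tested. The rule is simply that a reply agreeing with an earlier statement takes on that statement's polarity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statement:
    speaker: str
    standalone: float               # score of the text on its own, -1.0..+1.0
    reply_to: Optional[int] = None  # index of the earlier statement answered
    agrees: bool = False            # hypothetical: does it endorse the parent?


def contextual_scores(thread: list[Statement]) -> list[float]:
    """Re-score a thread so that agreement inherits the parent's polarity.

    Assumes statements are in order, so a parent always precedes its replies.
    """
    scores: list[float] = []
    for statement in thread:
        score = statement.standalone
        if statement.reply_to is not None and statement.agrees:
            parent = scores[statement.reply_to]
            if parent < 0:
                score = -abs(score)  # agreeing with a negative is negative
            elif parent > 0:
                score = abs(score)
        scores.append(score)
    return scores


# Placeholder scores for illustration only, not measured results.
thread = [
    Statement("Eric", -0.8),
    Statement("Jim", 0.8, reply_to=0, agrees=True),
    Statement("Mary", -0.9, reply_to=0),
]
print(contextual_scores(thread))
```

With these placeholder values, Jim's statement keeps its strength but flips to negative, which is how I read it as a human.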
Some Final Remarks
Despite the shortcomings identified in this article, especially with respect to negative sentiment, virtually all vendors offer a negative-to-positive scale and API access to their software, making it easy to access the technology.
In most cases, identifying relative sentiment can suffice. I would recommend running some tests of your own to see whether the sentiments provided are discriminating enough for your own tastes.