A recent HP Labs study examines how to predict the popularity of an article before it is published online or promoted via social media. The researchers were able to estimate ranges of popularity with an overall accuracy of 84% by considering only article content features.
In "The Pulse of News in Social Media: Forecasting Popularity," HP Labs researchers review previous studies, which used early measurements of an article's popularity to predict its future success, and then take a different approach:
"In the present work we investigate a more difficult problem, which is prediction of social popularity without using early popularity measurements, by instead solely considering features of a news article prior to its publication. We focus this work on observable features in the content of an article as well as its source of publication. Our goal is to discover if any predictors relevant only to the content exist and if it is possible to make a reasonable forecast of the spread of an article based on content features."
The study is written as an academic paper instead of an article, so it provides details about the methodology, previous works, data sets and scoring. The researchers gathered their news data from Feedzilla and measured the spread of the articles using Twitter. They determined "social popularity" by how many times a news URL was posted and shared on Twitter.
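As a rough illustration of that popularity measure, the sketch below counts how often each news URL appears in a collection of tweets and then maps the count to a coarse range. It is not the researchers' code: the function names, the range labels and the cutoffs of 20 and 100 are assumptions made up for this example, and the input is a plain list of URLs rather than real Twitter data.

```python
from collections import Counter


def count_shares(tweeted_urls):
    """Count how many times each URL appears in a list of tweeted URLs."""
    return Counter(tweeted_urls)


def popularity_range(share_count):
    """Map a share count to a coarse popularity range.

    The cutoffs (20 and 100) are hypothetical, chosen only for illustration.
    """
    if share_count <= 0:
        return "none"
    if share_count <= 20:
        return "low"
    if share_count <= 100:
        return "medium"
    return "high"


# Toy input: one entry per tweet that linked to an article.
tweets = [
    "http://example.com/tech-story",
    "http://example.com/tech-story",
    "http://example.com/health-story",
]

shares = count_shares(tweets)
for url, count in shares.items():
    print(url, count, popularity_range(count))
```

Predicting a range rather than an exact count matches how the study reports its result: accuracy in estimating ranges of popularity, not precise tweet totals.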
To determine an article's "features," the researchers considered the news source that generated and posted the article, the category of news the article fell under, the subjectivity of the article's language and named entities in the article.
"Our experiments show that it is possible to estimate ranges of popularity with an overall accuracy of 84% considering only content features," the researchers report. "Additionally, by comparing with an independent rating of news sources, we demonstrate that there exists a sharp contrast between traditional popular news sources and the top news propagators on the social web."
How to Be Popular
The study didn't provide any major bombshells. In a nutshell, the researchers concluded that the most significant feature of an article is its source, followed by its category. For example, a technology-related article on Mashable will perform better than a health-focused article will, but a technology-related article on a lesser-known website will perform worse than one on Mashable. But we all knew that already.
Whether an article contains named entities (celebrities, for example) or is written subjectively didn't help with the predictions. The researchers concluded that readers don't show a preference for subjective or objective language in news stories.
One interesting point the researchers did make is that articles that spread in "medium" numbers, rather than going viral, should not be overlooked, because those articles can reach highly interested and informed readers. Still, this isn't a major revelation. Although the report looks impressive with its detailed methodology, sources, figures and tables, it doesn't provide any "ah-ha" revelations that will go viral on Twitter.