The accelerating pace of digital information means the world's aggregate data now doubles in ever-shorter intervals. According to Gartner, about 80 percent of the data held by an organization is unstructured, made up of information from customer calls, emails and social media feeds. Add to this the large volumes of diagnostic information logged by embedded and user devices. Analyzing even well-organized, structured data is difficult; making sense of unstructured data comes with its own challenges.

Organizations now have to study structured, semi-structured and unstructured data sets to arrive at meaningful business decisions, including determining customer sentiment, complying with e-discovery requirements and personalizing offers for their customers.

But while sifting through vast amounts of information can look like a lot of work, it comes with rewards.

By reading large, disparate sets of unstructured data, one can identify connections across seemingly unrelated sources and find patterns. What makes this method of analysis so effective is that it enables the discovery of trends that would otherwise stay hidden.

9 Steps to Extract Insight

There are nine steps to analyzing unstructured data so that one can see more than meets the eye:

1. Make sense of disparate data sources

Before you start, you need to know what sources of data are important for the analysis. Unstructured data sources may range from web logs to voice files to emails to chat transcripts to streaming videos. If the information being analyzed is only tangentially related to the topic at hand, it should be set aside. Only use information sources that are absolutely relevant.

2. Sign off on the method of analytics and find a clear way to present the results

If the end requirement is not clear, the analysis may be useless. Understand what sort of answer is needed -- a quantity, a trend, a cause-and-effect relationship or something else. In addition, provide a roadmap for what will be done with the results -- for example, feeding them into a predictive analytics engine before they are segmented and integrated into the business's information store.

3. Decide the technology stack for data ingestion and storage

Even though the raw data can come from a wide variety of sources, the results of the analysis must be placed in a technology stack or cloud-connected information store so that they can be easily used. The choice of data storage and retrieval technology often depends on the scalability, volume, variety and velocity requirements. A candidate technology stack should be evaluated carefully against the final requirements, after which the information architecture of the project can be set.

A few examples of key business requirements and the corresponding technology choices are:

  • Real time: It has become crucial for e-commerce companies to provide real-time quotes. This requires tracking activity as it happens and making offers based on the output of a predictive analytics engine. Technologies that can support this include Storm, Flume and Lambda.
  • High availability: This is crucial when ingesting information from social media. The technology platform must ensure that no data is lost from a real-time stream, so it is a good idea to buffer incoming messages in a queue such as Apache Kafka as part of a data redundancy plan (a minimal ingestion sketch follows this list).
  • Multi-tenancy: Another critical dimension is the ability to isolate the data and resources of different groups of users. An effective big data solution should support multi-tenancy natively; given the sensitivity of customer data and feedback and the criticality of the resulting insights, such isolation is often required to meet today's confidentiality requirements.
  • Unstructured web logs or security logs: These require flexible schemas to hold the data. HBase or Cassandra, with their flexible column families, could be explored.
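
To make the high-availability point concrete, here is a minimal sketch of buffering an incoming social media stream through Apache Kafka using the kafka-python client. The broker address, topic name and record structure are illustrative assumptions, not prescriptions.

    # Minimal ingestion sketch using the kafka-python client (pip install kafka-python).
    # Broker address, topic name and message fields are illustrative assumptions.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",           # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        acks="all",                                   # wait for full acknowledgement to avoid data loss
        retries=5,                                    # retry transient failures instead of dropping records
    )

    def buffer_social_post(post: dict) -> None:
        """Push a raw social media post onto a durable queue before any processing."""
        producer.send("raw-social-posts", value=post)

    buffer_social_post({"user": "anon", "text": "Great service today!", "ts": "2016-05-01T12:00:00Z"})
    producer.flush()  # ensure buffered records reach the brokers

Because the queue persists messages until downstream consumers acknowledge them, a slow or restarting analytics job does not translate into lost data.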

4. Keep information in a data lake until it has to be stored in a data warehouse

Traditionally, an organization obtained or generated information, sanitized it and stored it away. If the source was an HTML file, for example, the text might be extracted and the markup and metadata discarded, so information was lost on its way into the data warehouse. Anything useful that was stripped out in the initial load was gone for good, and the only analysis possible was whatever the remaining, trimmed-down data supported. The appeal of this approach was that the data sat in a clean, ready-to-use format that could be queried at any time. With the advent of big data, common practice is to do the opposite: in a data lake, information is stored in its native format until it is actually deemed useful and needed for a specific purpose, preserving metadata or anything else that might later assist the analysis.
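
A minimal sketch of the data lake idea: keep the payload byte-for-byte as received and record descriptive metadata alongside it, deferring any parsing until a concrete use emerges. The local directory layout and field names below are purely illustrative; in practice the lake would typically live on HDFS or an object store.

    # Sketch: land raw data in its native format with a metadata sidecar.
    # The directory layout and metadata fields are illustrative assumptions.
    import json, hashlib, datetime, pathlib

    LAKE_ROOT = pathlib.Path("datalake/raw")

    def land_raw(payload: bytes, source: str, content_type: str) -> pathlib.Path:
        """Store the payload untouched and write metadata next to it."""
        digest = hashlib.sha256(payload).hexdigest()
        target_dir = LAKE_ROOT / source / datetime.date.today().isoformat()
        target_dir.mkdir(parents=True, exist_ok=True)
        data_path = target_dir / f"{digest}.bin"
        data_path.write_bytes(payload)                # native format, nothing stripped
        meta = {"source": source, "content_type": content_type,
                "sha256": digest, "ingested_at": datetime.datetime.utcnow().isoformat()}
        (target_dir / f"{digest}.meta.json").write_text(json.dumps(meta))
        return data_path

    land_raw(b"<html><body>Customer feedback page...</body></html>", "web-crawl", "text/html")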

5. Prepare the data for storage

While keeping the original file in case you need it later is good practice, it is best to work on a cleaned-up copy. A text file can contain a lot of noise and shorthand that obscures valuable information. Cleanse noise such as extra white space and stray symbols, and convert informal strings into formal language. If the spoken language can be detected, tag each document with it. Remove duplicate records, treat missing values, and expunge off-topic information from the dataset.
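
A rough sketch of this clean-up stage in Python, using pandas and the langdetect package; the column names and the informal-to-formal lookup table are assumptions made for illustration.

    # Sketch of text clean-up: normalize noise, expand shorthand, tag language,
    # drop duplicates and missing values. Column names are illustrative assumptions.
    import re
    import pandas as pd
    from langdetect import detect

    SHORTHAND = {"u": "you", "thx": "thanks", "pls": "please"}   # assumed lookup table

    def clean_text(text: str) -> str:
        text = re.sub(r"[^\w\s.,!?']", " ", text)      # strip stray symbols
        text = re.sub(r"\s+", " ", text).strip()        # collapse white space
        words = [SHORTHAND.get(w.lower(), w) for w in text.split()]
        return " ".join(words)

    def detect_language(text: str) -> str:
        try:
            return detect(text)
        except Exception:                               # langdetect raises on empty/ambiguous input
            return "unknown"

    df = pd.DataFrame({"raw_text": ["thx  for the   quick reply!!", None, "Producto excelente, gracias"]})
    df = df.dropna(subset=["raw_text"]).drop_duplicates(subset=["raw_text"])
    df["clean_text"] = df["raw_text"].apply(clean_text)
    df["language"] = df["clean_text"].apply(detect_language)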

6. Retrieve useful information

Using natural language processing and semantic analysis, one can apply part-of-speech tagging and named-entity recognition to extract common entities such as "person," "organization" and "location," along with the relationships between them. From this, one can build a term frequency matrix to understand word patterns and flow in the text.
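
One way to sketch this step, assuming spaCy's small English model and scikit-learn are available; the example texts are invented.

    # Sketch: named-entity extraction with spaCy plus a simple term frequency matrix.
    # Requires: pip install spacy scikit-learn && python -m spacy download en_core_web_sm
    import spacy
    from sklearn.feature_extraction.text import CountVectorizer

    nlp = spacy.load("en_core_web_sm")

    texts = [
        "Maria Lopez called Acme Corp from Chicago about a billing error.",
        "Acme Corp shipped the replacement router to Chicago within two days.",
    ]

    # Named entities (and their types) per document.
    for doc in nlp.pipe(texts):
        print([(ent.text, ent.label_) for ent in doc.ents])   # e.g. PERSON, ORG, GPE

    # Term frequency matrix: rows are documents, columns are terms.
    vectorizer = CountVectorizer(stop_words="english")
    tf_matrix = vectorizer.fit_transform(texts)
    print(vectorizer.get_feature_names_out())
    print(tf_matrix.toarray())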

7. Ontology evaluation

Through analysis, one can then map the relationships among the sources and the extracted entities so that a structured database can be designed to specification. This can take time, but the insights it enables are worth the effort for an organization.
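
One lightweight way to capture these relationships before committing to a database schema is a property graph; the sketch below uses networkx, and the entity and relation names are invented for illustration.

    # Sketch: hold extracted entities and their relationships as a graph,
    # then review it to decide on tables and foreign keys. Names are illustrative.
    import networkx as nx

    ontology = nx.DiGraph()

    # (subject, relation, object) triples produced by the extraction step.
    triples = [
        ("Maria Lopez", "customer_of", "Acme Corp"),
        ("Maria Lopez", "located_in", "Chicago"),
        ("Acme Corp", "mentioned_in", "support_ticket_4812"),
    ]

    for subject, relation, obj in triples:
        ontology.add_edge(subject, obj, relation=relation)

    # Inspect candidate entity types and relations before designing the schema.
    for subject, obj, attrs in ontology.edges(data=True):
        print(f"{subject} --{attrs['relation']}--> {obj}")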

8. Statistical modeling, data and text mining

Once the database has been created, the data must be classified and segmented. It can save time to use supervised and unsupervised machine learning, such as k-means clustering, logistic regression, naïve Bayes and support vector machines. These techniques can uncover similarities in customer behavior, support campaign targeting and drive overall document classification. Customer disposition can be gauged with sentiment analysis of reviews and feedback, which helps inform product recommendations, reveal overall trends and guide the introduction of new products and services.
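
A compressed sketch of how two of the techniques named above might be applied to cleaned text with scikit-learn: k-means to segment documents and naïve Bayes as a simple sentiment classifier. The tiny labeled sample is invented; a real project needs a proper training set and evaluation.

    # Sketch: unsupervised segmentation (k-means) and supervised sentiment
    # classification (naive Bayes) over TF-IDF features. Data is invented.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.naive_bayes import MultinomialNB

    docs = ["delivery was late and support unhelpful",
            "great price and fast delivery",
            "app keeps crashing after the update",
            "love the new dashboard design"]
    labels = ["negative", "positive", "negative", "positive"]   # assumed labels

    vectorizer = TfidfVectorizer(stop_words="english")
    features = vectorizer.fit_transform(docs)

    # Segment documents into two clusters of similar topics/behavior.
    segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

    # Train a simple sentiment model on the labeled sample.
    sentiment_model = MultinomialNB().fit(features, labels)
    print(segments)
    print(sentiment_model.predict(vectorizer.transform(["support was unhelpful again"])))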

The most relevant topics discussed by customers can be surfaced through topic modeling techniques (including temporal variants that track how themes shift over time), which extract the topics or events that customers are sharing via social media, feedback forms or any other platform.
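
For the topic side, a minimal sketch with scikit-learn's LatentDirichletAllocation; the corpus, topic count and printed term count are illustrative.

    # Sketch: extract latent topics from customer feedback with LDA.
    # Corpus and topic count are illustrative assumptions.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    feedback = ["checkout page froze during payment",
                "payment failed twice at checkout",
                "delivery arrived two days late",
                "courier left the delivery at the wrong address"]

    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(feedback)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    terms = vectorizer.get_feature_names_out()
    for topic_idx, weights in enumerate(lda.components_):
        top_terms = [terms[i] for i in weights.argsort()[-3:][::-1]]
        print(f"Topic {topic_idx}: {', '.join(top_terms)}")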

9. Visualize, implement and measure impact

After all the above steps, it comes down to the end result, whatever that might be. It is crucial that the answers from the analysis are presented in tabular and graphical form, providing actionable insights to the end user of the resulting information. Render the information so that it can be reviewed on a handheld device or web-based tool, and so that the recipient can take the recommended actions in real or near-real time. Scientific implementation methods such as design of experiments (test and control), baselining and a process/continuous improvement framework hold the key to success. The final step is to measure the impact (ROI) -- in both hard (dollars) and soft (process efficiency and effectiveness, productivity improvements, etc.) terms.
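
A back-of-the-envelope sketch of the test-and-control and ROI ideas, assuming per-customer revenue figures for a test group that received the data-driven offers and a control group that did not; all numbers, including the cost figure, are invented.

    # Sketch: compare a test group against a control group and express the
    # uplift as ROI. All figures are invented for illustration.
    from statistics import mean
    from scipy import stats

    control_revenue = [102.0, 98.5, 110.2, 95.4, 101.1, 99.8]    # no targeted offers
    test_revenue    = [111.3, 108.9, 119.5, 104.2, 113.0, 109.4]  # received targeted offers

    uplift_per_customer = mean(test_revenue) - mean(control_revenue)
    t_stat, p_value = stats.ttest_ind(test_revenue, control_revenue)

    campaign_cost = 4.0                       # assumed cost per targeted customer
    roi = (uplift_per_customer - campaign_cost) / campaign_cost

    print(f"uplift per customer: {uplift_per_customer:.2f} (p={p_value:.3f})")
    print(f"hard ROI per customer: {roi:.1%}")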

Aim for the 360 Degree View

The real value lies in combining structured, semi-structured and unstructured data analysis for a 360 degree view. A case in point: while structured data analysis can predict customer behavior, unstructured data analysis can reveal the reasons behind that behavior. New information sources such as social media and machine logs have proved crucial to organizations because, properly analyzed, they provide unique content and diagnostic intelligence.

Traditional data scientists will have to acquire new skill sets to analyze unstructured data. As enterprises develop content intelligence capabilities, the real power lies in fusing different data formats -- overlaying structured data with semi-structured and unstructured sources -- for insight into the mind of a user or the life of a device.