Bill Inmon recently wrote: "For the most part, the worlds of structured data and unstructured data operate as if they were in a vacuum. With a few exceptions, there is no bridge or interface between the two worlds. However if a bridge between the two worlds can be created, it is possible to build entirely new kinds of systems."

We suggest that this bridge involves using "semi-structured" data alongside structured data to enable pervasive business intelligence quickly and easily, with far less complexity and cost. It is widely recognized that some data does not fall into the more easily defined categories of structured and unstructured data, but there is no equally clear consensus as to what comprises "semi-structured data." Many people mistakenly assume that semi-structured data is just another term for XML. Our view is that semi-structured data goes well beyond XML to include a far more plentiful and far more common source.

What is Semi-Structured Data?

We define it as business-relevant data that does not follow a fixed schema. Taken as a whole, it follows an overall implicit structure, though it may contain some irregular structures, and the data is either self-describing -- such as through the presence of labels or headings -- or readily deducible. Under this definition, the primary source of semi-structured data is not XML but, by far, existing reports and business documents published from enterprise information systems, both within the organization and provided to the organization by external sources. Reports present data in human-readable format. In fact, reports and business documents overwhelmingly exceed XML-coded content in the workplace, as evidenced by unabated paper consumption for the past several years.

Why is Semi-Structured Data Important?

The prevalence of reports as a source of semi-structured data has several implications, not the least of which is cost: the savings from leveraging existing reports as a live data source are compelling. In fact, Gartner estimates that up to 40 percent of an organization's typical programming budget is spent simply to perform data extraction and combine data from disparate sources. With such a massive collection of data, it is clearly in the best interest of the organization to capture the value of this underlying resource. But how? Virtually every existing data Extraction, Transformation and Loading (ETL) tool and enterprise reporting solution relies heavily on "structured" data sources, such as "raw" data within production databases. These solutions ignore semi-structured data, particularly the data buried within existing reports, such as ERP, HR/payroll and industry-specific reports, which are relied upon to fulfill auditing and industry compliance requirements. Further, legacy systems typically contain a vast number of accounting, operational and transactional reports, rich in data to which logic and business rules have already been applied.

Report Mining Presents an Alternative

We propose an approach that uses semi-structured data as a source of BI by recognizing, parsing and transforming it into customized structured data -- with no database programming skills required. This approach, called Report Mining, capitalizes on both this vast repository of data and the savings in cost and effort that can be realized by putting this information in the hands of the people across the enterprise who need it most. Report Mining takes the data buried within existing reports and automatically transforms it into live data sources, either alone or in combination with additional reports, spreadsheets, databases, PDF files, HTML pages, etc.
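
As a rough illustration of the recognize, parse and transform steps described above, and not a description of how any particular Report Mining product works, the Python sketch below turns a small text report into structured rows. The report layout, the sample data and the names SAMPLE_REPORT, REGION_LINE, DETAIL_LINE and mine_report are all assumptions invented for this example. The group heading ("REGION: ...") plays the role of the self-describing label, and the regular pattern of the detail lines plays the role of the implicit structure.

```python
import re
from typing import Dict, List

# Hypothetical sample report, invented for illustration only.
SAMPLE_REPORT = """\
ACME CORP            SALES BY REGION            PAGE 1
REGION: EAST
CUSTOMER        ORDER NO   AMOUNT
Baker Supply    10041      1,250.00
Delta Foods     10057        310.50
REGION: WEST
CUSTOMER        ORDER NO   AMOUNT
Orion Tools     10102      2,000.00
"""

# A group heading is a labeled line such as "REGION: EAST".
REGION_LINE = re.compile(r"^REGION:\s*(?P<region>\S.*)$")

# A detail line is: customer name, two or more spaces, a numeric order
# number, then an amount with exactly two decimal places.
DETAIL_LINE = re.compile(
    r"^(?P<customer>.+?)\s{2,}(?P<order_no>\d+)\s+(?P<amount>[\d,]+\.\d{2})\s*$"
)


def mine_report(text: str) -> List[Dict[str, object]]:
    """Return one structured row per detail line, tagged with its region."""
    rows: List[Dict[str, object]] = []
    region = None
    for line in text.splitlines():
        heading = REGION_LINE.match(line)
        if heading:
            # Remember the current group so later detail lines inherit it.
            region = heading.group("region").strip()
            continue
        detail = DETAIL_LINE.match(line)
        if detail and region is not None:
            rows.append({
                "region": region,
                "customer": detail.group("customer").strip(),
                "order_no": int(detail.group("order_no")),
                "amount": float(detail.group("amount").replace(",", "")),
            })
        # Page headers and column headings match neither pattern and are skipped.
    return rows


if __name__ == "__main__":
    for row in mine_report(SAMPLE_REPORT):
        print(row)
```

Once the report is in this row form, it can be filtered, summarized or joined with data from spreadsheets and databases like any other structured source. A real report would need patterns tuned to its own layout, and monetary amounts would more safely be held as decimals than as floats.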

The Advantages of Report Mining

Report Mining offers a significant advantage in capturing and customizing the high-value data that end users actually need. Why? Because existing reports make up much of the data within the enterprise and, just as important, are the primary source of important transactional information, frequently from outside partners and vendors. These reports are also a reliable and far less costly alternative to giving every employee access to the core enterprise structured data sources. Organizations that go beyond the confines of structured data and build a bridge that incorporates semi-structured data -- specifically, existing reports and business documents -- as a source of programming-free BI will find their efforts repaid many times over. Customers are telling us that Report Mining is that bridge.

About the Author

Michael Urbonas is a product marketing management professional, formerly for the family of Monarch-powered enterprise BI, information delivery and archive solutions (Datawatch Corporation). Please contact Michael Urbonas via his business intelligence and marketing blog at: