Though often overlooked, there is in today’s information world a battle in progress, between two views of content. The contenders might be called the “rectangular” or database view and the “hierarchical” or XML view, and they influence virtually every decision related to the computerization of information in society. In what follows, we will try to shine some light on this battle, its source and its very real impacts on our information lives.
The roots of the often tortured relationship between the XML and database worlds go back at least to 1884 when Herman Hollerith’s punched cards founded what became IBM and business data processing with its ubiquitous rows and columns. Some time before that, Morse’s telegraphy code had established serial communications with identifying information embedded in the stream of data.
The result was two parallel but separate tracks: one, the row-and-column world of business data processing, and the other, the serial recording of what later became text (and word) processing. Until the introduction of the first relatively affordable business computer in 1959, these two tracks remained separate, with business data processing using a growing array of punched card machines while text moved to teletype and paper tape.
As computer power grew through the 1960s, the use of computers in business grew with it as machines, although still pricey, came within reach of even mid-sized firms.
In the late 60s, hardware manufacturers, focused until then on business computing, began to see text processing as a potentially fertile market as well, introducing magnetic tape keyboard devices, video typesetting systems and software capable of composing text using embedded codes to indicate how the content should be rendered. In 1964, for example, the first computer-driven phototypesetter was delivered to the National Library of Medicine.
Rise of the Database Management System
During that time, business data processing, struggling with the need to design and develop unique software for every application, began development of standard software approaches to data, most notably the “database management system.” Initially, there were three major entries in this race: the hierarchical, network and relational approaches. By the mid-80s, however, Oracle Corp and its relational DBMS had won a no-holds-barred battle to control how most data would be stored and accessed.
Text Finally Bids Goodbye to Lead Type
Largely unnoticed by the database community, the text world had been busy as well, with IBM’s Charles Goldfarb working on a markup scheme to identify the logical structure of content in addition to its visual processing as had been the case with IBM’s GML and RCA’s Page-1.
In 1974, Goldfarb and his team came up with “Standard Generalized Markup Language” or SGML, enabling development of complex logical maps for hierarchical data using embedded tags to define and enforce their logical structures. SGML became an international standard in 1986 (ISO 8879:1986) after several years of informal growth in the defense and aerospace communities.
Text Makes its Presence Felt: the Content Management System
In the early 1980s, in response to the growing volume of text content that needed managing, a new tool came on the market: the Content Management System or CMS. With XyVision shipping its first “Parlance Document Manager” (PDM) in 1983 and with other firms jumping on board, the CMS market grew rapidly in response to the swelling amount of hierarchical content being created and published. For the most part, these systems consisted of a relational database with an application layer designed to keep track of document objects.
Some systems, like PDM, included specific applications to handle SGML content based on a DTD or SGML document map, allowing users to deal with hierarchical content in a manner at least close to its original structure. All CMS software, however, used BLOBs (binary large objects) to store SGML content beneath the level at which it could be fragmented, “chunked” as XyVision called it. Although improving the handling of hierarchical content and helping the text market grow, CMS architecture still tended to mask the fact that no one was dealing directly with the nuance and complexity of text material.
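The architecture described above, a relational database tracking document objects whose markup sits in opaque BLOBs, can be sketched in a few lines. This is only an illustration under stated assumptions, not a reconstruction of PDM or any other product: the `chunk` table, its columns and the sample content are hypothetical, and the fragments are written as well-formed XML rather than full SGML so that Python's standard parser can read them.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema: one row per "chunk"; the markup itself is an opaque BLOB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunk ("
    " doc_id TEXT, seq INTEGER, element TEXT, body BLOB,"
    " PRIMARY KEY (doc_id, seq))"
)

# Illustrative content only.
chunks = [
    ("manual-1", 1, "chapter",
     b"<chapter><title>Safety</title><para>Wear goggles.</para></chapter>"),
    ("manual-1", 2, "chapter",
     b"<chapter><title>Setup</title><para>Attach the base.</para></chapter>"),
]
conn.executemany("INSERT INTO chunk VALUES (?, ?, ?, ?)", chunks)

# The relational layer can answer questions about the chunks it tracks...
count = conn.execute(
    "SELECT COUNT(*) FROM chunk WHERE doc_id = ? AND element = ?",
    ("manual-1", "chapter"),
).fetchone()[0]

# ...but nothing inside a chunk is visible to SQL. To read a title, the
# application layer must fetch each BLOB and parse the markup itself.
titles = [
    ET.fromstring(body).findtext("title")
    for (body,) in conn.execute(
        "SELECT body FROM chunk WHERE doc_id = ? ORDER BY seq", ("manual-1",)
    )
]
```

In this sketch the database knows that two chapters exist and in what order, but every question about what a chapter contains is answered outside the database, which is the masking effect noted above.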
Underscoring the growing importance of text, in the mid-80s the database world, Oracle in particular, moved to co-opt SGML (and perhaps head off CMS competitors using other databases) by claiming that its soon-to-be-released database version would “handle” SGML content. That didn’t work out well, partly because, for all its brilliance, SGML was just too ponderous for true efficiency with the computing resources of the day.
It seemed that Goldfarb may have done too good a job of including every possible eventuality in the SGML standard.
New Kid on the Block … XML
The development of XML, partly driven by frustration over the weight and complexity of SGML, freed tagged data from the overhead of the SGML standard and dramatically accelerated the growth of text by making processing of tagged content possible in Web browsers and smaller, less expensive local devices.
Given that what the database world had labeled “unstructured” makes up a majority of all computer-accessible data, market forces demanded that the information community finally take notice of those crazy people in the text world. Suddenly, budgets for computer hardware and software, previously the near-exclusive bailiwick of the database world, were being driven at least partially by the needs of text, Web and undisciplined users, none of which the pure relational model handles particularly well.
Unable to continue its policy of ignoring text and hierarchical content, and with the “unstructured” moniker becoming less credible, the database world tacitly agreed to begin calling XML content “semi-structured,” holding that because its content couldn’t be easily imported into relational tables and rows, XML couldn’t be called truly structured.
Whatever truth that position may contain from a purely technological perspective, its impact on the world of content and its management went far beyond, promulgating the implicit assumption that XML, because it couldn’t be called structured, must actually be unstructured.
The “semi-structured” label also leaves the impression that while XML can be valuable, it can be truly useful only when it is imported into a relational database for “real” processing. “Do what you want with XML,” the message implies, “but buy an RDBMS anyway if you are serious.” Not a bad marketing position if you’re selling relational database software, but a heavy burden if you’re developing or selling XML software and vying for budget with the RDBMS your potential client thinks he must have if he is to take the XML plunge.
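The mismatch behind the “semi-structured” label is easy to see in miniature. The following is a minimal sketch, not a recommended import method: the sample document and the `flatten` helper are hypothetical, and the one-row-per-element layout is just one of several ways a tree might be forced into rows.

```python
import xml.etree.ElementTree as ET

# Illustrative document: sections nest to an arbitrary depth.
DOC = """<report>
  <title>Q3 Review</title>
  <section>
    <title>Sales</title>
    <para>Up in the east.</para>
    <section>
      <title>Detail</title>
      <para>Driven by one account.</para>
    </section>
  </section>
</report>"""


def flatten(elem, path="", rows=None):
    """Walk the tree depth-first and emit one (path, text) row per element."""
    if rows is None:
        rows = []
    here = f"{path}/{elem.tag}"
    rows.append((here, (elem.text or "").strip()))
    for child in elem:
        flatten(child, here, rows)
    return rows


rows = flatten(ET.fromstring(DOC))

# Depth of each row, recovered from the path column.
depths = [path.count("/") for path, _ in rows]
```

The content does fit into rows, but only because the hierarchy that the markup expressed implicitly has been spelled out as data in a path column, and a fuller version would also need columns for sibling order and parent identity. The structure was present all along; it simply is not rectangular.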
Which Way Forward?
From the proverbial fifty-thousand-foot level, this entire era might be viewed as mere bumps in the evolution of new and innovative approaches to data. In the real world of people, organizations and content, however, the status of text and XML as “semi-structured” retards our ability to leverage the hierarchical content making up a significant and growing majority of all computerized data, and with it our ability to fully leverage the best work of both our technological and content creation communities. By forcing virtually all content through the sieves of RDBMS thinking and software, this view limits what we can expect from content, and it makes what we do accomplish more complex and expensive.
One thing is for sure: we didn’t get here overnight, and we won’t reach a better state where textual content is concerned overnight either. A lot of water has passed under that bridge, and with it has come a major impact on what people believe, and assume, about how best to handle content. Adopting a technological environment that fully leverages hierarchical content is, in today’s world, a risky step for both organization and decision maker.
But there are some bright spots. While there has for some time been active development of “native XML databases” in the open source and second level market, in the past several years two major CMS vendors, EMC2 Documentum and XyEnterprise (now SDL), have added native XML processing to their CMS offerings and their marketing narratives.
Movements like this, although still far short of threatening the perceived primacy of the relational database for hierarchical content, indicate an evolution in thinking that portends an alternative path for organizations dealing with hierarchical content without forcing them to leave the relative safety of industry-leading firms.
All of this can’t come too soon. If society is to fully realize an information destiny in which every potential participant and every form of content is fully supported in its best and most flexible manner, we must recognize that from their beginnings in the 19th century both rectangular (business) and serial/hierarchical (text) means of recording content had their goals, their genius and their right to a full measure of human creativity.
While the uneven evolution of technology, especially hardware power, temporarily favored one or the other of these modes, today’s information thinking should start from scratch, at least conceptually, allowing each to stand on its own and receive a full measure of innovation and investment for what it does well. While the hierarchical side is behind in claiming its place on the podium, it has the theory and technology to catch up rapidly given the chance and support.
If we fail at this, the information world won’t end, to be sure, but we will continue to handle information, our most powerful resource, with one arm at least partially tied behind us.
Editor's Note: Barry's articles are always thought provoking. To read another, why not The Connected World: This Decade's Dot Com Bubble?