The information world is clamoring for better access to information of all types. What comes out of the delivery pipeline depends on every function across the content life cycle -- design, authoring, identification, management and delivery -- working efficiently, so a significant shortfall in any one function is cause for concern.

Authoring from a Content-Centric Perspective

If there is a single word for where problems usually begin and must be addressed, it is “authoring,” the constellation of people, activities and tools responsible for capturing the rational thoughts of subject matter experts for the growing array of target forms and venues.

From the delivery perspective, everything will work if we can just get -- and afford -- richly tagged content using standard, widely supported schemes with embedded semantic and other types of finding aid data, in an easily searched content store ready for users’ queries.

If your primary interest is the content generated by authoring, it makes sense to concentrate on authoring functions that directly impact the data generated: the human author, the format in which the content is to be captured and the tools used to perform that capture.

First: the human author

Authors may be news reporters, engineers, scientists, historians or what have you, working individually or in groups. These folks run the gamut of society’s disciplines, bringing with them the particular approaches to their work common in their communities. Except for a minority using WordPerfect and a smattering of other tools, they overwhelmingly create their content using Microsoft Word.

Authors like word processing’s ease of use and flexibility, but most of all they like the absence of discipline on their work -- you know; do the entire document using the default “normal” style and you can still make it look OK. Indeed, getting them to act congruently to capture their intellectual product in any consistent manner is a challenge from the outset.

In some areas, technical documentation for example, authors are hired to create content in the form desired, using the tools provided and upgrading their skills where necessary to become and stay productive. A challenge in itself, this is orders of magnitude easier than working with subject matter experts who view their role as thinking and writing, not mastering new capture technologies.

So we are well advised to remember that our intended participants won’t necessarily be cheering us on to change their working lives.

Second: the format in which the content is to be captured

Today, the most important element of our subject is the content moving from creator to consumer. Change the people; change the tools; change the vendors or delivery medium: it still works if somehow the desired content can be created and moved through the system. But change the content in a major way and things may grind to a halt until the downstream functions can be reworked to handle the new formats.

XML to the Rescue?

Today’s odds-on favorite for life cycle effectiveness is content marked up in XML and its various associated protocols. Properly deployed, XML content can carry a wealth of information about the subject, its source, its logical organization and how it can best be located and delivered. No other lexicon is likely to fully meet this set of requirements, objections from various technical communities notwithstanding.
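As a rough illustration of what "carrying a wealth of information" can mean in practice, the sketch below parses a small XML fragment and pulls out its source, subject, finding-aid keywords and logical organization. The element and attribute names (article, meta, subject, keyword, section) are invented for this example and do not come from any particular industry standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every element and attribute name here is an
# assumption made for illustration, not part of a published schema.
SAMPLE = """
<article id="a-101" source="Field Engineering" lang="en">
  <meta>
    <subject>pump maintenance</subject>
    <keyword>impeller</keyword>
    <keyword>seal replacement</keyword>
  </meta>
  <section>
    <title>Replacing the Seal</title>
    <para>Drain the housing before removing the cover.</para>
  </section>
</article>
"""

def describe(xml_text):
    """Collect the descriptive data a delivery system might index."""
    root = ET.fromstring(xml_text)
    return {
        "source": root.get("source"),
        "subject": root.findtext("meta/subject"),
        "keywords": [k.text for k in root.findall("meta/keyword")],
        "section_titles": [t.text for t in root.findall("section/title")],
    }

info = describe(SAMPLE)
```

The point of the sketch is that the same file answers "what is this about," "where did it come from" and "how is it organized" without anyone having to infer those answers from visual formatting.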

So, we might ask: “In a Microsoft Word-based world, how can we get the XML content we need?” XML authoring works best when the involved industry develops and validates content standards, and industry vendors then support those standards, giving adopters a wide range of available tools and support. This has worked in a number of major industries, and, even with only partial implementation, it can work in today’s information delivery world.

It will, however, require imposition of an increased level of discipline on authors and on the authoring process.

Finally: the tools

If authors, especially subject matter experts, typically do their work in MS Word, might we look to Word for some sense of the level of content creation available to us and what we must do to produce final deliverables?

Microsoft and MS Word: Boon or Bane?

Word, despite its high level of flexibility and functionality for its users, uses a tightly controlled internal data model based on linear objects it calls “paragraphs,” defined as content between hard returns -- tables of course are a different matter and are handled with unique formats.

This is important because much of today’s content destined for web delivery is hierarchical in nature: elements nested within each other so that each level acts as the child of its parents and recognizes siblings with the same parentage. While XML is designed to record and leverage this hierarchical nature, Word’s underlying data models do not easily record it nor impose rules for what may or must be nested within what.
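To make "rules for what may or must be nested within what" concrete, here is a minimal sketch of a containment check. It is not a real schema language such as DTD or XML Schema; the rule table and element names (chapter, section, title, para) are assumptions chosen only to show the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical containment rules: which child elements each parent may hold.
ALLOWED_CHILDREN = {
    "chapter": {"title", "section"},
    "section": {"title", "para", "section"},
    "title": set(),
    "para": set(),
}

def nesting_errors(element, path=""):
    """Return a message for every child the rules do not allow in its parent."""
    here = f"{path}/{element.tag}"
    allowed = ALLOWED_CHILDREN.get(element.tag, set())
    errors = []
    for child in element:
        if child.tag not in allowed:
            errors.append(f"<{child.tag}> is not allowed inside {here}")
        # Recurse so that problems deeper in the tree are also reported.
        errors.extend(nesting_errors(child, here))
    return errors

good = ET.fromstring(
    "<chapter><title>Pumps</title>"
    "<section><title>Seals</title><para>Drain first.</para></section>"
    "</chapter>"
)
bad = ET.fromstring("<chapter><para>Stray text</para></chapter>")

good_errors = nesting_errors(good)  # empty list: every child is permitted
bad_errors = nesting_errors(bad)    # one message about the stray <para>
```

A flat sequence of Word paragraphs has nothing equivalent to check against: a body paragraph placed directly under a chapter heading looks the same to the file format as one placed inside a section.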

Word and XML: A Tortured Relationship

Accordingly, as the web delivery world demands increasingly nested content, systems based on Word have difficulty mapping between initial input files and needed deliverable formats; indeed, an entire sub-industry has grown up to provide tools and services to make the necessary translations, mostly from Word to XML. The development and adoption of the Office Open XML (OOXML) standard -- interestingly, by Microsoft -- has made the intersection between word processing and XML somewhat more transparent, but it hasn't done much for hierarchical content.

Nevertheless, the demand for richly structured hierarchical content grows; the XML standards are in place; the specific lexicons are in use, but the word processing world, especially Word, is still having difficulty fully embracing them.

True, Microsoft hasn't completely ignored XML. It released a version (Word 2003) that could use and store XML in custom schemas, and in Word 2007 it adopted XML as the underlying format for Word documents (.docx and OOXML). While not making Word into a truly usable structured content tool, this at least signaled that XML had penetrated the halls of Redmond.

Then, with Word 2007, Microsoft ran afoul of a decade-old patent held by Canadian firm Infrastructures for Information (i4i), losing a judgment of nearly US$300 million and being forced to issue an “update” to remove its “custom XML,” returning Word essentially to its pre-2003 XML state. While MS Word still uses its own XML tagging scheme for its files (.docx, OOXML), it no longer supports use of custom XML markup schemes and, if history serves, won’t any time soon.

A (thin) Shelf of Options

So what should we do if our future includes demand for better information products and an authoring community that is mired in word processing and therefore not fully marching to our tune?

There are a few options, all of which involve imposing an enhanced level of discipline on authors and the authoring process:

1. If your authoring communities are willing to use an editing tool capable of creating the nested content you need, you have the most options: (oversimplifying somewhat) investigate and select a native XML authoring system like PTC Arbortext, Altova Authentic or a dozen or so others, develop your life cycle flow, train your people and begin creating the content your delivery folks want. Congratulations… you are among the fortunate few.

2. If you don’t fall into this somewhat rarefied class and your authoring community, for whatever reason, is firmly attached to word processing, you still have options:

  • You can work with your authoring community to discipline their word processing, using tools like templates, custom styles and applications to help with complex content forms. Then you can install a tool to convert your word processing files to the XML you need. You will have to evaluate and clean up the converted XML; the better disciplined the word processing, the less the cleanup and vice versa. The XML you get is never quite as elegant but it’s usually good enough to move forward.
  • You can try to make your word processing software act like an XML structured editor by installing one of the available add-on software packages -- most designed for use with Word -- hoping to allow your authors to keep their word processing tool while you get the XML you need. This can work, but it is likely to demonstrate that what authors really resist giving up is not the particular word processing software but the freedom from discipline available in raw word processing; they often resist the add-on as aggressively as they would resist giving up word processing altogether.

3. You can accept the word processing files just as your authoring communities create them, doing whatever it takes to rework them into the XML you need. This is usually expensive and messy, but it is also the lowest impact course in dealing with your authors and in some cases is the best of a difficult situation. Convince at least some of your authoring groups to shoulder the responsibility and cost of this rework and you duck part of the bullet.
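The first bullet under option 2, disciplined styles followed by conversion, can be sketched roughly as follows. This is only an outline of the general approach, not how any commercial converter works: it assumes the paragraphs have already been extracted from the word processing file as (style name, text) pairs, and the style names and output element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical style names; a real template would define its own.
HEADING_LEVELS = {"Heading 1": 1, "Heading 2": 2, "Heading 3": 3}

def paragraphs_to_xml(paragraphs):
    """Nest flat (style, text) pairs into <section> elements by heading level."""
    root = ET.Element("document")
    stack = [(0, root)]  # (heading level, element), from root to current section
    unmapped = []
    for style, text in paragraphs:
        level = HEADING_LEVELS.get(style)
        if level is not None:
            # Close any open sections at this level or deeper.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section")
            ET.SubElement(section, "title").text = text
            stack.append((level, section))
        else:
            if style != "Body Text":
                # Undisciplined styling: keep the text but flag it for cleanup.
                unmapped.append((style, text))
            ET.SubElement(stack[-1][1], "para").text = text
    return root, unmapped

flat = [
    ("Heading 1", "Pumps"),
    ("Body Text", "Overview of pump types."),
    ("Heading 2", "Seals"),
    ("Normal", "Drain the housing first."),
    ("Heading 1", "Valves"),
]
doc, needs_cleanup = paragraphs_to_xml(flat)
xml_text = ET.tostring(doc, encoding="unicode")
```

The needs_cleanup list is where the discipline trade-off shows up: every paragraph left in a default or ad hoc style lands there, so the more consistently authors apply the agreed styles, the shorter that list and the cheaper the cleanup.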

Discipline: the Magic Ingredient

In the end, with the exception of option three, the key ingredient in whatever you do will be increased discipline in the authoring process; to authors, an extra layer of intellectual overhead that virtually always generates resistance that no level of software or amount of funding will fully ameliorate.

But if you want disciplined output at reasonable cost, you must have discipline in your input, so however you plan to stay abreast of your information world, begin the authoring discipline process now. It can be done but it will take time… time you may have now, but won’t have when your audience comes banging at the door, demanding more and better information products, and in no mood to wait while you ramp up.

Editor's Note: Interested in more of Barry's perspectives on the importance of XML? Read The Battle for Data Supremacy: The Cost of Ignoring XML