On September 4th, the President took another important step toward a more open and transparent government by announcing a new policy to voluntarily disclose White House visitor access records.
Aside from a small group of appointments that cannot be disclosed because of their necessarily confidential nature, the record of every visitor who comes to the White House for an appointment or a tour, or to conduct business, will be released. As historic as the President's announcement is, it is also a good illustration of what is missing from the administration's technology infrastructure plan -- a coordinated approach to providing data standards.
On the surface, this new disclosure of visitor data looks perfectly fine. The data, made available as a simple CSV file, is easily downloaded and opened in a spreadsheet for viewing.
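Loading that CSV is indeed trivial, which is a minimal sketch can show. Note that the column names below (NAMELAST, NAMEFIRST, APPT_START_DATE, VISITEE_NAMELAST) are illustrative assumptions, not the file's actual header:

```python
import csv
import io

# An inline stand-in for the downloaded visitor-records file.
# The header and field names are hypothetical.
sample = io.StringIO(
    "NAMELAST,NAMEFIRST,APPT_START_DATE,VISITEE_NAMELAST\n"
    "Hemsley,Stephen,7/14/2009 3:00PM,Chopra\n"
)

# DictReader maps each row onto the header -- easy to view,
# but the meaning of each column lives only in our heads.
rows = list(csv.DictReader(sample))
for row in rows:
    print(row["NAMELAST"], "visited", row["VISITEE_NAMELAST"])
```

The ease of this step is exactly the problem: nothing in the file itself says what the columns mean, what the date format is, or where the authoritative copy lives.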
Take a step beyond simple viewing, however, and try to mash up this content to see where the visitor list collides with other interest groups and data sources, and you begin to get an idea of the complex nature of data mapping. For example, think of mashing up this visitor information with U.S. SEC filings, tagged in XBRL, that include the names and remuneration of executives of publicly traded companies.
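A hedged sketch of that mash-up makes the difficulty concrete. Both datasets below are hypothetical stand-ins (the company name is a placeholder, not taken from any filing), and the join is a naive exact match on name -- which is precisely where real-world data mapping breaks down:

```python
# Hypothetical extract of the visitor records.
visitors = [
    {"last": "Hemsley", "first": "Stephen", "visitee": "Chopra"},
]

# Hypothetical extract of executive data derived from SEC XBRL filings.
sec_executives = [
    {"last": "Hemsley", "first": "Stephen", "company": "Example Corp"},
]

# Naive join on exact (last, first). Real matching must cope with
# middle names, suffixes, nicknames, and ambiguous spellings --
# none of which the CSV's bare strings help with.
matches = [
    (v, e)
    for v in visitors
    for e in sec_executives
    if (v["last"], v["first"]) == (e["last"], e["first"])
]
print(len(matches), "candidate match(es)")
```

Without a shared identity standard, every consumer of the data has to reinvent this matching logic, and each reinvention disagrees with the others at the edges.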
Better yet, simply try to blog about someone’s visit to the White House and reference a snippet from the .csv content. Then go to Twitter and post a tweet with a link to your blog so you can have bragging rights about being the first to notice some VIP’s visit.
If I then repost the information on my blog and one of my readers wants to get back to the source file to verify the facts, there is no path back to the original source without some form of metadata and a URI associated with the content. Therefore, there is no way to validate that the information is accurate. When I repost your information on my blog, I am simply trusting your cut-and-paste skills and trusting that you accurately interpreted the information. This is a potentially dangerous situation that often leads to a lot of misinformed "noise."
So far, in the marriage of social networks and open government, there has been a lot of "noise" coming in, but there has been very little done in the way of creating constructive solutions for accurate and trusted citizen participation.
Without the metadata about the newly disclosed visitor content or any other government information, the accuracy with which data is interpreted is jeopardized with each reuse. Without a link back to the source, the authenticity of the content is no longer discoverable. Without this information, it’s all just more “noise” on the web.
Where Does XML Fit in?
XML industry standards bring metadata to the content. Even a simple XML schema and an instance document would have gone a long way toward ensuring that, regardless of what tool consumed the visitor data (including spreadsheets), the information would always be interpreted in the same manner. Furthermore, the use of an XML industry standard for identity would enable one to leverage existing tools to mash up the content with other data sources. The key benefit of XML is that consuming applications no longer require someone to reinvent clever ways of mapping and representing complex data, so developers can spend their energy on solving higher-level problems that have a greater return.
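To illustrate, here is a minimal sketch of the same visitor record as a self-describing XML instance, parsed with Python's standard library. The element names, the namespace, and the source URL are all illustrative assumptions -- no such official schema exists for this data:

```python
import xml.etree.ElementTree as ET

# A hypothetical XML instance: the namespace identifies the vocabulary,
# and the source attribute carries a URI back to the original data
# (a placeholder here, not a real government URL).
record = """\
<visit xmlns="urn:example:visitor-records"
       source="http://example.gov/visitor-records.csv">
  <visitor><last>Hemsley</last><first>Stephen</first></visitor>
  <visitee><last>Chopra</last></visitee>
  <appointment start="2009-07-14T15:00:00"/>
</visit>
"""

ns = {"v": "urn:example:visitor-records"}
root = ET.fromstring(record)

# Any consumer honoring the namespace reads the same fields the same
# way, and the source attribute preserves a path back to the data.
print(root.find("v:visitor/v:last", ns).text)
print(root.find("v:appointment", ns).get("start"))
print(root.get("source"))
```

Contrast this with the bare CSV: the date is unambiguous ISO 8601, the fields are named by a shared vocabulary, and a reader of any repost can follow the source URI back to verify the record.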
There are plenty of other examples across federal, state and municipal government agencies that build the case for leveraging XML industry standards to aid in creating greater transparency and to create efficiencies for the agencies themselves.
Where Do We Go from Here?
Recovery.gov and multiple other individual government agency projects have taken strides toward granting the public access to government data. However, cross-agency conversations are still under way to reach agreement on common data models for comparing and mashing up information from multiple data sources accurately.
Efforts such as the NIEM-XBRL harmonization discussions should be applauded; that combined effort should aid in the accurate mapping of government financial data across agencies. There is still a long way to go, though, before we can start to leverage really interesting technologies like the Resource Description Framework (RDF) and the Semantic Web.
While everyone wants to jump on the Web 2.0 bandwagon, designing the technology infrastructure to ensure that it is done in an open, transparent and accurate manner requires a lot of cross-agency collaboration. The administration’s goal should be to ensure that the public can collaborate on the analysis and dissemination of public information across the web in a manner that can be trusted, authenticated and redistributed without imposing a cost burden on the consumers or the producers of that information. That is no small task.
This all leaves me wondering if I am guessing correctly about what was being talked about in the White House on 7/14/2009 at 3:00:00PM and about who was in the room. If my assumptions are right -- loosely based on about 22,200 Google hits for Stephen J. Hemsley, who was listed as visiting Aneesh Chopra, for whom there are about 1,170,000 Google hits -- I’m guessing a lot of these same data topics were addressed with a slight healthcare twist. But then again, I’m doing the interpretations here and making the free associations, so you’ll just have to trust me.
To add your input to the conversation about improving data access on the web, join us at the Workshop on Improving Access to Financial Data on the Web, October 5-6, 2009, in Arlington, VA, co-organized by W3C and XBRL International, Inc., and hosted by the FDIC.