Big data, smart data, call it what you will; it's time we face a timeless truth: machine-readable data from heterogeneous sources is valuable only if it can be understood.
What that means, and what some big data champions are belatedly recognizing, is that absent the ability to know or easily find out what the creator meant, all that data is limited in its ability to inform.
What's Old is New Again
You could argue that Samuel F. B. Morse and his associates understood this in 1836, but that might be pushing the connection.
But IBM definitely understood this at least as far back as the late 1960s, when it hired Charles Goldfarb to develop a markup language capable of describing both the structure and meaning of data, along with a machine-readable map to accompany it: Standard Generalized Markup Language (SGML) with its Document Type Definition (DTD).
XML, the current international markup standard, grew from a desire to simplify SGML syntax, but is based on the same mission of making data understandable, to achieve “interoperability.”
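The idea is easiest to see in miniature. In the sketch below (the element and attribute names are invented for illustration), the DTD at the top is the "machine-readable map": it declares which elements may appear, in what order, and what attributes they must carry, so any receiving system can validate the data before trying to interpret it.

```xml
<?xml version="1.0"?>
<!DOCTYPE reading [
  <!ELEMENT reading (device, temperature)>
  <!ELEMENT device (#PCDATA)>
  <!ELEMENT temperature (#PCDATA)>
  <!ATTLIST temperature unit CDATA #REQUIRED>
]>
<reading>
  <device>chiller-3</device>
  <temperature unit="celsius">7.2</temperature>
</reading>
```

A validating parser will reject a document that omits the required `unit` attribute or reorders the elements, which is precisely the kind of shared contract that makes heterogeneous data mutually intelligible.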
Why would something that took place more than 50 years ago be important today? And why didn't we anticipate this need before we began saving every data point in sight?
The answer may lie in our fascination with those growing mountains of data. We were convinced that with big data we had reached a new horizon and just needed to let it revolutionize us.
There’s also the sad fact that every new generation loathes admitting that it can learn anything from its troglodyte predecessors.
But we should have remembered that just collecting all that heterogeneous data wasn't going to get us where we wanted to be. No matter how heavy the analytical machinery we apply to interpret the data's meaning, there are limits beyond which it is neither productive nor safe to go outside the statistics lab.
We are approaching that limit.
And we're reaching it well before the endeavor has fully justified its cost, complexity and error rate. Factor the coming Internet of Things into the mix, and the big data world rapidly gets both bigger and more heterogeneous.
So we rediscover the importance of consistent, understandable data. But we've probably wasted valuable time and resources chasing a goal we couldn’t possibly reach.
What Do We Do Now?
Having belatedly realized that we need ways to share both data and meaning, we should learn from others who have been dealing with the challenge for a long time, who have developed ways of achieving broad — if not perfect — interoperability.
One thing we can see from previous efforts is that it is incredibly difficult to achieve a single notation or vocabulary across the many communities that create and need data. Over time, the different industries, governments and others producing and gathering data have found ways to balance the desire for unlimited interoperability with the need to serve their own community in a timely fashion.
For example, in the building and facilities management community, managers need the ability to centrally monitor and control the many mechanical and electronic devices that support a large building. Hence the development of MIMOSA (the Machinery Information Management Open Systems Alliance), which defined a standard way for industrial devices to communicate their status and condition in a common machine-readable form.
Machines and devices are not the only aspects of building operation that need monitoring, but their makers reasoned that these products were sufficiently distinct to warrant their own language, so they got together and created one. As a result, a large percentage of mechanical and electronic devices in the building trades communicate in a standard form, making central monitoring and control both feasible and cost effective.
The MIMOSA example demonstrates that, properly defined and limited, a community can achieve data interoperability. It also shows that once a common language is adopted and enforced in one major corner of the data world, it can interoperate with others via translation. MIMOSA-compliant devices, for example, interoperate easily with the broader Computerized Maintenance Management Systems (CMMS) responsible for overall control of building operation and maintenance.
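The translation idea above can be sketched in a few lines. In this hypothetical example (the vendor field names and units are invented for illustration and are not taken from the actual MIMOSA specifications), two device vendors report the same condition data in different shapes, and a thin mapping layer normalizes both into one common form that a central monitoring system can consume.

```python
# Two hypothetical vendors report device condition under different field
# names and units; each normalizer maps its vendor's message into one
# common, machine-readable schema.

def normalize_vendor_a(msg: dict) -> dict:
    """Vendor A reports temperature in Fahrenheit under 'temp_f'."""
    return {
        "device_id": msg["id"],
        "status": msg["state"].lower(),
        "temperature_c": round((msg["temp_f"] - 32) * 5 / 9, 1),
    }

def normalize_vendor_b(msg: dict) -> dict:
    """Vendor B already reports Celsius, but under different keys."""
    return {
        "device_id": msg["device"],
        "status": msg["condition"].lower(),
        "temperature_c": msg["celsius"],
    }

readings = [
    normalize_vendor_a({"id": "ahu-1", "state": "OK", "temp_f": 68.0}),
    normalize_vendor_b({"device": "chiller-3", "condition": "ALARM",
                        "celsius": 7.2}),
]

# Once everything speaks the common form, central monitoring is one loop.
alarms = [r["device_id"] for r in readings if r["status"] == "alarm"]
print(alarms)  # ['chiller-3']
```

The point is not the code but the economics: each vendor writes one translator into the common form, instead of every monitoring system writing a translator for every vendor.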
Finally, MIMOSA's status as a voluntary industry alliance rather than a government standards body tells us, as Adobe co-founder John Warnock has said, that the bureaucratic overhead of government isn't always (Warnock would say ever) needed to achieve standardization.
No Time to Waste
While many efforts are underway in the world that produces most big data stores, they labor in relative obscurity. A 2015 report on big data from the International Organization for Standardization and the International Electrotechnical Commission (ISO/IEC) makes clear that while a significant amount of work is ongoing in areas related to big data standardization, most of it is still in the planning, design and testing phases rather than in implementation.
What we need now is a more energetic effort to finalize workable standards in the areas most impacted by big data, and to make those standards smarter and easier to work with. This is critical because businesses are collecting data at an accelerating pace, creating mountains of data that could be much more useful if it spoke a common set of languages.
Smart Data, If It Happens, Will Be User Driven
Industry leaders and standards bodies can drive this effort towards standardization to a certain extent, but standards in any area are useful only to the extent that software and hardware makers include compliance in their products.
Given the bottom-line obsession of the technology industry, this will happen only if the users — all those firms, agencies and organizations who need the answers big and smart data can provide and who buy the tools to get them — demand a level of standardization and interoperability in the tools and techniques on offer.
Call it what you like: Smart Data, Content Standards, Interoperability or what have you, the writing is on the wall. We need usable data across the technology world. It’s up to us to read a little history and then make it happen.