While attending TechEd Europe in Amsterdam last month, I sat in on a session on big data by Gert Drapers, a principal software architect on the SQL Server team at Microsoft. His session, entitled "Big Data, Big Deal?", was based on content by David J. DeWitt and Rimma Nehme from their SQL Pass 2011 keynote, and sought to clarify the definition of big data and Microsoft's role in the space.

What We Mean When We Talk Big Data

Big data means a lot of things to a lot of people, with new terms and acronyms entering the market every month. Drapers defined it as being about high volumes of machine-based data, typically hosted on large clusters of low-cost processors, with the intent of slicing and dicing the data for competitive information.

While the problems of big data are nothing new, the ability of organizations to take advantage of their data has changed dramatically over the past decade. Companies are capturing more and more data -- and IT teams are being tasked with figuring out how to manage it and put it to use. One example he shared was cell phone data: your phone constantly emits location data, and that steady ping is being collected so it can be read and correlated with other data sources and activities.

Drapers shared some interesting statistics:

  • 966 petabytes stored by manufacturing industry as of 2009
  • 848 petabytes by government
  • 715 petabytes by communications
  • Growing to 35 zettabytes (1 zettabyte = 1 billion terabytes) within the next 5-8 years
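
To put those figures on a common scale, here is a minimal sketch of the unit arithmetic. It assumes decimal (SI) units, which is what the "1 zettabyte = 1 billion terabytes" conversion implies; the `convert` helper and variable names are illustrative only.

```python
# Decimal (SI) byte units, matching "1 zettabyte = 1 billion terabytes".
TERABYTE = 10**12
PETABYTE = 10**15
ZETTABYTE = 10**21

# Figures quoted in the session (petabytes stored as of 2009).
stored_pb = {"manufacturing": 966, "government": 848, "communications": 715}


def convert(amount, from_unit, to_unit):
    """Convert an amount between byte units; returns a float."""
    return amount * from_unit / to_unit


# Combined storage of the three sectors, in petabytes and zettabytes.
total_pb = sum(stored_pb.values())
total_zb = convert(total_pb, PETABYTE, ZETTABYTE)

# The projected 35 zettabytes expressed in terabytes.
projected_tb = convert(35, ZETTABYTE, TERABYTE)

if __name__ == "__main__":
    print(f"Three sectors combined: {total_pb} PB = {total_zb:.6f} ZB")
    print(f"35 ZB = {projected_tb:,.0f} TB")
```

In other words, the three sectors listed above together held a small fraction of a single zettabyte in 2009, which gives a sense of how steep the projected growth is.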

The shift in thinking across many organizations was the realization that data was "too valuable" to delete, even if current hardware and software had limited ability to do something with the data. But with hardware and storage costs dropping, using this data is becoming a reality for more and more companies.

NoSQL Stands For More Options

Part of the shift is due to the NoSQL movement -- which is not about abandoning SQL, but about using SQL alongside other alternatives. The idea is that not every problem is best solved by a relational database. Expanding the way data is collected and stored gives organizations more flexibility in their data models, and allows for relaxed consistency models such as eventual consistency.

Attributes of a NoSQL system:

  • Unstructured data without schema
  • No ACID
  • No transactions
  • No SQL (or not just SQL)
  • Eventual consistency (Relational/Structured is very consistent)
  • No ETL
  • Faster time to insight (Relational takes longer)
  • Flexibility (Relational is mature)

A core message of Drapers' presentation was that the RDBMS is no longer the only game in town, and that SQL people need to understand this and start looking at Hadoop. For those unfamiliar with it, Hadoop's roots go back to Google, which needed to store and analyze massive amounts of click stream data and described its approach in papers on the Google File System (2003) and MapReduce (2004).

Hadoop is an open source implementation of those ideas, pairing a distributed file system with the MapReduce (MR) programming model to provide scalability and a high degree of fault tolerance, so that massive collections of records can be analyzed quickly without first forcing the data to be modeled and cleansed. Hadoop is written in Java, and Yahoo! has been a major contributor, using it extensively across its properties. Drapers pointed out that Hadoop is not a paradigm shift, but an evolution in how data is captured and used.
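
For readers who have not seen the MapReduce model before, the following is a minimal single-process sketch of its three steps (map, shuffle, reduce), counting clicks per URL. It is not the Hadoop API and was not shown in the session; the sample records and function names are made up for illustration, and a real cluster would spread each phase across many machines.

```python
from collections import defaultdict

# Hypothetical click-stream records as (user, url); not data from the talk.
CLICKS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/pricing"),
    ("carol", "/home"),
    ("bob", "/pricing"),
    ("carol", "/docs"),
]


def map_phase(record):
    """Emit (key, value) pairs for a single raw input record."""
    _user, url = record
    yield (url, 1)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over the input records."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(CLICKS))
```

Note that the records go straight into the map step without being loaded into a schema first, which is the "analyze before modeling and cleansing" point made above.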

Microsoft and Hadoop

In October 2011, Microsoft announced their support for Hadoop with integrations to SQL Server and Windows Azure, their cloud-based service for hosting and scaling applications. Microsoft is showing their support for Hadoop with the following:

  • Apache Hadoop for Windows Server and Windows Azure
  • ODBC Driver and Excel Add-in for Apache Hive
  • JavaScript Framework for Hadoop
  • SQL Server and SQL Server Parallel Data Warehouse connections for Apache Hadoop
  • Collaboration with Hortonworks
  • New investments in this area

You can read more on Microsoft's support for Hadoop here. For more from Gert Drapers, check out this video from TechDays: Hadoop vs RDMS, Gert Drapers on Big Data. You can also find a great synopsis of DeWitt's SQL Pass 2011 keynote on Simaran Jindal's blog, which includes many of the same slides used by Drapers at TechEd Europe.

Editor's Note: To read more by Christian Buckley:

-- Office 365 for the Enterprise, SharePoint in the Cloud #TECHED12