Big Data Projects: Taking Care of the Foundation

Data and System Design and Architecture

A comprehensive view of data accuracy, validation, quality and consistency will be required to provide confidence in data sources used for big data analysis.  You'll need master sources of data and “gold standard” sources of truth for certain high value processes. Big data programs typically reveal the need for improved data quality and are frequently connected to master data initiatives.
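As a rough illustration of checking a data source against a "gold standard" source of truth, the sketch below compares incoming rows to a master record set and lists every discrepancy. The product codes, field names and the `validate` helper are hypothetical and stand in for whatever master data an organization actually maintains.

```python
# Hypothetical master ("gold standard") records keyed by product code.
MASTER = {
    "P-100": {"name": "Widget", "unit": "each"},
    "P-200": {"name": "Gasket", "unit": "box"},
}


def validate(source_rows, master=MASTER):
    """Compare source rows to the master and return a list of discrepancies."""
    issues = []
    for row in source_rows:
        code = row.get("code")
        reference = master.get(code)
        if reference is None:
            # The source refers to something the master does not know about.
            issues.append((code, "not in master"))
            continue
        for field_name, expected in reference.items():
            actual = row.get(field_name)
            if actual != expected:
                issues.append(
                    (code, f"{field_name}: expected {expected!r}, got {actual!r}")
                )
    return issues


# Illustrative rows from a hypothetical source system.
rows = [
    {"code": "P-100", "name": "Widget", "unit": "each"},
    {"code": "P-200", "name": "Gaskets", "unit": "box"},
    {"code": "P-999", "name": "Bracket", "unit": "each"},
]
issues = validate(rows)
```

A report like this can feed a stewardship queue, so that quality problems are fixed at the source rather than discovered during analysis.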

Developing consistent naming conventions that make information more findable and improve data access and availability is another essential architecture element. Harmonize and optimize inconsistent and redundant terms and metadata fields to remediate data quality issues. A formal business glossary and a set of BI reference metadata (available in a catalogue, dictionary or repository) are also part of these foundational requirements.
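One way to picture the harmonization of redundant field names is a small lookup driven by a glossary. The sketch below is only an illustration: the `GLOSSARY` mapping, the field names and the `harmonize` function are assumptions made for the example, not part of any particular tool.

```python
# Hypothetical glossary: maps variant field names found in source systems
# to one canonical term. All names here are illustrative only.
GLOSSARY = {
    "cust_id": "customer_id",
    "customerid": "customer_id",
    "client_no": "customer_id",
    "zip": "postal_code",
    "postcode": "postal_code",
}


def normalize_key(name: str) -> str:
    """Lowercase the name and replace spaces and hyphens with underscores."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


def harmonize(record: dict) -> dict:
    """Return a copy of the record with field names mapped to canonical terms.

    Fields missing from the glossary are kept under their normalized name.
    """
    result = {}
    for field_name, value in record.items():
        key = normalize_key(field_name)
        canonical = GLOSSARY.get(key, key)
        # Two variants of the same term must not disagree silently.
        if canonical in result and result[canonical] != value:
            raise ValueError(f"conflicting values for {canonical!r}")
        result[canonical] = value
    return result


clean = harmonize({"Cust ID": 7, "ZIP": "02139", "Segment": "retail"})
```

In practice the glossary itself would live in the catalogue or repository and be maintained under change management rather than hard-coded.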

Changes to data models, taxonomies and reference data need to follow a change management process. Siloed data will need to be combined and integrated through extract, transform and load (ETL) processes in order to arrive at new insights and measures. Tools of varying sophistication will help put the power of particular capabilities into the hands of casual business users (search application interface), power business users (more complex querying and analytical capabilities with "what if" scenarios) and data scientists (advanced capabilities with simple coding languages).
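To make the extract, transform and load step concrete, here is a minimal sketch that combines two siloed extracts into a new measure, revenue per region. The sample rows, the `transform` and `load` functions and the in-memory `warehouse` dictionary are all hypothetical; a real program would use an integration tool and a proper target store.

```python
from collections import defaultdict

# Extract: hypothetical rows pulled from two siloed systems.
crm_rows = [
    {"customer_id": "C1", "region": "East"},
    {"customer_id": "C2", "region": "West"},
]
order_rows = [
    {"customer_id": "C1", "amount": "120.50"},
    {"customer_id": "C1", "amount": "30.00"},
    {"customer_id": "C2", "amount": "75.25"},
    {"customer_id": "C9", "amount": "10.00"},  # no matching CRM record
]


def transform(crm, orders):
    """Join orders to CRM records and total revenue per region."""
    region_by_customer = {row["customer_id"]: row["region"] for row in crm}
    totals = defaultdict(float)
    unmatched = []
    for order in orders:
        region = region_by_customer.get(order["customer_id"])
        if region is None:
            # Keep orphaned rows for stewardship review instead of dropping them.
            unmatched.append(order)
            continue
        totals[region] += float(order["amount"])
    return dict(totals), unmatched


def load(totals, target):
    """Write the new measure into the target store (here, a plain dict)."""
    target.update(totals)


warehouse = {}
totals, unmatched = transform(crm_rows, order_rows)
load(totals, warehouse)
```

The unmatched list is the interesting by-product: it shows where the silos disagree and where master data work is still needed.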

Some of these activities are examples of foundational capabilities that do not, by themselves, produce an ROI. These are the table stakes for starting a program -- consistent metadata schemas, curated data, a suite of software for the technical infrastructure. Representative tools might include Informatica or Talend for data integration, Hadoop for processing large amounts of data, Hive for writing queries, MongoDB for storing documents, and R, SAS or Tableau for analysis and visualization. There is of course overlap between these tools, and large providers like IBM and Oracle offer many of these components in their software suites.

The overall structure of data and data-related resources is an integral part of the enterprise architecture, along with analysis, design, building, testing, and maintenance of data and semantic models. Part of the process is relating questions from the business to available and appropriate data sources. A consistent reference architecture will help tools and platforms talk to each other and simplify data integration. It will also specify which platforms and configurations are supported for basic functionality, which higher-end, more complex configurations are required for learning, testing and experimentation, and which are needed for production-ready environments.

Governance, Communication, Training, Change Management

Successfully operationalized big data programs require significant changes to the way users think about the business and approaches to information processes. Therefore a significant level of effort needs to be devoted to change management, communication, socialization and training. I put governance in this section because governance is the glue that holds all of it together.

A structure for managing the program is another essential element. It must cover every aspect: from business ownership and sponsorship, to decision-making processes and groups, to policies and compliance procedures, resource allocation, and the prioritization of capabilities and technical functions. This structure does not have to be invented from scratch, but can be extended from and integrated with existing content, data and application decision making.

Governance activities include planning, oversight and control over the management of data; rules around the use of data resources; the development and implementation of policies; and decision rights over the use of data. Design the roles and organizational structures required for leadership (councils and steering committees), data quality and curation (stewardship, ownership), and problem solving and capability development (task forces, working groups) in the context of existing organizational governance structures.

The process needs to be sustainable in order to retain the attention and energy of business stakeholders and leadership. This will require designing appropriate membership of various groups and development of working agendas that are meaningful to the attendees. Design a framework for arriving at consensus (or simply making decisions) on data management policies and procedures along with audit requirements. Governance cannot be too heavy handed -- otherwise users will sidestep the process.

Governance Principles for Agile Analytics

Here are some guiding principles for agile analytics and governance initiatives:

  • Use standardized approaches for integrating and normalizing data for faster onboarding of data and data streams
  • Enforce upstream content and data curation and stewardship rather than allowing quality issues to propagate to systems
  • Catalogue data sources in an enterprise data registry, documenting ownership, source, cost, rights, usage, quality and provenance
  • Harmonize transactional and analytical data through master data management
  • Proactively manage semantic consistency and interoperability as a key component within information management programs
  • Continually calibrate and align analytics program operations with business objectives and outcomes
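The enterprise data registry mentioned in the list above can be pictured as a set of records with one field per documented attribute. The following is a minimal sketch under that assumption; the `RegistryEntry` and `DataRegistry` classes and the sample values are hypothetical and do not describe any specific catalogue product.

```python
from dataclasses import dataclass


@dataclass
class RegistryEntry:
    """One catalogued data source, with the attributes named in the list above."""
    name: str
    owner: str
    source: str
    cost: str
    rights: str
    usage: str
    quality: str
    provenance: str


class DataRegistry:
    """In-memory stand-in for an enterprise data registry."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: RegistryEntry) -> None:
        # Refuse incomplete documentation so gaps surface at onboarding time.
        missing = [k for k, v in vars(entry).items() if not str(v).strip()]
        if missing:
            raise ValueError(f"missing fields: {', '.join(missing)}")
        if entry.name in self._entries:
            raise KeyError(f"{entry.name!r} is already registered")
        self._entries[entry.name] = entry

    def owned_by(self, owner: str) -> list:
        """Return the names of all sources documented for one owner, sorted."""
        return sorted(e.name for e in self._entries.values() if e.owner == owner)


registry = DataRegistry()
registry.register(RegistryEntry(
    name="web_clickstream",
    owner="Marketing",
    source="site analytics export",
    cost="internal",
    rights="internal use only",
    usage="campaign analysis",
    quality="unvalidated",
    provenance="collected by the web team",
))
```

Requiring every attribute at registration is a deliberate choice in this sketch: it keeps the catalogue useful, at the price of slightly slower onboarding.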

Business users, data architects and analysts will require training and education in order to make the most of big data programs. Align the message and terminology with the concerns of each audience -- it makes no sense to use marketing examples for the manufacturing quality control process owner. Because the transition from reactive and transactional analytics to proactive and predictive analytics capabilities is such a large shift, it will take time for non-specialists to embrace the concepts. A training plan will help with this transition. New skills will be needed throughout the enterprise. Integrate these with a change management and communication plan.

Starting a big data program is probably the most important task for any business today. Sticking with the process over the years to embed this into the fabric of the organization will require leadership, vision and perseverance. But the reward will be an agile, adaptable organization that's ready to compete and grow in a world where data is the new oil.

Title image by Raúl Hernández González (Flickr) via a CC BY 2.0 license

Editor's Note: This is the final installment in a three-part look at what it takes to create a big data program.