A lie is being told in the meeting rooms and offices of corporate America. The lie is so basic and pervasive that it can be found in institutions throughout the world.

The lie is linear causality. People use linear causality — where cause and effect are always determinable — to simplify what in reality may be much more complicated problems. 

Simplification allows people to make progress against multi-dimensional problems which otherwise might appear intractable. The problem with simplification is that we easily forget the truth, and the time always comes when we must do the hard work of managing complex, open ecosystems.

DevOps, amongst other things, is a systems and design thinking approach to the problems that enterprises face in developing and managing IT capabilities. If you understand this starting point, then you will be able to see the priority and importance of each of the pillars and practices. 

One practice that gets a lot of attention from DevOps thought leaders, and from Etsy CTO John Allspaw in particular, is the "blameless post-mortem."

A Service Design Approach to Systems Thinking

Bringing new holistic approaches to incident management playbooks is critical for a sustainable and meaningful DevOps transformation because:

  • How individuals, departments and enterprises behave in the wake of an impactful outage is the test of character and commitment to the DevOps way
  • The operating models of IT groups are most vulnerable to a call for change — either from within or without — in the aftermath of an incident
  • The incident management process is fertile ground for teaching the principles of DevOps to newcomers

In the wake of an incident, no one can reasonably deny that something (or several things) went wrong and allowed the incident to occur. And as we all know, acceptance is the first step towards meaningful progress.

After fixing whatever damage an incident may have caused, the post-mortem process begins. First comes the creation of a "root cause analysis" (RCA) artifact. An RCA has many purposes, most notably to:

  • Allow stakeholders to understand the impact of the incident 
  • Create the awareness required to avoid the event in the future and/or to minimize its impact if prevention is not possible
  • Strengthen a culture of accountability and continuous improvement

Form Doesn't Always Mean Formalism: An RCA Template

While dogmatic formalism has no place in a mature DevOps shop, some form and discipline are required to teach development and operations teams the basic mechanics of running a complex IT implementation within a business context.

Many shops breeze over the format and content of an RCA. But a significant amount of research points to a connection between the "hows" of post-mortem activities and the overall performance of an IT shop.

It's with this idea in mind that the following content template was designed for an RCA:

Issue: A brief one- to two-sentence description of the incident

Impact: A paragraph with details on the stakeholder impacts that result from the incident, inclusive of any and all quantitative metrics that can be gathered from system telemetry, including but not limited to:

  • Start and end date-times for the incident
  • Business channels affected
  • Number of users affected
  • Financial KPIs impacted with statements of impact if possible
  • Impacts to in-process or future projects
  • Impacts to team morale 

Direct Cause: A brief one- to two-sentence description of the action or inaction that preceded the impact, de-identified as much as possible. When writing this section, avoid:

  • Using names of individuals
  • Implying that the direct cause can stand alone as the actual reason for the incident, as this is almost never the case

A good example: "A configuration file meant for internal service routing was manually deployed to the wrong location"

A bad example: "Bill Carlson from the operations team copied the service.xml file to the wrong location"

This is the point in the RCA where the focal point of the analysis shifts from the incident to the enterprise. Beyond an incident and a proximal cause lies a fertile field of potential changes that can be made. Incidents don't happen because an individual made a mistake. Incidents happen because enterprises have not yet created the conditions to prevent them.

Why It Wasn’t Caught/Fixed Sooner: A paragraph with details on the contextual conditions that allowed the incident to remain in place for the duration of the impact, including but not limited to:

  • Gaps in tests and detection methods that did not account for the incident
  • Breakdowns in vigilance (not attributed to any individual) that allowed the incident to go undetected

Systemic Causes: A paragraph with details and/or bullet points on the contextual conditions that allowed the incident to occur, including but not limited to:

  • Conditions that contributed to teams and individuals being unable to detect or prevent the incident (e.g., an after-hours deployment will by definition have lower levels of vigilance than a deployment during business hours)
  • Inventories of manual processes that have yet to be automated
  • Gaps in funding and/or staff for efforts that would have detected, prevented or obviated the incident
  • A lack of roadmaps or lack of support for roadmaps to increase system and infrastructure maturity or to remove obsolete or unsupported system components 

Near Term Actions: A paragraph with details and/or bullet points on the efforts that teams will take to prevent recurrence of the incident, including but not limited to:

  • Changes in policies and procedures designed to prevent recurrence of the incident (or to mitigate any future recurrence)
  • Low cost, low effort changes to testing and other systems infrastructure designed to prevent recurrence of the incident (or to mitigate any future recurrence)
  • Training and educational efforts with staff designed to prevent recurrence of the incident (or to mitigate any future recurrence)

Long Term Recommendations: A paragraph with details and/or bullet points on the larger efforts that have to be taken at the enterprise level to prevent recurrence of the incident, including but not limited to:

  • Changes in policies and procedures that require a long term enterprise commitment (e.g., changes to an enterprise development methodology, changes in how a product or service is sold, etc.)
  • Changes in systems and infrastructure that require a long term enterprise commitment in capital, labor or time (e.g., changes in architecture to eliminate single points of failure, upgrades in hardware and/or software, movement to a technical environment that supports elastic capacity, etc.)
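
For shops that want to keep their RCAs in a consistent, machine-readable shape, the template above could be captured as a small data structure. What follows is only a minimal sketch in Python: the class names, field names and the missing_sections helper are illustrative assumptions rather than part of any standard, and the sample values are made up for demonstration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Impact:
    """Quantitative impact details (hypothetical field names)."""
    started_at: datetime
    ended_at: datetime
    channels_affected: list[str] = field(default_factory=list)
    users_affected: int = 0

    @property
    def duration(self) -> timedelta:
        # Length of the incident, from start to end date-time.
        return self.ended_at - self.started_at


@dataclass
class RootCauseAnalysis:
    """One field per section of the RCA template (names are assumptions)."""
    issue: str
    impact: Impact
    direct_cause: str  # de-identified: no names of individuals
    why_not_caught_sooner: list[str] = field(default_factory=list)
    systemic_causes: list[str] = field(default_factory=list)
    near_term_actions: list[str] = field(default_factory=list)
    long_term_recommendations: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of text sections that are still empty."""
        sections = {
            "issue": self.issue,
            "direct_cause": self.direct_cause,
            "why_not_caught_sooner": self.why_not_caught_sooner,
            "systemic_causes": self.systemic_causes,
            "near_term_actions": self.near_term_actions,
            "long_term_recommendations": self.long_term_recommendations,
        }
        return [name for name, value in sections.items() if not value]


# Example with made-up values, reusing the "good example" direct cause.
rca = RootCauseAnalysis(
    issue="Internal service routing failed for a subset of requests.",
    impact=Impact(
        started_at=datetime(2024, 1, 1, 22, 0),
        ended_at=datetime(2024, 1, 1, 23, 30),
        channels_affected=["web"],
        users_affected=1200,
    ),
    direct_cause=(
        "A configuration file meant for internal service routing "
        "was manually deployed to the wrong location"
    ),
    systemic_causes=["Configuration deployment is still a manual process"],
)
print(rca.impact.duration)
print(rca.missing_sections())
```

A check like missing_sections is one way to notice an RCA that stops at the direct cause and never reaches the systemic causes or the near and long term actions, which is where the enterprise-level learning happens.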

Failure Is Not Fashionable, Learning Is

Failure has been riding high for the last few years as more and more shops adopt agile, experimentation and DevOps. But without taking advantage of the learning opportunities that come along with failure, enterprises are doomed to repeat the same mistakes over and over again, and will never transcend the current limitations of their operating model.

Just like “acceptance of failure,” the authoring of holistic RCAs is only one part of what is necessary to create a modern learning enterprise. If your shop writes enough of them, maybe your workforce will stop looking at problems through the lie of an overly simplified, one-dimensional lens.

Title image "Spilt Milk" (CC BY-SA 2.0) by Lee Jordan