DevOps How To: Learning From Failure

A lie is being told in the meeting rooms and offices of corporate America. The lie is so basic and pervasive that it can be found in institutions throughout the world.

The lie is linear causality. People use linear causality — where cause and effect are always determinable — to simplify what in reality may be much more complicated problems.

Simplification allows people to make progress against multi-dimensional problems which otherwise might appear intractable. The problem with simplification is that we easily forget the truth, and the time always comes when we must do the hard work of managing complex, open ecosystems.

DevOps, amongst other things, is a systems and design thinking approach to the problems that enterprises face in developing and managing IT capabilities. If you understand this starting point, then you will be able to see the priority and importance of each of the pillars and practices.

One practice that gets a lot of attention from DevOps thought leaders, and from Etsy CTO John Allspaw in particular, is the "blameless post-mortem."

A Service Design Approach to Systems Thinking

Bringing new holistic approaches to incident management playbooks is critical for a sustainable and meaningful DevOps transformation because:

How individuals, departments and enterprises behave in the wake of an impactful outage is the test of character and commitment to the DevOps way
The operating models of IT groups are most vulnerable to a call for change — either from within or without — in the aftermath of an incident
The incident management process is fertile ground for teaching the principles of DevOps to newcomers

In the wake of an incident, no one can reasonably deny that something (or several things) went wrong and allowed the incident to occur. And as we all know, acceptance is the first step towards meaningful progress.

After fixing whatever damage an incident may have caused, the post-mortem process begins. First comes the creation of a "root cause analysis" (RCA) artifact. An RCA has many purposes, most notably to:

Allow stakeholders to understand the impact of the incident
Create the awareness required to avoid the event in the future and/or to minimize its impact if prevention is unattainable
Strengthen cultural ethics on accountability and continuous improvement

Form Doesn't Always Mean Formalism: An RCA Template

While dogmatic formalism has no place in a mature DevOps shop, some form and discipline is required to teach development and operations teams the basic mechanics of running a complex IT implementation within a business context.

Many shops breeze over the format and content of an RCA. But a significant amount of research points to a connection between the "hows" of post-mortem activities and the overall performance of an IT shop.

It's with this idea in mind that the following content template was designed for an RCA:

Issue: a brief one to two sentence description of the incident

Impact: A paragraph with details on the stakeholder impacts that resulted from the incident, including any quantitative metrics that can be gathered from system telemetry, such as (but not limited to):

Start and end date-times for the incident
Business channels affected
Number of users affected
Financial KPIs impacted, with statements of impact if possible
Impacts to in-process or future projects
Impacts to team morale

Direct Cause: A brief one to two sentence description of the action or inaction that preceded the impact, de-identifying the action or inaction as much as possible. When writing this section, avoid:

Using names of individuals
Implying that the direct cause can stand alone as the actual reason for the incident, as this is almost never the case

A good example: "A configuration file meant for internal service routing was manually deployed to the wrong location"

A bad example: "Bill Carlson from the operations team copied the service.xml file to the wrong location"
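The de-identification rule above can be partly checked by a script before an RCA is circulated. The following is only a minimal sketch, assuming the team keeps a list of staff names to check against; the function name find_named_individuals and the ROSTER list are hypothetical and not part of any existing tool. It flags a Direct Cause statement that names an individual, using the two examples above as input.

```python
import re


def find_named_individuals(direct_cause: str, roster: list[str]) -> list[str]:
    """Return the roster names that appear in a Direct Cause statement.

    Matching is case-insensitive and on whole names only, so a short
    name does not trigger on a longer word that happens to contain it.
    """
    found = []
    for name in roster:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, direct_cause, flags=re.IGNORECASE):
            found.append(name)
    return found


# Hypothetical roster; a real one would come from wherever staff names are kept.
ROSTER = ["Bill Carlson"]

good = ("A configuration file meant for internal service routing "
        "was manually deployed to the wrong location")
bad = ("Bill Carlson from the operations team copied the service.xml "
       "file to the wrong location")

print(find_named_individuals(good, ROSTER))  # no names found
print(find_named_individuals(bad, ROSTER))   # the named individual is flagged
```

A check like this catches only the first item on the list. Whether a statement implies that the direct cause stands alone still takes a human reviewer.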

This is the point in the RCA where the focal point of the analysis shifts from the incident to the enterprise. Beyond an incident and a proximal cause lies a fertile field of potential changes that can be made. Incidents don't happen because an individual made a mistake. Incidents happen because enterprises have not yet created the conditions to prevent them.

Why it wasn’t caught/fixed sooner: A paragraph with details on the contextual conditions that allowed the incident to remain in place for the duration of the impact, including but not limited to:

Gaps in tests and detection methods that did not account for the incident
Breakdowns in vigilance (not attributed to any individual) that allowed the incident to go undetected

Systemic causes: A paragraph with details and/or bullet points on the contextual conditions that allowed the incident to occur, including but not limited to:

Conditions that contributed to teams and individuals being unable to detect or prevent the incident (e.g., an after hours deployment will by definition have lower levels of vigilance than a deployment during business hours)
Inventories of manual processes that have yet to be automated
Gaps in funding and/or staff for efforts that would have detected, prevented or obviated the incident
A lack of roadmaps or lack of support for roadmaps to increase system and infrastructure maturity or to remove obsolete or unsupported system components

Near Term Actions: A paragraph with details and/or bullet points on the efforts that teams will take to prevent recurrence of the incident, including but not limited to:

Changes in policies and procedures designed to prevent recurrence of the incident (or to mitigate any future recurrence)
Low cost, low effort changes to testing and other systems infrastructure designed to prevent recurrence of the incident (or to mitigate any future recurrence)
Training and educational efforts with staff designed to prevent recurrence of the incident (or to mitigate any future recurrence)

Long Term Recommendations: A paragraph with details and/or bullet points on the larger efforts that have to be taken at the enterprise level to prevent recurrence of the incident, including but not limited to:

Changes in policies and procedures that require a long term enterprise commitment (e.g., changes to an enterprise development methodology, changes in how a product or service is sold, etc.)
Changes in systems and infrastructure that require a long term enterprise commitment in capital, labor or time (e.g., changes in architecture to eliminate single points of failure, upgrades in hardware and/or software, movement to a technical environment that supports elastic capacity, etc.)
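For shops that keep their RCAs alongside their code, the template above can be captured as a simple data structure. What follows is a sketch under stated assumptions, not a prescribed implementation: the class name RCA, its field names and the missing_sections helper are all illustrative choices that mirror the section headings of the template, and the sample issue text is invented for the example.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RCA:
    """One field per section of the RCA template, in template order."""
    issue: str = ""
    impact: str = ""
    direct_cause: str = ""
    why_not_caught_sooner: str = ""
    systemic_causes: list[str] = field(default_factory=list)
    near_term_actions: list[str] = field(default_factory=list)
    long_term_recommendations: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# A draft that stops at the direct cause (sample text is hypothetical).
draft = RCA(
    issue="Internal service routing failed for one business channel.",
    direct_cause=("A configuration file meant for internal service routing "
                  "was manually deployed to the wrong location"),
)

# Lists every section after the direct cause, plus the impact.
print(draft.missing_sections())
```

The point of the helper is to make an RCA that stops at the direct cause visibly incomplete, since the sections on systemic causes and long term recommendations are where the analysis shifts from the incident to the enterprise.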

Failure Is Not Fashionable, Learning Is

Failure has been riding high for the last few years as more and more shops adopt agile, experimentation and DevOps. Without taking advantage of the learning opportunities that come along with failure, enterprises are doomed to repeat the same mistakes over and over again, and will never transcend the current limitations of their operating model.

Just like “acceptance of failure,” the authoring of holistic RCAs is only one part of what is necessary to create a modern learning enterprise. If your shop writes enough of them, maybe your workforce will stop looking at problems through the lie of an overly simplified, one-dimensional lens.

Title image "Spilt Milk" (CC BY-SA 2.0) by Lee Jordan
