pliant tree in ice
PHOTO: jef 77

There seems to be a lot of talk, and a lot of articles, about resilience these days.

I have tried to ignore the term, but recently read an interesting piece in Forbes: "What Is True Resilience? (Hint: It’s Not About Managing Risk)." Before we dig into that piece, it is interesting to see what people have said about the difference between "risk" and "resilience."

The Difference Between Risk and Resilience

One academic has written (key sentence highlighted):

"Resilience is essential to living in a world filled with risk. Resilience has historically been defined as the ability to return to the status quo after a disturbing event. However, in the face of a changing climate and growing population, resilience cannot be based on the capacity to recover from the sorts of disasters we have faced in the past, but requires that we build capacity to avoid damage and/or recover from the sorts of disasters we can expect to face in the future. If our goal is a sustainable future, we must understand the risks we will face and prepare for those risks through adaptation and mitigation measures. Resilience is crucial in this endeavor, as it is our capacity to cope with both expected events and surprises. To this end, it is critical that we identify, assess, communicate about, and plan for risks that the future will bring."

The OECD has shared:

"The ability of households, communities and nations to absorb and recover from shocks, whilst positively adapting and transforming their structures and means for living in the face of long-term stresses, change and uncertainty. Resilience is about addressing the root causes of crises while strengthening the capacities and resources of a system in order to cope with risks, stresses and shocks."

Professor Linkov of the University of Connecticut tells us:

"Traditional risk management focuses on planning and reducing vulnerabilities. Resilience management puts additional emphasis on speeding recovery and facilitating adaptation."

The Forbes article is written by a practitioner rather than an academic or consultant. That makes it more interesting, as it is based on experience born of responsibility. The author, Will Grannis, is the leader of Google Cloud’s Office of the CTO. He says his customers are asking how his organization’s services “stay resilient in the face of many unexpected, unpredictable events.”

This experience is of interest:

"Just this week, unprecedented weather patterns across the US pushed many IT and business leaders to virtual 'war rooms' in order to ensure capacity, networking, and applications were instantly and persistently available. But those rooms were in the moment, rapidly assembled and then rapidly disassembled — just like the technology that underpins the real-time applications and services we all depend on. This is the new normal, and it calls for a new model of operations. Rather than setting a fixed reliability as the calculation for contracts and practices, the focus must be on resiliency under any number of conditions."   

Building on that, he says (key sentence and words highlighted):

"True resilience isn't about managing a particular instance of risk, but being ready for anything through the way you operate. Today’s disasters may come from wild, unanticipated success (leading to traffic spikes) as much as devastating unforeseen failure (be that a natural disaster, a political event, or a system configuration error that cascades into a global outage)."

The rest of the article explains what happened at Google Cloud and some of the company's philosophy around architecting its services for the general (not specific) customer. A follow-up article covers its approach to resilient IT.

When a Disaster Recovery Plan Isn't Enough

This reminds me of my own experience when I was a vice president in IT at a large financial institution. One of my responsibilities was to develop a disaster recovery plan for our two data centers. I hired Ann Tritsch as my DRP Coordinator (a direct report at manager level). She led the initial effort and we soon faced an important question: did we need to build separate sections of the plan covering the various causes of a disaster?

Operations already had sound processes in place to address and recover promptly from a short outage. Our task was to determine how data center operations would recover from an event or situation that shut down one or both data centers for a longer period. This could be the result of:

  • A fire
  • An earthquake (we were in Southern California)
  • A flood (we were in an area that could possibly flood if a dam broke or there was an extended period of torrential rain), or
  • Some other reason

At that time, the emerging thinking was that planning should address how you recover, regardless of the cause. That is how we built our plan (with the help of a software solution, I should add).

But the DRP was not enough. We still had to concern ourselves with making sure the likelihood of a disaster and the effect on the business were minimized — given cost and other constraints.

For example, our senior vice president led an effort to determine whether it would be viable to establish what would amount to a mirrored data center. He was looking at the possibility of sending copies of every transaction processed at one data center to the other by satellite (which we did not yet have, and this was before the age of the internet). But the cost was prohibitive. In addition, the two data centers were less than 20 miles apart, so a regional disaster could well affect both.

Ann performed, with the assistance of the operations staff, a review that we would today call a risk assessment. It considered each of the causes we might anticipate and confirmed that we had an acceptable set of measures in place. For example, we considered loss of power and examined the power system and the ability to either switch to a different power station or rely on our battery back-ups. We also recognized that there was a single point of failure in the network where all traffic from outside Southern California passed through a single station; but there was little we could do to minimize the possibility of an outage.

This still was not enough. While there were some causes of a prolonged outage that we could identify, there was always the possibility of an ‘unknown unknown’: something happening that we could not seriously identify as a likely event, such as being hit by a meteor, a pandemic (worse than today’s), or a terrorist attack.

With this in mind, we developed another plan that we called a Disaster Preparedness Plan. The DPP was designed to help us recover from any event (including unknown unknowns) that would cause more than a short disruption of our data center’s operations.

The DPP included a detailed Communications Plan. While we didn’t know with certainty who in management might be required to respond to the event or situation, we developed the necessary structure and processes.

Between these initiatives and plans, we did what we could to make ourselves what would today be called "resilient."

It All Comes Down to Preparedness

What I like about the idea of resilience is that, rather than designing your response around specific foreseeable events and situations, you plan and prepare (as best you can) for what you cannot predict.

To quote the Google executive's Forbes article again:

"True resilience isn't about managing a particular instance of risk, but being ready for anything through the way you operate."

Personally, I believe in monitoring and considering what might happen so you can both include it in decision making and be prepared to respond to foreseeable events and situations. But I also believe in being as prepared as possible to respond to (and mitigate if you can) unforeseeable events and situations.

So, resilience merits our attention in addition to or as an integral part of any "risk management" activity. (As usual, please note that I much prefer managing for success.)

There is one more very important aspect to this discussion.

In the same way that you should be prepared for and resilient to unforeseen adverse events and situations, you need to be agile, and sufficiently aware of and responsive to unforeseen opportunities.

People pay far more attention to the first and far too little to the second.

I welcome your thoughts.