The Gist
- Test rigorously before deployment. Ensure comprehensive testing and staged releases to catch issues early.
- Diversify vendor dependencies. Avoid putting all your eggs in one basket by managing single-vendor risks.
- Prepare for crisis. A well-defined, tested crisis management plan is crucial for minimizing brand damage.
CrowdStrike's recent global outage provides valuable insights for CIOs, CISOs, and even CMOs on handling critical system updates and maintaining operational stability. It suggests that IT organizations need to more consciously address the risks of single-vendor dependency across the board.
Delta Airlines' struggles in particular highlight that brand damage can come from more than CrowdStrike's faulty update: aging infrastructure and inadequate recovery strategies also take a toll. To mitigate that impact, organizations need to proactively adopt measures that ensure seamless change management and protect their customers' experience.
CrowdStrike's Lessons: Strengthening Testing, Deployment and Crisis Management
CIOs in this discussion openly shared the lessons they garnered from CrowdStrike's handling of change and release management. Here are the key takeaways that I extracted from their discussion:
- Implement comprehensive testing protocols: CrowdStrike needs to improve the depth and coverage of its testing. It also needs to allow its customers to stage or delay deployment of all updates for mission-critical systems.
- QA Process and Deployment: It's essential to have a robust QA process before deploying updates. Like a "canary in the coal mine," deployments should start small so that problems surface before a broader rollout can cause a widespread outage (see the sketch after this list).
- Have a Robust Crisis Management Plan: CrowdStrike and the organizations that it services need to have well-defined coordinated crisis teams and processes that are tested to handle critical situations.
- Consistent Caution in Deployments: When code runs in the OS kernel, every change can have high impact. CrowdStrike needs to treat every release with caution, assuming it could potentially break something, regardless of its past success.
- Deploy Updates in a Staged Manner: CrowdStrike needs to move to staged releases to identify issues before a full rollout. Releasing new code should be done in a controlled environment first.
- Recognize Automated Testing Limitations: Automated testing has limitations, and CrowdStrike needs to recognize that relying on automated testing alone is risky.
- Ensure Rollback Capabilities are in Place: CrowdStrike needs to always have a rollback plan and ensure it is implemented. Manual intervention is not a scalable rollback strategy.
- Clear Communication and Trust with Customers: It is essential that CrowdStrike give customers someone to talk to who can address their issues, especially around releases.
By addressing these areas, CrowdStrike can better manage change releases and mitigate the risks its updates pose to customers.
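To make the canary and staged-release ideas above concrete, here is a minimal sketch of a ring-based rollout with automatic rollback. The helpers (deploy_update, check_health, rollback_update) are hypothetical placeholders standing in for a real deployment pipeline; this is not CrowdStrike's actual release tooling.

```python
"""Minimal sketch of a staged (canary) rollout with automatic rollback.

All helper functions here are hypothetical placeholders, not vendor APIs.
"""
import time

# Deployment rings, ordered from smallest blast radius to largest.
RINGS = [
    ("canary", 0.01),          # roughly 1% of endpoints
    ("early_adopters", 0.10),  # roughly 10% of endpoints
    ("broad", 1.00),           # the rest of the fleet
]

def deploy_update(ring: str, fraction: float, version: str) -> None:
    """Push `version` to the given fraction of endpoints (placeholder)."""
    print(f"Deploying {version} to ring '{ring}' ({fraction:.0%} of fleet)")

def check_health(ring: str) -> bool:
    """Return True if error rates and crash telemetry stay within limits (placeholder)."""
    return True

def rollback_update(ring: str, version: str) -> None:
    """Automatically revert `version` on the affected ring (placeholder)."""
    print(f"Rolling back {version} on ring '{ring}'")

def staged_release(version: str, soak_seconds: int = 3600) -> bool:
    """Promote a release ring by ring, halting and rolling back on any regression."""
    for ring, fraction in RINGS:
        deploy_update(ring, fraction, version)
        time.sleep(soak_seconds)  # let telemetry accumulate before promoting further
        if not check_health(ring):
            rollback_update(ring, version)
            return False  # stop here; broader rings are never touched
    return True
```

The design point the CIOs emphasize is baked into the loop: a failed health check halts promotion and triggers rollback before the release ever reaches the broad ring, which is exactly the blast-radius containment a staged rollout is meant to provide.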
Related Article: The CrowdStrike Outage: When CX Isn't a Priority
Experts Warn: Privileged Code and Lack of Rollback Plans Lead to CrowdStrike Disaster
Dion Hinchcliffe, vice president of CIO Practice at The Futurum Group, says, “I was very surprised to hear CrowdStrike was able to run privileged code in the kernel. Which is where this crash happened. Without being able to do this, it should have been a minor event. One root lesson learned: Be very careful about which vendors you allow to do this.”
Analyst Jack Gold agrees with Hinchcliffe, adding, “How do you release new code without running in a vaulted garden first to see what happens? And why release to all customers at once when you can do a stepped release to find issues? In my opinion, this was a failure to follow best practice.”
Meanwhile, Dennis Klemenz, chief technology officer for Jovia Financial Credit Union, argues, “CrowdStrike should always have a rollback plan and make sure that it's implemented. Neither Microsoft nor CrowdStrike had a way to roll back these changes, and that alone was the critical failure. Manual intervention is not a scalable rollback plan.”
Single-Vendor Risks: Protect Your Marketing with Proactive Management
The major lesson for IT organizations from the outage is that single-vendor dependency, risk management and operational stability are crucial concerns. At a basic level, organizations should establish a risk registry if they don't have one, explicitly including any single-vendor dependencies and their associated risks. Defining a risk mitigation plan, and ensuring it is accessible, is essential. This is a good idea for marketing systems as well.
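As a rough illustration, a registry entry for a single-vendor dependency might capture the fields below. The schema and values are illustrative assumptions, not a standard; adapt them to whatever governance, risk and compliance tooling the organization already uses.

```python
"""Illustrative sketch of a risk-registry entry for a single-vendor dependency.
Field names and example values are assumptions, not a standard schema."""
from dataclasses import dataclass

@dataclass
class VendorRisk:
    vendor: str                  # the single-vendor dependency being tracked
    systems_affected: list[str]  # business systems that depend on the vendor
    likelihood: str              # e.g., "low", "medium", "high"
    impact: str                  # e.g., "minor", "major", "critical"
    mitigation_plan: str         # where the tested recovery/rollback runbook lives
    owner: str                   # who is responsible for invoking the plan

# Example entry: an endpoint-protection vendor with kernel-level access.
registry = [
    VendorRisk(
        vendor="Endpoint protection vendor",
        systems_affected=["point-of-sale", "employee workstations", "check-in kiosks"],
        likelihood="low",
        impact="critical",
        mitigation_plan="runbooks/endpoint-agent-outage.md",
        owner="Infrastructure on-call lead",
    ),
]
```

Even a lightweight structure like this forces the two questions that matter most: what breaks if this vendor fails, and who is responsible for starting the mitigation plan.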
Additionally, IT organizations shouldn’t assume that paying a premium guarantees competence. They should actively insist on high standards in vendors' change control processes.
New Zealand CIO Anthony McMahon says, “Don't be a passenger — insist on high standards in the change control processes.”
At the same time, Manhattanville University CIO Jim Russell is candid: “For most of us, single-vendor dependencies will remain. We need to have plans for what ifs and failure. We need trusted communications with each key dependency. Lastly, we need to be prepared to walk away if the vendor fails our trust repeatedly.”
IT monocultures are inherently fragile, and proactive risk management pays dividends. Even trusted vendors can present high operational risks, so single-vendor dependencies demand closer scrutiny. Business continuity planning remains challenging but indispensable. Effective crisis management, robust communication strategies and the ability to minimize damage are vital, as an organization's reputation is irreplaceable. Be prepared and have a plan for vendor failures, maintaining the flexibility to walk away if trust is repeatedly breached.
Related Article: How the CrowdStrike Software Failure and Outage Disrupted CX
Delta Airlines' Lessons, Plus Brand Damage
Organizations like Delta Airlines faced significant challenges in recovering from the outage, and those struggles affected customers and brand loyalty. The authors of the forthcoming book “Personalized” say that “service satisfaction comes down to moments of truth; for an airline, how quickly lost baggage is returned or a flight delay is resolved.” Despite previous recognition for differentiation, Delta's slow return to service nullified its historical investment in customer experience.
Delta's issues, say CIOs, stemmed from a combination of factors: reliance on third-party suppliers, highly distributed systems, inadequate overall machine management and third-party connections beyond its control. Gold says, “There were so many issues. One is not all machines that run the Delta ecosystem are theirs. They have many 3rd party suppliers. Next, Delta is highly distributed, and their overall machine management is not up to the task. And they have 3rd party connections they don't control.”
While tech debt contributes to such disruptions, Delta’s problems reflect broader issues in architecture, disaster planning, staffing and leadership decisions.
Russell argues, “Tech debt is often a contributory cause for many of our ills. But as with any major disruption we can't rule out bad architecture, insufficient disaster planning, the impact of reduced staffing investment and leadership.”
Meanwhile, other CIOs believe that blaming tech debt alone oversimplifies the problem. It's essential to understand and address the decision-making processes that allow tech debt to persist. This includes recognizing the influence of legacy systems and processes, which played a role in Delta's difficulties. Proper planning, investment and transparent decision-making are crucial to mitigating such risks in the future.
Enterprise architect Ed Featherston added, “Tech debt can be for sure (though some companies survived because their Windows OS was too old to be impacted). Also, I look at it as legacy debt which includes both technology and processes, which I bet both were in play for Delta.”
Building Resilient IT: Proactive Crisis Management and Strong Vendor Backup Plans
In an increasingly digital era, organizations must take proactive steps to mitigate the impact of changes and disruptions; customer experience depends on it. One critical step is to establish a well-defined change control process and a robust risk management process. These help organizations remain proactive rather than reactive during crises. Clear communication plans across the organization are essential to navigate and resolve issues effectively.
CIOs and their teams need to catalog their environment and prepare comprehensive recovery plans. Understanding and distributing roles and responsibilities is crucial.
During a crisis, automation and technology may fail, and reliance on well-prepared people becomes essential. Successful execution of contingency plans, with discrete roles for approval, escalation and remediation, can ensure a swift and coordinated response.
Russell says, “I was pleased when my team executed our plans with discrete roles. I approved the response and communicated. Another engaged contingency escalation paths. Others began remediating 'turfed' systems. Two hours later I gave all clear.”
Meanwhile, organizations should avoid placing all their eggs in one basket by ensuring every critical business process has a backup plan, with clear responsibilities and knowledge dissemination. Open communication is also key to maintaining alignment and readiness.
Additionally, evaluating and understanding vendors' backup and contingency plans is important. Critical vendors must have their emergency planning assessed both during the selection process and regularly afterward to ensure alignment with the organization’s risk management strategies.
McMahon argues that “CIOs should make sure every important business process has a backup plan, people know how to do it, and someone's responsible for starting it.”
Gold adds that “you must also know your vendor's backup/contingency plans, not just your own. Evaluating a critical vendor's emergency planning is necessary during any selection process, and especially once installed at a potential failure point for the organization.”
Having a well-defined crisis management team with specific roles and execution plans makes all the difference. Hinchcliffe suggests that regularly revisiting basic IT assumptions, limiting critical dependencies, reducing IT monoculture, and maintaining and testing good operational recovery procedures are vital. Investing in resiliency, and ensuring governance, risk and compliance processes are practical rather than merely bureaucratic exercises, will help build a robust, resilient IT environment capable of weathering digital-age challenges.
Strengthen IT: Manage Risks and Vendors to Protect Customer Trust
In the wake of significant IT disruptions like those caused by CrowdStrike, it's clear that organizations must prioritize robust risk management and contingency planning to minimize the impact on customers. Hinchcliffe says, “When I was climbing the ranks of FinServ IT back in the day, this was literally the scenario that was repeatedly framed for us: A bad day in IT could be a 'front-page of the WSJ' event. This sometimes made us request large dollars, for things like second data centers.”
Establishing a comprehensive risk registry and mitigation plan, ensuring clear communication and defined roles during crises and regularly revisiting and testing IT assumptions are essential steps. Organizations should diversify their vendor dependencies, evaluate vendor contingency plans and avoid relying too heavily on single vendors.
Ultimately, successful crisis management hinges on proactive planning, effective communication and a well-prepared team, highlighting the need for both technological and human readiness in an increasingly digital era.