How to Build Resilient Hybrid Cloud Systems: Lessons from the CrowdStrike Outage

July 25, 2024 | Business Continuity Insights and Outlooks

Understanding Today’s Complex IT Ecosystem

The current IT landscape has evolved into a hyperconnected mesh of heterogeneous, interdependent software and hardware systems. This didn’t happen overnight, but gradually and continuously over decades.

When we experience an incident like the massive global outage triggered by CrowdStrike, it’s natural to look for blame and attempt to isolate a single root cause. We may even blame the system itself: the pressure to provide real-time software updates, or the pattern of outsourcing responsibility and risk.

However, in such complex systems, the convenience of a single cause is a myth. Multiple mistakes and many people are usually involved, and most of them are invisible. Blaming the system as a whole also does not provide us with much in the way of a solution. Abandon the whole system? Remove all your external dependencies and write every line of code in-house? Not likely.

When we take a step back and examine this catastrophic failure through the lens of the larger IT ecosystem, there are lessons we can take away to improve our systems and processes and reduce our risk of being impacted by such events in the future.

Learning from Failure: The Importance of Resilience

Werner Vogels, CTO of Amazon, once said, “Everything fails all the time.”

He understood that every piece of software or hardware, whether internal, open source, or commercial, is going to break at some point. Accepting this reality allows us to focus on making modern distributed systems resilient, not impervious. How exactly do we do this?

Chaos Engineering

Chaos Engineering involves deliberately introducing failure into your system so you can learn how it reacts and fix weaknesses before they impact end users. How many customers of CrowdStrike had run scenarios to understand the impact if that software failed?
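To make the idea concrete, here is a minimal sketch of a fault-injection experiment. It assumes a hypothetical `scan_file` call standing in for a third-party agent, and a hypothetical `handle_request` function that is the code under test. Neither name comes from a real product; real chaos tooling works at the infrastructure level rather than inside a single process.

```python
import random


class DependencyError(Exception):
    """Raised when the (simulated) third-party dependency fails."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise DependencyError."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyError("injected failure")
        return func(*args, **kwargs)
    return wrapper


def scan_file(name):
    # Hypothetical stand-in for a call into a third-party security agent.
    return f"{name}: clean"


def handle_request(name, scanner):
    """Code under test: it should keep working when the scanner fails."""
    try:
        return scanner(name)
    except DependencyError:
        return f"{name}: scan skipped, queued for retry"


def run_experiment(trials=1000, failure_rate=0.3, seed=42):
    """Run the experiment and return (requests answered, requests degraded)."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_scanner = with_fault_injection(scan_file, failure_rate, rng)
    results = [handle_request(f"file{i}", flaky_scanner) for i in range(trials)]
    degraded = sum("skipped" in r for r in results)
    return len(results), degraded
```

The question the experiment answers is simple: did every request still get a response while the dependency was failing? If `handle_request` had no fallback, the experiment would crash, and you would have found the weakness before your users did.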

Advanced Deployment Techniques

Leveraging techniques like canary deployments can minimize the impact of a bad update. Gradually deploying to a small percentage of targets reveals issues before they affect most users. It appears that CrowdStrike deployed faulty updates to all systems at once, maximizing the blast radius.
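A staged rollout can be sketched in a few lines. This is an illustration under stated assumptions, not a description of how any vendor ships updates: `apply_update` and `is_healthy` are hypothetical callbacks you would supply, and the stage percentages are arbitrary.

```python
def staged_rollout(hosts, apply_update, is_healthy, stages=(1, 10, 50, 100)):
    """Update hosts in growing batches and halt at the first unhealthy batch.

    stages are cumulative percentages of the fleet (assumed values).
    """
    updated = []
    for percent in stages:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, len(hosts) * percent // 100)
        batch = hosts[len(updated):target]
        for host in batch:
            apply_update(host)
            updated.append(host)
        # Check only the hosts touched in this stage before widening the rollout.
        if not all(is_healthy(host) for host in batch):
            return {"status": "halted", "updated": updated}
    return {"status": "complete", "updated": updated}
```

With a fleet of 1,000 hosts and an update that breaks every machine it touches, this sketch halts after the first stage with 10 hosts affected instead of 1,000. In practice you would also wait between stages, since some failures take time to surface.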

Diversifying Your Architecture

A hybrid cloud setup is a great way to ensure you don’t keep all your eggs in the same basket. Using different approaches, technologies, and strategies to achieve the same functionality helps avoid common-mode failures.
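As a rough sketch of what this looks like in code, the example below tries two independent implementations of the same job in order. The provider functions `public_cloud_queue` and `on_prem_queue` are hypothetical stand-ins, and the first one is hard-coded to fail to simulate an outage.

```python
def call_with_fallback(providers, payload):
    """Try each (name, function) pair in order and return the first success."""
    failures = []
    for name, send in providers:
        try:
            return name, send(payload)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(failures))


# Hypothetical stand-ins for two independent implementations of the same job.
def public_cloud_queue(payload):
    raise ConnectionError("region unavailable")  # simulate an outage


def on_prem_queue(payload):
    return f"queued on-prem: {payload}"


providers = [("public-cloud", public_cloud_queue), ("on-prem", on_prem_queue)]
used, result = call_with_fallback(providers, "order-123")
```

The fallback only helps if the two paths do not share the dependency that failed. Two environments running the same agent, the same image, or the same update channel can still go down together.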

Ensuring Continuous Service

Graceful degradation is about designing systems to maintain partial functionality even when parts of the system fail. During the CrowdStrike outage, some airlines provided handwritten boarding passes as a temporary solution. Having solid incident response procedures in place to quickly identify failures and push out fixes is crucial. To ensure effectiveness during an actual disaster, you must regularly practice these procedures.
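Here is one way graceful degradation might look in application code. It is a minimal sketch: the check-in page, the `core` lookup, and the `optional` features are all hypothetical, chosen to echo the airline example above.

```python
def build_checkin_page(passenger, core, optional):
    """Assemble a page from one essential lookup plus optional features.

    core must succeed. Each optional feature may fail without taking
    the whole page down with it.
    """
    page = {"booking": core(passenger), "unavailable": []}
    for name, feature in optional.items():
        try:
            page[name] = feature(passenger)
        except Exception:
            # Record the gap so the UI can show a notice instead of an error page.
            page["unavailable"].append(name)
    return page


# Hypothetical features: the seat map service is down, upgrade offers still work.
def lookup_booking(passenger):
    return f"{passenger}: confirmed"


def seat_map(passenger):
    raise TimeoutError("seat map service not responding")


def upgrade_offers(passenger):
    return ["extra legroom"]


page = build_checkin_page(
    "A. Traveller",
    lookup_booking,
    {"seat_map": seat_map, "upgrades": upgrade_offers},
)
```

The design choice is deciding ahead of time which parts are essential and which are not. Here the passenger can still check in without a seat map, which is the software equivalent of a handwritten boarding pass.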

Continuing to Serve Our Users

The CrowdStrike incident is the latest, and potentially the largest, global IT outage to date. However, our dependence on outsourcing and third-party software, together with the demand for continuous updates, ensures that it won’t be the last.

By anticipating failures and understanding the vulnerabilities inherent in our systems, we can minimize negative impacts to our business and better serve the end user. Learn more about building resilient systems and optimizing your hybrid cloud approach here.

About the Author

Ian Crosby is the Field CTO at Aptum, a global infrastructure and cloud solutions provider. With over 10 years of experience in IT and cloud computing, Ian specializes in developing resilient hybrid cloud systems and guiding businesses through digital transformation.
