Natural History of Resilience — Titanic to Chernobyl and Lessons for Complex Systems
‘Resilience’ seems more like a buzzword today. But as the trend graph shows above, we have started to lean into resilience over determinism in an increasingly complex, chaotic and connected world. I will share some historical narratives, practices and examples of building resilience into software systems through a proposed conceptual framework called “MARLA”.
What is Resilience
Traditionally speaking, resilience is the property of “bouncing back” from unexpected circumstances. Such “unexpected circumstances” are highly expected in modern systems. Long-lived code, infrastructure, human-computer interactions, vast scale and domains: each contributes to the ever-growing complexity of the system. For example, at Google’s scale, failures with one-in-a-million odds occur several times a second. “As the complexity of a system increases, the accuracy of any single agent’s own model of that system decreases rapidly.” In this world-view, resilience is not about reducing negatives or errors. It is rather a branch of engineering: how to identify and enhance the “positive capabilities of people in organizations that allow them to adapt effectively and safely under varying circumstances.”
In a slightly different ‘future proofing’ world-view, since outlier events are the ‘new normal’ we must abandon the statistical realm of risk management and embrace our beliefs about an uncertain future. In this approach, rather than bounce-back, we ‘bounce forward’ to a ‘new normal’ and plan for risks that cannot be predicted in advance.
The first model of ‘bouncing back’ is perhaps more applicable to most software systems, so we will focus on that. However, since it is possible to simulate uncertainty in such systems using modern tools (e.g., Chaos Testing), we will integrate the ‘future proofing’ model into our conceptual framework as well.
Why Software Needs Resilience
More than two out of three lines of code running in our systems were not written by anyone who works with us, or even anyone who ever directly worked with us. The lines of code in all the libraries and external frameworks easily dwarf the code written by us. That unknown code then runs on some unknown computer at some unknown location and interacts with a gazillion external services, each with its own non-determinism. Failure is therefore a statistical certainty and, without considerable thoughtful planning and processes, continuous success will remain a miracle. Resilience offers models and tools to conquer this non-determinism and avoid cosmic-scale failures.
In medicine, doctors were once trained on the Zebra principle: “when you hear hoofbeats, think of horses, not zebras”. This is a derivative of Occam’s razor, where the simplest explanation is the most likely one. While the heuristic may still work on isolated humans, in inter-connected technology land it is zebras all the way down! ‘Unexpected is regularly expected’ and ‘prepare to be unprepared’ are thus the best safety policies for organizations.
To extend the metaphor, regular “quality tests” check for known horses while — by induction — unknown zebras zip line into nasty production incidents. Since we can only test for horses (known/risk), we get zebras (unknown/uncertainty).
An example: birds hitting a plane in flight is a very common problem and can cause serious damage. Jet engines are therefore tested by throwing about 8 small and 16 medium-sized bird carcasses (with feathers!) into them while they run at full power. Small bird weight guidance is about 4.5 lb. and medium 8 lb. Most passenger planes are also required to have N engines and a certified ability to fly with (N-1) engines ‘broken’. On January 15, 2009, however, a flock of Canada geese hit US Airways flight 1549. Canada geese easily weigh more than 15 lb. and they travel in flocks of hundreds. Both engines of the Airbus A320 were damaged. This was an inevitable Zebra event that no one would anticipate in any “sandbox” environment. The resulting safe landing of the passengers, known as the “Miracle on the Hudson”, is a great example of a complex system showing resilience to Zebra events.
MARLA — Conceptual Framework for Resilience
In distributed computing “safety” and “liveness” are two inherent attributes of every system. Safety ensures “no bad thing ever happens”. Liveness implies “good (or, expected) thing ultimately happens”.
After working through safety features that work, hundreds of incidents and failures (many of those caused by yours truly!), sifting through history, and studying real-life failures at large, I propose MARLA, a five-headed conceptual framework to build and increase software resilience (and therefore safety properties): Monitoring, Application Design, Responding, Learning and Adapting.
Let us walk through each, starting with a narrative and offer some concrete pointers to work with in real life.
1. Monitoring — How not to miss the Icebergs ahead?
“Observability is Monitoring that went to college”.
Among many probable causes, binoculars were not available in the crow’s nest to give early warning of the iceberg ahead. “There were binoculars aboard the Titanic, but unfortunately, no one knew it. The binoculars were stashed in a locker in the crow’s nest, where they were most needed, but the key to the locker wasn’t on board. That’s because a sailor named David Blair, who was reassigned to another ship at the last minute, forgot to leave the key behind when he left. The key was in Blair’s pocket. Lookout Fred Fleet, who survived the Titanic disaster, would later insist that if binoculars had been available, the iceberg would have been spotted in enough time for the ship to take evasive action. The use of binoculars would have given ‘enough time to get out of the way,’ Fleet said.”
Monitoring has come a long way and has now branched into alerting, real-time log searches, and distributed tracing. Once we expected binary answers from our monitors — “Is the database up”, “Does the storage have enough capacity left” etc. Good, modern monitoring should raise questions — “Why do we have a spike in invoice creation on Monday noon”, “Why did the 4 PM batch job take 20% longer today” etc.
We can generally monitor three sets of things — Network (e.g., % of non-200 response variability), Machines (e.g., database server) and Application (e.g., read:write ratio).
A good framework to create useful monitors across the sets is RED -
Rate (e.g., number of transactions/second; CPU utilization),
Error (e.g., HTTP 500s; Query Timeouts; Deadlocks etc), and
Duration (e.g., 99th-percentile latency for login etc).
Alerting on a large upswing in variability across all three buckets is a good practice. Beyond that, rate metrics are usually alerted on system-specific thresholds (e.g., always alert on CPU > 65%), error metrics often on a binary state or an up/down trend, and application metrics are often paired with domain-specific adjacencies (e.g., is a high login rate standalone, which may suggest an attack, or does it coincide with a recently launched session stickiness parameter?).
Remember, monitoring in a vacuum (i.e., non-actionable metrics) is a drain on the system. If a tree is counted as “fallen” and no one looks into it, the forest still has the same number of trees as far as we are concerned.
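The RED buckets and their thresholds can be sketched in a few lines. This is a minimal illustration, not a monitoring product: the `Request` record, the function names and the default thresholds (1% errors, 500 ms at the 99th percentile) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    status: int          # HTTP status code returned to the caller
    duration_ms: float   # time taken to serve the request


def red_summary(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Summarize one time window of traffic as Rate, Error and Duration."""
    if not requests:
        return {"rate": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 99th percentile, computed with integer math: ceil(0.99 * n).
    rank = (99 * len(durations) + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p99_ms": durations[rank - 1],
    }


def breached(summary: Dict[str, float],
             max_error_ratio: float = 0.01,
             max_p99_ms: float = 500.0) -> List[str]:
    """Return the names of the metrics that crossed their (assumed) thresholds."""
    names = []
    if summary["error_ratio"] > max_error_ratio:
        names.append("error_ratio")
    if summary["p99_ms"] > max_p99_ms:
        names.append("p99_ms")
    return names
```

In practice the thresholds would be system-specific, and each name returned by `breached` should map to a documented action so that the metric stays actionable.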
Important metrics do not need immediate attention. Growth (e.g., TPV), Cost (e.g., Customer Call Rate) and Rate (e.g., % of HTTP-500 errors) metrics are important. Urgent metrics require immediate action or escalation, and any exception should be automatically radiated to the entire team. Uptime, Availability, Critical Errors and serious threshold breaches (e.g., >80% CPU on the database server) are generally urgent metrics. Exception processing for urgent metrics should also be properly documented (e.g., what to do if the midnight process stalls and does not finish within an hour).
A good guideline for leaders is to choose 3 to 5 key metrics and look at them every day (urgent) or every week (important) to track variability. Proactive alerting should, of course, be set for operational teams or on-call engineers to respond to emergencies. But as Yogi Berra reportedly said, “You can observe a lot just by watching”. I have found it very useful to start or end the day by looking at the urgent metrics, and to broadcast the important metrics weekly to the respective teams. Teams then pay attention to important metrics and start asking the right questions themselves. Choosing the right metrics and the medium for cascading them itself becomes a big safety feature of the system.
While the general trend is to create alerts on exceptions, I strongly endorse the notion of “Positive Alerts”. High-performing teams I have worked with recommend radiating the positive outcomes of regular, business-critical processes to team Slack channels as well. An example of a positive alert could be “Successful Completion of Most Important Batch Job” that is expected around 6 PM every weekday. In such cases, we have observed that the absence of the positive alert is immediately noticed and a large failure can often be avoided. That is, if we tell people every day that “No tree fell today” and miss saying it one day, people usually walk to the forest to check whether things are alright.
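The "missing positive alert" check can also be automated rather than left to someone noticing a quiet channel. The sketch below assumes a hypothetical job that records the time of its last success. The function name, the 30-minute grace period and the rule that a success counts only if it falls on the expected day are illustrative choices.

```python
from datetime import datetime, timedelta
from typing import Optional


def positive_alert_missing(last_success: Optional[datetime],
                           expected_by: datetime,
                           now: datetime,
                           grace: timedelta = timedelta(minutes=30)) -> bool:
    """Return True when the expected success message is overdue."""
    if now < expected_by + grace:
        return False  # still inside the expected window; too early to worry
    # Overdue unless a success was recorded for the expected day.
    return last_success is None or last_success.date() != expected_by.date()
```

A scheduler could evaluate this a little after 6 PM every weekday and page someone when it returns `True`, turning silence into an explicit signal.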
2. Application Design — How to let things fail gracefully?
“In theory, theory and practice are the same. In practice, they are not.”
“Programming is a race between software engineers striving to build better and bigger idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning.”
“You’ve baked a really lovely cake, but then you’ve used dog shit for frosting.” — Steve Jobs
The Titanic also had only 20 lifeboats, enough to carry only about half the passengers. Designing domain-agnostic safety measures into an application is similar to having enough lifeboats. In principle, this often boils down to the stack-ranked choice of what to let fail so that more important things can survive. The survivability options of features or services must be a key design choice, objectively made. That is, in what context should this service retry, shed load, fail over, shut down or respond with stale data? And how does that local choice impact the global system (e.g., cascading failures, gap propagation, compensating transactions, manual clean-up etc.)?
Key Principles -
- Failures are inevitable, in both hardware and software.
- A priori prediction of all, or even any, failure modes is not possible. Since many failures are completely unexpected, it is not possible to address each type of failure in a cost-efficient manner. We therefore should embrace a set of safety patterns and implement them in a low-cost, high-fidelity way.
- Modeling and analysis can never be sufficiently complete. Monitoring, especially important metrics, should iterate into the model.
- Human action is a major source of system failures. Application design should over-index on human action variability. This echoes what is known at Google as Hyrum’s Law: with a sufficient number of users of an API, all observable behaviors of your system will be depended on by somebody. Someone could enter “999999” as the year on the one obscure page where our engineers did not bother to validate field input, and that could trigger static date-driven logic to run forever, leading to a system crash (note: the exact same thing happened to me!)
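The “999999” story suggests two cheap defenses: validate the field at the edge, and give date-driven loops a step budget so that one bad value cannot run forever. The sketch below is only an illustration. The bounds `MIN_YEAR` and `MAX_YEAR`, the function names and the budget of 500 steps are hypothetical.

```python
MIN_YEAR, MAX_YEAR = 1900, 2100  # hypothetical business bounds, not from the article


def parse_year(raw: str) -> int:
    """Validate a user-supplied year before any date logic sees it."""
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"year must be digits only, got {raw!r}")
    year = int(text)
    if not MIN_YEAR <= year <= MAX_YEAR:
        raise ValueError(f"year must be between {MIN_YEAR} and {MAX_YEAR}, got {year}")
    return year


def years_until(target: int, start: int, max_steps: int = 500) -> int:
    """Count forward from start to target, refusing to iterate without a bound."""
    steps = 0
    year = start
    while year < target:
        if steps >= max_steps:
            raise RuntimeError("date loop exceeded its step budget")
        year += 1
        steps += 1
    return steps
```

Either guard alone would have stopped the incident. Having both means a missed validation on one page does not become a crash.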
Real life Lessons -
- Every failing system starts with a queue backing up somewhere.
- That somewhere is almost always the database! The more ‘down’ the stack, the more risk it poses to the system.
- RTT — Retry-Timeout-Throttle — are the three most useful and simple to implement safety attributes in remote or distributed calls. Still, at least half the remote calls I have seen implemented do not have automatic retry built-in!
- Generally speaking -
Retry — ensures safety of the Process
Timeout — ensures safety of the Instance
Throttle — ensures safety of Overall System
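A minimal sketch of the three RTT attributes wrapped around one remote call follows. It assumes the wrapped function accepts a `timeout` keyword, as many HTTP and database clients do. `GuardedCall`, `Throttled` and the default limits are hypothetical names and values chosen for the example.

```python
import threading


class Throttled(Exception):
    """Raised when the caller is already at its concurrency limit."""


class GuardedCall:
    """Wrap a remote call with Retry, Timeout and Throttle."""

    def __init__(self, func, retries=2, timeout_s=1.0, max_in_flight=10):
        self._func = func
        self._retries = retries
        self._timeout_s = timeout_s
        self._slots = threading.Semaphore(max_in_flight)

    def __call__(self, *args):
        # Throttle: refuse immediately rather than queue behind a slow provider.
        if not self._slots.acquire(blocking=False):
            raise Throttled("concurrency limit reached")
        try:
            for attempt in range(self._retries + 1):
                try:
                    # Timeout: every attempt carries its own deadline.
                    return self._func(*args, timeout=self._timeout_s)
                except (TimeoutError, ConnectionError):
                    # Retry: only transient errors, and only up to the budget.
                    if attempt == self._retries:
                        raise
        finally:
            self._slots.release()
```

Retries are only safe for idempotent operations, so a wrapper like this belongs around reads or around writes that carry an idempotency key.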
Software design today resembles automobile design in the early ’90s: disconnected from the real world. We then rapidly built traffic-aware cruise control, ABS, backup cameras, collision warning systems and self-parking in the last decade. Just as a car today is expected to have these safety features even in ‘basic’ models, software applications can use standard, open-source libraries (e.g., Hystrix) to implement patterns like RTT with minimal or zero code.
Retry is our friend in simple use cases. Speculative retries, however, allow failures to jump the gap. A slowdown in the provider will cause the caller to fire more speculative retry requests, tying up even more threads in the caller at a time when the provider is already responding slowly. In advanced cases, use backoff (each retry spaced intelligently) and jitter (so multiple servers do not all retry together and create a “thundering herd” problem) to make retry safe. At a truly large scale, rejecting rather than retrying may add more global safety. Such a rejection pattern is also known as Load Shedding or Circuit-breaker.
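Backoff and jitter can be sketched as follows. This uses capped exponential backoff with "full jitter" (each delay is drawn uniformly between zero and the current ceiling), which is one common variant rather than the only one. The function names, the base of 0.1 s and the cap of 5 s are assumptions for illustration.

```python
import random
import time


def backoff_delays(retries, base_s=0.1, cap_s=5.0, rng=random.random):
    """Yield one sleep interval per retry: capped exponential backoff with full jitter."""
    for attempt in range(retries):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        # Jitter spreads callers out so they do not all retry at the same instant.
        yield rng() * ceiling


def retry_with_backoff(func, retries=3, sleep=time.sleep, **backoff_kwargs):
    """Call func, sleeping a jittered, growing interval between failed attempts."""
    delays = backoff_delays(retries, **backoff_kwargs)
    while True:
        try:
            return func()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted; surface the last failure
            sleep(delay)
```

The `sleep` and `rng` parameters exist so the behavior can be tested without waiting. At very large scale, the same call site might instead reject requests outright, as the paragraph above notes.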
Timeouts are more important at the core (e.g., in database) than at the edge. Retry and throttle are much easier to build on the edge than at the core.
3. Responding — What to do during the inevitable?
Failures, especially large-scale failures, are inevitable. What should we do when they happen? After a big crash, air traffic controllers take more time to get back to work than pilots or cabin crews. Psychological studies suggest that involvement in action builds psychological (as opposed to system) resilience. The first rule of responding is therefore: participate and own it! Then try to contain the incident without causing further damage through the containment itself. Containment is not resolution. An incident is contained when regular life or behavior can resume, even if the system is not yet at an optimal level.
Real life Lessons -
- 5-WHYs is a simplistic and underused tool. It can quickly get to an actionable task, but it could be misleading too.
- 1-WHO or 1-WHAT is what our mind tries to quickly arrive at. As someone once said, “For every complex problem there is an answer that is clear, simple, and wrong”. Be careful of quickly arriving at a fix or, worse, pinning the failure on anyone.
Listing human error as a root cause isn’t where we should end, it’s where we should start investigation. Getting behind what led to a ‘human error’ is where the good stuff happens, but unless you’ve got a safe political climate (i.e., no one is going to get punished or fired for making truly non-career limiting mistakes) you’ll never get at how and why the error was made. There is no pilot error, only cockpit errors.
However, reality is often different. A very well known leader, and someone who I deeply admire, told me after a really bad failure — “Every such incident takes away SIX months of your good work.” Empirically, I found it to be quite accurate. Teams grapple with blowback, technical or political, after a large failure for months. Another such reality, specifically in executive leadership, is well illustrated by the parable of the three envelopes. A leader’s declining influence and effectiveness is significantly correlated with either repeating incidents of the same nature or duration of a large-scale incident. Irrespective of the currency — fiat or crypto — the proverbial buck does stop with the leader!
Latitude, longitude and altitude can precisely pinpoint one’s location. In leadership, especially engineering leadership, the three analogs of where you stand are -
Are you delivering big things on time? (i.e., Slippage)
Is your team’s productivity on par? (i.e., Velocity)
How many big accidents from your team recently impacted business? (i.e., Safety)
Therefore, anyone’s natural charter should include resilience for both business and career reasons. Remember, incidents cannot be avoided but we can prevent many, if not all, accidents. While innovations can elevate the career ceiling, big incidents most certainly lower the floor often with the bottom falling out.
In the incident management “War Room” three principles to follow are -
- Keep your calm. The ability to stay calm under production pressure is a magical superpower. Listen to the clip of Gene Kranz later in the essay and note his tone after the “Houston, we have a problem” moment of the Apollo 13 incident.
- Gather data from multiple sources, aggregate and radiate it (e.g., using a whiteboard). Allocate every concrete action item to one specific owner. The fastest way to starve a horse is to assign two people to feed it. The core of successful Crisis Leadership lies in fast problem-to-single-owner mapping.
- List recent change events and prepare progressive rollbacks, where applicable. There is no shame in a retreat if it saves lives.
Lastly, “do no more harm”. As Gene Kranz so eloquently blasted into space — “Let’s solve the problem, but let’s not make it any worse by guessing”. Software problem solving suffers a lot from what the medical field calls “Iatrogenic Deaths” — where the healer indirectly causes death.
4. Learning — Whatever didn’t kill us should make us stronger
“Good judgments come from experience, experience comes from bad judgments”.
- Learning from it is the success of each failure. Incidents are unavoidable but we must build immunity to the incident in the future.
- Always ask three big questions after each big incident -
When did we know it and who knew it first? MTTD (Mean Time To Detect).
When did we contain it? MTTR (Mean Time To Resolve)
Can it happen again tomorrow? MTBF (Mean Time Between Failures)
Broad action areas depending on the answer(s) —
If MTTD is more than a minute OR the incident was caught by a human (not a machine), we have a Monitoring problem.
If MTTR is more than an hour, we have an Application Design problem.
If MTBF is less than a month, we have a WTF problem (more specifically, we are not learning from past failures).
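The three questions and the action areas above can be computed from a simple incident log. The sketch below makes several assumptions: the first rule is read as a detection-time (MTTD) rule, MTTR is measured from incident start to containment, MTBF is the mean gap between consecutive incident starts, and "a month" is taken as 30 days. The `Incident` record and `triage` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime
    detected: datetime
    contained: datetime
    detected_by_machine: bool


def mean(deltas: List[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def triage(incidents: List[Incident]) -> List[str]:
    """Map MTTD, MTTR and MTBF to the broad action areas they point at."""
    if not incidents:
        return []
    ordered = sorted(incidents, key=lambda i: i.started)
    mttd = mean([i.detected - i.started for i in ordered])
    mttr = mean([i.contained - i.started for i in ordered])
    gaps = [b.started - a.started for a, b in zip(ordered, ordered[1:])]

    findings = []
    if mttd > timedelta(minutes=1) or not all(i.detected_by_machine for i in ordered):
        findings.append("monitoring")
    if mttr > timedelta(hours=1):
        findings.append("application design")
    if gaps and mean(gaps) < timedelta(days=30):
        findings.append("learning")
    return findings
```

Running this over the shared incident drive each quarter gives a rough view of which part of the framework needs investment next.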
- Hollywood Homicide Principle: if a murder is not solved within 72 hours, it is unlikely to be solved. Create an “Incident Report” in a uniform format within 72 hours. Share the report transparently via a shared document drive. That drive becomes institutionalized knowledge. Following is a one-page template I like to use, but any structured template should work equally well, or even better.
5. Adapting — One Never Steps in the Same River Twice.
Murphy’s Law: “Anything that can go wrong will go wrong”.
Stigler’s Law: “No scientific discovery or law is named after its original discoverer”
- Anna Karenina Principle: Happy families are all alike; every unhappy family is unhappy in its own way. Each incident is a different zebra. By induction: if they were similar, they would be caught by testing/determinism. You don’t step in the same river twice. Either you will change or the river will. Likely both. Systems, users and their interactions are changing rapidly, even in supposedly less-mutable domains like Payments.
- More than 50% of what we apply at Technology jobs is learned, possibly even invented, in the last four years. We must adapt, and adapt fast.
- Prepare well to be completely unprepared. Be at a constant unease about the system. Everything running smoothly is an aberration and, honestly, a miracle.
The opposable thumb evolved over millions of years. Then we went from the steam engine to auto-pilot in under 200 years. How do we speed up adaptation in our systems? This is where software systems offer a significant advantage over organic systems. We can simulate “adaptation”, and therefore inject uncertainty and prepare the system, very cheaply with modern frameworks, elastic compute stacks and industrial-strength monitoring tools. We can define a “Maturity Model” for such adaptation -
Level 1: Game Days are like vaccines against the known “diseases”. Game days, even “dry” game days where the team simulates a system failure and its reaction, test preparedness. These are the fire drills.
Level 2: “Right shift” and start running Synthetic Transactions in production. These transactions can then be varied in both amplitude and frequency to understand the limits of present tolerance.
Level 3: Highest level resilience engineering with modern systems is often called Chaos Testing. In this level, random events and uncertainties are carefully injected into subsystems to both test and understand the outcomes. “If you don’t test it, the universe will, and the universe has an awful sense of humor.”
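Levels 2 and 3 can be sketched in a few lines each. The first function steps up the amplitude of synthetic transactions until the error ratio exceeds a tolerance (frequency can be varied the same way), and the second wraps a call so that a fraction of invocations fail at random. All names, the 1% tolerance and the 10% failure rate are assumptions for illustration, and real chaos tooling adds safeguards such as a limited blast radius and an abort switch.

```python
import random


class InjectedFault(Exception):
    """Marks a failure that the experiment introduced on purpose."""


def find_tolerance_limit(probe, amplitudes, max_error_ratio=0.01):
    """Level 2: step up synthetic load until the error ratio exceeds the tolerance.

    probe(n) is assumed to send n clearly tagged synthetic transactions
    and return how many of them failed.
    """
    last_ok = None
    for n in amplitudes:
        if probe(n) / n > max_error_ratio:
            break
        last_ok = n
    return last_ok  # largest amplitude that stayed within tolerance, or None


def chaotic(func, failure_rate=0.1, rng=None):
    """Level 3: wrap a call so that a fraction of invocations fail at random."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("fault injected by chaos experiment")
        return func(*args, **kwargs)

    return wrapper
```

Tagging synthetic transactions matters: they should be easy to filter out of business reports such as growth and cost metrics.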
A Capability Model — Evaluate Your Team’s Resilience Score
Add “1” for every “Yes” answer to the 10 questions — a score above 7 is good!
Do you have 3–5 key RED alerts for most important processes?
Did machines rather than humans report the last 5 customer impacting incidents?
Does every team member see these alerts in your team channel?
Does your team have “Positive Alerts” as well?
Do your distributed calls have Retry built in?
Do cross-stack calls Timeout?
Do API calls have Throttle set with the backend max capacity in mind?
Do you create the incident report within 72-hours of a customer impacting incident?
Is there a shared, up-to-date document drive for all your incident reports?
Do you have Synthetic Transactions for your key workflows running in production?
Watch out for Zebras — exotic failures are inevitable in today’s complex systems. Disasters are not.
History is a vast early warning system and we can, we must, learn from our failures. That is the success of failures. Emperor Augustus was completely inexperienced in warcraft when he “succeeded” his adoptive father, Julius Caesar. Worse, contemporary historians documented that he regularly suffered from “mysterious physical illness” during early key battles. He was neither physically imposing, nor did he have the elite blood lineage one needed to succeed in Rome in those days. Despite all that, he “re-platformed” Rome from a republic to an empire, defeated several extremely talented generals (including Mark Antony) and paved the way for the Greatest Enterprise ever, the Roman Empire, which lasted 1500 years and survived formidable enemies. Augustus said about his success, “Young men, hear an old man to whom old men hearkened when he was young.”
In today’s increasingly complex systems, failures are inevitable. Disasters are not. Following time-tested strategies, principles and — most importantly — by looking through our monitoring “binoculars” we can spot the iceberg ahead in time to turn the ship around.
Resilience Reference Library
Measurement: A Very Short Introduction: excellent ‘deep introduction’ to the science of measurement. The last chapter, “Measurement and Understanding”, highlights several flaws of relying solely on measurement (Cobra Effect, Body Count, Kill ratio et al). Can be finished in an afternoon.
Drift into Failure: excellent introduction to Complex Systems and Systems Thinking, the cause of and antidote to large-scale uncertainties. Very enjoyable anecdotes, including how Gorilla glass almost made gorillas extinct (think of mining!)
The Field Guide to Understanding Human Error: the fallacy of “root cause” and how to go beyond superficial blaming (i.e., learning)
Leadership in War: immensely enjoyable book of essays on War Leadership, from Hitler to Stalin to Thatcher. Stalin’s WW2 strategy is summed up as “in the end enough quantity becomes quality”.
Leadership in Turbulent Times: slightly less useful than the above, a very America-centric analysis of the Crisis Leadership of Lincoln, Teddy Roosevelt, FDR and LBJ. The LBJ section is the most interesting, as he fully knew he was going against his own and his party’s interest to do the right thing.
Futureproof: lays out the “bouncing forward” model of resilience. Very applicable to really wicked problems like climate non-determinism, nuclear energy and so on. Surprisingly readable.
Chernobyl (show): gripping drama that covers the science, emotion and human frailty around the epochal event. Must watch!
Release It! (2nd Ed): everything one needs to design “safe” applications. Excellent treatment of RTT and many other useful patterns, with real-life war stories. Cannot recommend it highly enough if you are interested in just the development side of things.
How Complex Systems Fail: canonical reading. A few short pages condense the overarching themes of failures. Also, watch the many YouTube videos by Richard Cook (the author) for many more insights.