What Financial Crises can Teach us about SRE
In light of the pandemic, the global economy is suffering. While this downturn is extreme, it’s not irreparable. In fact, after experiencing economic meltdowns such as the Great Depression and the Great Recession, we’ve learned much about how to regulate our economies to prevail through and recover from such upsets.
“Are We Safer? The Case for Strengthening the Bagehot Arsenal” by Tim Geithner, former United States Secretary of the Treasury and former President of the Federal Reserve Bank of New York, focuses on how disaster happens, disaster response, and the craft of financial crisis management. From Geithner’s experience and research, we can draw parallels between how financial crises are managed and how we can view SRE as a crisis prevention and response solution within our organizations.
The key to this parallel is understanding that SRE, like financial regulations, must work both as a preventative measure to ward off possible incidents, and as a response to incidents that have already occurred. As Geithner states, “It’s just as in medicine, where protecting the health of individuals and the public depends not just on immunizations, nutrition, and regular checkups, but also on hospitals and emergency care, and the skills of doctors and nurses. Or as in national security, where the defense of the nation depends not just on diplomacy, espionage, moats and castles, but also on armies, with an arsenal of weapons and a tradition of constant training and the study of the conduct of war” (pg 3).
In short, all crucial systems are built in order to be “safe for failure”. They serve two functions, as they are both vitamins and morphine, treaties and weapons. We must make sure our systems, too, are dually focused.
How disasters happen
Before pinning down how disasters (large-scale incidents and systemic issues) happen within our organizations, let’s examine how they occur within our economy. The Achilles’ heel of financial systems is that they are inherently vulnerable to panics and runs.
People are naturally optimistic and want to believe that the economy is safe. Geithner noted, “This dynamic fuels demand for money-like short-term liabilities, and lowers the perceived risk in financing long-dated illiquid assets. These liabilities are dangerous because they are runnable” (pg 4). Consider the crash of the housing market, for example. People assumed the economy was safe, took on too much credit by purchasing homes that mortgaging practices made available but that they might not have been able to afford, and then were unable to repay their loans.
These systemic shocks are the most difficult to counter. When trust is lost in financial institutions, they experience dangerous runs that threaten the larger economy.
In SRE terms, we can think of this through the lens of technical debt. Disasters happen because we trust in the stability of our system, so we take on too much “credit,” or accumulate too much technical debt. Geithner noted, “actions that seem sensible in terms of future incentives tend to exacerbate panics” (pg 31). We prioritize innovation above all else, and hope that our systems will be able to withstand the new dependencies. However, this leads to larger, more expensive incidents in the long run.
And we are not the only ones trusting in the stability of our systems. Our customers trust in them as well, and when that trust is broken after a critical outage, we are susceptible to runs. This could take the form of churn, or of fewer new customers signing up out of caution.
When these runs occur, they can lead to larger systemic problems, like resource shortages for engineering teams looking to invest in tackling their technical debt. Additionally, after the crisis, SLA contracts may be tightened in order for customers to agree to take on the perceived risk of unreliable services.
In these cases, a reactive code freeze to pay down technical debt might be the only solution available to you. As Geithner puts it in economic terms, “In a panic, there will be no source of private funding or equity capital available at an economic cost or on a scale that can substitute for the resources of the state” (pg 6).
So how can we prevent these systemic issues from occurring?
Preventing systemic crises
Sometimes it may seem like we spend a great deal of time looking for systemic risks, yet they tend to find us first. Perhaps this is because stability breeds instability. When we are comfortable and confident in our systems, it becomes difficult to pinpoint our next big risk. A world worried about the abyss is a safer world than one with less fear, such as the world of 2006, just before the housing market crash.
Geithner states, “Financial crises are not forecastable. They happen because of the inevitable failures of imagination, the limitations of memory, the fact that it is hard to be aware of all our biases and mistaken beliefs. Financial reforms cannot, by definition, give us protection against every conceivable bad event” (pg 16).
So we look to prevention. Yet even this is imperfect. As Lorin Hochstein pointed out in his article, “The Inevitable Double Bind,” “Whenever an agent is under pressure to simultaneously act quickly and carefully, they are faced with a double-bind. If they proceed quickly and something goes wrong, they will be faulted for not being careful enough. If they proceed carefully and something goes wrong, they will be faulted for not moving quickly enough.”
Geithner reiterates this conundrum: “The policy maker thus faces an interesting dilemma. If you use the authority you are given, it is likely to be taken away or constrained. If you don’t use it, you will be blamed for not acting with authority you were given” (pg 29).
How can we move quickly and innovate while still working carefully to prevent systemic failure and extreme amounts of technical debt?
There is no silver bullet to prevent systemic crises. Instead, success relies on vigilance, a balance between innovation and reliability, and a fair bit of skepticism about our own systems.
Responding to systemic crises
Failure is inevitable. Eventually, it is likely that your organization will face a systemic crisis of some sort, no matter how hard you work towards prevention. In this case, responding to this issue requires a two-fold approach. First, you’ll need to address the perception of the issue, both internally and externally. Second, you’ll need to come up with a resolution to this issue. These strategies must work in tandem in order for your systems to be restored to working order and to mitigate any possibilities of panics and runs.
Curating perception during a crisis
While perception is intangible, it can have tangible effects on the outcome of a crisis. For example, look to the European Sovereign Debt Crisis in 2012. During this crisis, President of the European Central Bank Mario Draghi addressed concerns during a speech in London. He famously said, “The ECB is ready to do whatever it takes to preserve the euro. And believe me, it will be enough.” As CNBC writer Silvia Amaro noted, “Draghi’s words rang loudly around the world’s trading rooms, investors believed his commitment and yields fell sharply across the euro zone.”
Draghi’s leadership gave the people back confidence in the system, avoiding a larger crisis. However, this does not mean that we should obfuscate the issues we face. When Zoom CEO Eric Yuan came under fire for Zoom’s privacy and security issues, he addressed the problem head-on with a sincere apology.
In a public blog post, he wrote “We recognize that we have fallen short of the community’s – and our own – privacy and security expectations. For that, I am deeply sorry, and I want to share what we are doing about it.”
Recognizing that an issue requires your attention can also increase confidence, both from customers and from members of the organization. When leadership takes the time to address concerns, it can proactively establish safety and empowerment during crises and lessen the potential damage to the institution or organization. As Geithner states, “Only the government has the ability to arrest a general panic, to offset the collapse in private demand” (pg 7).
Resolving systemic crises
Resolving systemic crises requires shock absorbers to mitigate damage done to trust and confidence as well as a detailed action plan to correct the systemic issues that caused the crisis. As Geithner said, “What determines the severity of the outcome is the quality of policy choices made in the moment” (pg 5).
In finance, these shock absorbers might look like fiscal and monetary policies, such as the PPP loans in response to the COVID-19 pandemic. Within our organizations, this might look like a PR initiative, as well as renegotiation of SLAs and the capital routed to those violations. Determining potential shock absorbers ahead of crises (through disaster planning exercises) can help ease the pressure in the moment as “stronger shock absorbers means that the major financial institutions are better able to absorb losses” (pg 14).
As a PR initiative, we can engage with our communities both to seek out knowledge that can help us through tough times and to share the knowledge we have gained through our own experiences. This community building can garner support from industry thought leaders and community members.
Additionally, we can look for support and testimony from model customers who have had good experiences with our product despite systemic issues, and who still see the merit in what we have to offer. Internally, we can seek this same support from champions and engage our leadership. This can help “break the panic by reducing the incentive for individuals to run from financial institutions and for financial institutions to run from each other” (pg 6). In other words, it can save you from customers and employees deserting you in your most dire moments.
While this may prevent runs for the time being, it won’t stave them off forever if we don’t begin investing in a long-term solution. It’s a tale as old as time, but sometimes we must attempt to spend our way out of crises. In tech, this looks like rerouting your resources to invest your way out of an unreliable system. While this is an uncomfortable undertaking, it’s crucial to preventing further losses.
Constraints on your ability to act
Similar to the constraints on the US due to the aversion towards bailouts, there will likely be constraints that inhibit your ability to react accordingly during a crisis. One of the major reasons behind this constraint is blame.
Rather than face the problem head-on, many organizations will have a knee-jerk reaction during a crisis to blame people for systemic failures. This blame can take many forms; sometimes it means blaming entire teams, facilitating a reorg that’s little more than a renaming. Other times this looks like firing leadership or the proverbial “scapegoat.”
This issue is so common and pernicious during crises that it is a major plot point in both of Gene Kim’s books, The Phoenix Project and The Unicorn Project. In the former, main character Bill receives a promotion after leadership terminates the previous CTO over perceived personal failures. In The Unicorn Project, main character Maxine is sent into “exile” after being made the scapegoat for an outage for which she had no responsibility and in which she took no part.
Gene writes, “She’s seen the corrosive effects that a culture of fear creates, where mistakes are routinely punished and scapegoats fired. Punishing failure and ‘shooting the messenger’ only cause people to hide their mistakes, and eventually, all desire to innovate is completely extinguished” (The Unicorn Project, pg 10).
Additionally, blame can sap the power from decision-makers. If leadership isn’t trusted to make decisions in a crisis due to increased policy and internal bureaucracy, “future economic shocks will likely cause more damage to the economy and impose greater losses on the financial system” (pg 35). Without trust, leadership has no power.
While these cultural issues must be dealt with, they are often not resolvable during the actual crisis. Instead, they are long-tail efforts that require a coordinated response to prioritize for a solution, not just a quick fix. As Geithner noted, “It’s hard to solve a moral hazard problem in the midst of the crisis, without dramatically intensifying the crisis” (pg 30).
Eliminating moral hazard for long-term benefits
In the world of finance, moral hazard is defined as “Any time a party in an agreement does not have to suffer the potential consequences of a risk.” Consider reckless investment bankers. If they face no consequences for irresponsible investments, there is little risk to them in failure. This can encourage them to take on riskier investments at the expense of someone else’s wallet.
In tech we see the same thing. In fact, it’s one of the reasons we have adopted DevOps best practices. Rather than developers simply writing code and throwing it over the wall for ops people to handle, we now ensure that responsibility for the service’s success lies in the hands of both parties.
SRE takes eliminating moral hazard a step further: if a service is too risky, or prone to crisis-level incidents, SREs are responsible for handing the pager back to the developers who wrote the code. This shift in procedure encourages a culture of accountability and reminds developers to ship responsibly.
Additionally, tools like SLOs and error budgets give teams guidelines for how to prioritize, ensuring that they have greater control over the balance between innovation and reliability and thereby decreasing the likelihood of technical debt build-up. After all, if 10X engineers are creating technical debt without recourse, “This dynamic is not self-correcting. Left unchecked, it will simply accelerate” (“Are We Safer…” pg 6).
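To make the error budget idea concrete, here is a minimal sketch of how a team might turn an SLO into a release decision. Everything in it is an assumption for illustration: the `ErrorBudget` class, the `release_policy` function, the request-based SLO, and the 25% caution threshold are hypothetical and not taken from Geithner's paper or from any particular SRE tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (hypothetical model)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served during the window
    failed_requests: int  # requests that violated the SLO during the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; zero or below means it is spent.
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / allowed


def release_policy(budget: ErrorBudget, caution_below: float = 0.25) -> str:
    """Map the remaining budget to a decision (thresholds are assumptions)."""
    remaining = budget.remaining_fraction
    if remaining <= 0.0:
        return "freeze"   # budget spent: prioritize reliability work
    if remaining < caution_below:
        return "caution"  # budget nearly spent: slow down risky changes
    return "ship"         # healthy budget: keep innovating


if __name__ == "__main__":
    window = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(release_policy(window))
```

The point of a policy like this is that the trade-off between innovation and reliability is agreed on before a crisis, so the decision to slow down does not have to be argued in the middle of one.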
The craft of crisis management
At the core of every failure is the opportunity to learn. Crises are an exceptional opportunity to learn, though perhaps one of the most uncomfortable ways to do so. Geithner said, “If you look at the graveyard of financial crises, the variance of choices and outcomes is high, unacceptably high. Given the amount of experience available around the world among practitioners, and the diversity of mistakes we have all made, we should be able to narrow the variance in execution. Yet we tend to underinvest in this process of learning” (pg 34).
Crises don’t cause systemic issues; they only surface existing ones. For example, COVID-19 has taught us a great deal about our ability to function during a pandemic, but the underlying systemic issue existed long before it. FEMA, WHO, and Bill Gates all predicted a crisis of this scale, yet we prioritized short-term gains (other categories of public spending besides public health crisis prevention) over long-term preparation.
If we knew something of this magnitude was coming, why didn’t we work harder to prevent it? Risk prevention is always seen as a cost, even after benefits are realized. However, if we don’t invest in learning, we will continue with this “tragic life cycle of crisis intervention and political reaction” (pg 29).
While finding the mental stability and space to learn during tumultuous times can feel next to impossible, it’s crucial that we make an effort to do so. Learning helps us flow with the waves rather than be swept away by them.
Operationalized, this means treating every incident, no matter how big or small, as an opportunity to learn and grow. Write incident retrospectives, read them, talk about them and see what they say about your system’s overall health. Treasure those moments of learning and remember, “Failure has its merits. It’s important for incentives, for innovation, for efficiency” (pg 8).
If you enjoyed this, check out these resources:
- Site Reliability Engineering for Business Continuity
- SRE for Business Continuity in the Face of Uncertainty
- How SREs can Embrace Resilience During Crises
Ideated by: Charlie Taylor