Look Upstream to Solve your Team's Reliability Issues
Originally published on Failure is Inevitable.
In “Upstream” by Dan Heath, we explore a variety of problems, ranging from homelessness to high school graduation rates to the state of sidewalks in different neighborhoods within the same city. In each of these examples, Dan discusses how upstream thinking decreased downstream work. Upstream thinking is characterized as proactive, collective action to improve outcomes, rather than reaction after an issue has already occurred.
You can also apply this method to software development.
With technology moving at a breakneck pace, it's difficult to keep up with unplanned work such as incidents and the unknown unknowns that come with increasing software complexity and interdependencies. Yet, we can’t halt development. As Dan points out, “Curiosity and innovation and competitiveness push them forward, forward, forward. When it comes to innovation, there’s an accelerator but no brake” (“Upstream”, pg 224).
We can’t impede innovation, but we can apply Dan Heath’s wisdom from upstream thinking to move away from reactive modes of work and make our teams and our systems more reliable.
Barriers to upstream thinking
Before we can focus on implementing upstream thinking, we should acknowledge common barriers. Dan notes the problem here: “Organizations are constantly dealing with urgent short-term problems. Planning for speculative future ones is, by definition, not urgent. As a result, it’s hard to convince people to collaborate when hardship hasn’t forced them to” (220).
This might make it feel like everything is a barrier to upstream thinking. But Dan separates these issues into three groups: problem blindness, lack of ownership, and tunneling.
Problem blindness is self-explanatory: you are unaware that you actually have a problem. Issues and daily grievances are brushed off as just the way things are.
Consider alert fatigue. When you’re paged so often that you begin ignoring the alerts, you’re exhibiting problem blindness. Not only are you ignoring potentially important notifications, but you’re desensitized and possibly becoming burned out.
In this situation, you might hear people say things like, “Oh, that’s just the way it is. Our alerts are noisy. You can ignore them,” or “I can’t remember the last time I got a weekend off. You’ll get used to it.” Tony Lykke faced this issue and gave a talk about it at SREcon America in 2019. His talk, “Fixing On-Call when Nobody Thinks it’s (Too) Broken,” describes this apathy.
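One way to make this kind of problem blindness visible is to measure how many pages actually required action. The following is a minimal sketch, assuming a hypothetical page log in which each page records whether the responder had to do anything. The `Page` record, its field names, and the sample data are illustrative assumptions, not taken from any particular paging tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """One page sent to an on-call engineer (hypothetical record shape)."""
    alert_name: str
    actionable: bool  # True if the responder had to do something


def noise_ratio(pages: list[Page]) -> float:
    """Return the fraction of pages that required no action (0.0 if no pages)."""
    if not pages:
        return 0.0
    noisy = sum(1 for page in pages if not page.actionable)
    return noisy / len(pages)


def noisiest_alerts(pages: list[Page]) -> list[tuple[str, int]]:
    """Count non-actionable pages per alert, most frequent first."""
    counts: dict[str, int] = {}
    for page in pages:
        if not page.actionable:
            counts[page.alert_name] = counts.get(page.alert_name, 0) + 1
    # Sort by descending count, then by name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Illustrative data only.
sample_pages = [
    Page("disk_usage_warning", actionable=False),
    Page("disk_usage_warning", actionable=False),
    Page("checkout_errors", actionable=True),
    Page("cpu_spike", actionable=False),
]
```

A high ratio, or one alert dominating the list, gives the team something concrete to point at when someone says the noise is “just the way it is.”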
It’s important to grow wise to problems. If you aren’t aware of them, you can’t begin to fix them. Question the status quo. Are there problems within your organization that have been dismissed or swept under the rug? These are sources of problem blindness. As Dan says, “The escape from problem blindness begins with the shock of awareness that you’ve come to treat the abnormal as normal” (37).
A lack of ownership
Another common problem with proposing upstream work is that it’s often voluntary. Nobody will make you do these things. It’s not prioritized as planned work in the context of regular business activities. It won’t get added to sprints, customers won’t put in feature requests for it, and so nobody will be assigned to it.
“What’s odd about upstream work is that, despite its enormous stakes, it’s often optional. With downstream activity—the rescues and responses and reactions—the work is often demanded of us” (41).
While service ownership is a possible solution, it can be difficult to drive large systemic change when faced with large amounts of unplanned, reactive work. It may be seen as an extra burden with no clear payoff, as benefits are hard to quantify explicitly, especially in the short term. This leads to the last major barrier to upstream work: tunneling.
Tunneling is when you have too many problems to solve, so you ignore some to focus on the ones you need to fix. As Dan said, “When people are juggling a lot of problems, they give up trying to solve them all. They adopt tunnel vision. There’s no long-term planning; there’s no strategic prioritization of the issues. And that’s why tunneling is the third barrier to upstream thinking—it confines us to short-term, reactive thinking,” (59).
In short, there is no ability to engage in systems thinking. All cognitive capacity is directed towards resolving the reactive issue at hand. “It’s a terrible trap: If you can’t systematically solve problems, it dooms you to stay in an endless cycle of reaction. Tunneling begets tunneling,” (62).
And tunneling is rewarded! When you solve that hair-on-fire problem, restore service, or fix the bug, you’re celebrated. This is otherwise known as hero culture, and it can also breed toxicity. Dan notes the allure of this as well: “Tunneling is not only self-perpetuating, it can even be emotionally rewarding. There is a kind of glory that comes from stopping a big screw-up at the last second” (62).
This only leads to burnout. There is no light at the end of this tunnel, so to speak. According to Dan, there is only one way to avoid this: slack. “Slack, in this context, means a reserve of time or resources that can be spent on problem solving” (63). Slack means being able to do upstream work. Instead of falling into the trap of being on-call heroes, foster a culture that creates on-call champions.
Applying upstream thinking to SRE
Making a commitment to upstream work is important to dig out of the reactive work hole many teams are in. But how do you begin? Dan has a few methods to share which are pertinent to SRE.
Uniting the right people
Dan believes that one of the most important steps in upstream thinking isn’t system related. It’s human. As people will be the ones solving these issues, we are the first piece of the puzzle, and the most crucial.
There’s a way to do this well. Dan notes that you should try to “...surround the problem with the right people; give them early notice of that problem; and align their efforts toward preventing specific instances of that problem” (88).
For example, you might be bogged down with incidents and unable to tackle the action items stemming from incident retrospectives and operational reviews. These action items sit in the backlog and are not planned for any sprints. To change this, you’ll need to get buy-in from many stakeholders. You’ll need engineers, managers, product teams, and the VP of engineering on board.
“Once you’ve surrounded the problem, then you need to organize all those people’s efforts. And you need an aim that’s compelling and important—a shared goal that keeps them contributing even in stressful situations,” Dan says (82).
Changing the system
Once your team is ready to embark on this journey upstream, you’ll need to work on actually changing the system. This can be one of the most difficult parts as it’s a long-tail effort. Systemic change rarely happens overnight. Instead, you’ll need to create and operationalize processes to drive behavioral change.
“Systems change starts with a spark of courage. A group of people unite around a common cause and they demand change. But a spark can’t last forever. The endgame is to eliminate the need for courage, to render it unnecessary, because it has forced change within the system. Success comes when the right things happen by default—not because of individual passion or heroism” (109).
In our example above, changing the system could take many forms. One method is through slating all follow-up action items from incident retrospectives for planned sprints. If more urgent fixes are dealt with within a sprint or two instead of getting pencilled in for later dates, teams can avoid repetitive incidents and business risk.
You might find that follow-up work isn't getting completed. In that case, you might mandate that all engineers involved in an incident have 48 hours to turn in their post-incident analysis. You should also give them time to work on their narrative uninterrupted.
You might find that some of the work on action items doesn’t cover what your team feels the deeper issues are. Maybe action items are trivial, one-time fixes that will only cover certain edge cases. Set aside time each month where engineers are able to work on projects that they think make the biggest impact.
Finding a point of leverage
Surrounding the problem and creating systemic change are important, but it’s also important to know your leverage points. After all, there will be people who ask, “Why are we wasting time on this when we could be building that?”
Money is often the driving factor. Developer hours are costly, but are they more costly than outages? “A necessary point of finding a viable leverage point is to consider costs and benefits. We’ll always want the most bang for our buck,” Dan notes (127). If your organization is losing thousands or even millions of dollars to outages, the cost-benefit analysis might be much easier; outages are too expensive to continue. However, if outages aren’t causing too much disruption to the bottom line, it can be more difficult to express the need for upstream work.
Dan recognizes this. “One of the most baffling and destructive ideas about preventative efforts is that they must save us money. Discussions of upstream investment always seem to circle back to ROI: Will a dollar invested today yield us more in the long run?” (127).
Many times ROI will be impacted, but in cases where ROI isn’t enough of a payoff for the investment, you can search for other leverage.
In the case of our example problem, we can look at developer happiness. These teams are tired. They’re burned out. Maybe they’ve become apathetic. Investing in upstream work can improve the situation drastically. Additionally, it will save managers and HR the costly work of rehiring, as higher job satisfaction leads to lower turnover rates. As people are an organization’s greatest competitive advantage, one of upstream work’s most important potential outcomes is fostering a healthy culture to retain talent.
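The cost-benefit comparison described in this section can be sketched as back-of-the-envelope arithmetic. This is a rough illustration, not a full ROI model: every figure and parameter name below is a hypothetical assumption, and, as the quote above warns, it deliberately leaves out harder-to-price benefits such as retention.

```python
def annual_outage_cost(outage_hours_per_year: float,
                       cost_per_outage_hour: float) -> float:
    """Estimated yearly cost of outages."""
    return outage_hours_per_year * cost_per_outage_hour


def upstream_investment_cost(engineer_hours: float, hourly_rate: float) -> float:
    """Estimated cost of the engineering time spent on upstream work."""
    return engineer_hours * hourly_rate


def net_benefit(outage_hours_per_year: float,
                cost_per_outage_hour: float,
                expected_reduction: float,
                engineer_hours: float,
                hourly_rate: float) -> float:
    """Avoided outage cost minus investment cost.

    expected_reduction is the assumed fraction of outage hours avoided
    (0.0 to 1.0). It is a guess, and the result is only as good as that guess.
    """
    if not 0.0 <= expected_reduction <= 1.0:
        raise ValueError("expected_reduction must be between 0.0 and 1.0")
    avoided = annual_outage_cost(outage_hours_per_year,
                                 cost_per_outage_hour) * expected_reduction
    return avoided - upstream_investment_cost(engineer_hours, hourly_rate)


# Hypothetical figures: 20 outage hours a year at $10,000 per hour,
# halved by 400 engineer hours of upstream work at $100 per hour.
example = net_benefit(
    outage_hours_per_year=20,
    cost_per_outage_hour=10_000,
    expected_reduction=0.5,
    engineer_hours=400,
    hourly_rate=100,
)
```

With these made-up numbers the avoided cost is $100,000 against a $40,000 investment. When the result comes out near zero or negative, that is the cue to look for other leverage, such as developer happiness.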
Getting early warnings of problems
Most organizations want to know when their developers are unhappy, so that they can take proactive measures to keep them from leaving. The same is true of knowing when customers are unhappy, to prevent them from churning to a competitor. And for a multitude of reasons, organizations also want to avoid expensive SLA violations.
These are all mission-critical signals to have visibility into. But how can we proactively understand which services and contributing factors are most likely associated with these problems?
Early warning systems are important, but you’ll need to know how you intend to use them. As Dan notes, “There’s no inherent advantage to early warning signals. Their value hinges on the severity of the problem...The value also depends on whether a warning system provides sufficient time to respond” (137).
In this case, SLOs can be a good indicator. If a service experiences many outages, unhappiness is likely to follow. SLOs indicate the minimum level of functionality a customer expects before their experience suffers. In this case, they can also be used to detect when developers are likely to feel overwhelmed.
Imagine this team sets SLOs that monitor the availability of a certain feature that developers are often paged for. When it reaches a certain threshold for a predetermined period of time, the team is required to halt feature development per the escalation policy in order to focus on systemic issues that lead to unreliable service. This gives developers the time to refactor code, fix bugs, establish monitoring and automation, and make a more stable service that requires fewer engineering interventions in the future.
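The policy described above, halting feature work once availability has stayed below target for a set period, can be sketched in a few lines. This is a minimal illustration under stated assumptions: availability is measured once per day, and the function name, the 99.9% target, and the three-day period are all hypothetical rather than a recommendation.

```python
def should_halt_feature_work(daily_availability: list[float],
                             slo_target: float,
                             sustained_days: int) -> bool:
    """Return True if availability was below the SLO target on each of the
    most recent `sustained_days` days.

    daily_availability holds one measurement per day, oldest first,
    each as a fraction between 0.0 and 1.0.
    """
    if sustained_days <= 0:
        raise ValueError("sustained_days must be positive")
    if len(daily_availability) < sustained_days:
        # Not enough history yet to say the breach was sustained.
        return False
    recent = daily_availability[-sustained_days:]
    return all(day < slo_target for day in recent)


# Illustrative data: the last three days are all below a 99.9% target.
breached = should_halt_feature_work(
    [0.9995, 0.998, 0.997, 0.9985], slo_target=0.999, sustained_days=3
)

# Here the most recent day recovered, so the breach is not sustained.
recovered = should_halt_feature_work(
    [0.998, 0.997, 0.9995], slo_target=0.999, sustained_days=3
)
```

Requiring the breach to persist, rather than reacting to a single bad day, keeps the policy from triggering on blips while still giving enough warning to respond.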
Our example team has made great strides in upstream thinking, but how can we know what success looks like when we see it? In cases like these, success is often measured by things not happening, which makes effectiveness hard to prove.
As Dan notes, “With upstream efforts, success is not always self-evident. Often, we can’t apprehend success directly, and we are forced to rely on approximations—quicker, simpler measures that we hope will correlate with long-term success” (153).
We’ll need to find a way to measure success, though it won’t be a direct correlation. After all, if we’ve decided that the real problem is a general feeling of unhappiness, it’s impossible to accurately measure an increase in happiness. The process for measuring success is also tricky.
“Getting short-term measures right is frustratingly complex. And it’s critical. In fact, the only thing worse than contending with short-term measures is not having them at all,” Dan says (160).
In this case, we can look at a few metrics:
- Turnover rate in engineering
- SLA violations
- Uptime per rolling window
- Employee surveys
Of course, these are all lagging indicators, but positive trends across these vectors can help quantifiably demonstrate the value of upstream thinking.
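Of the metrics listed above, uptime per rolling window is the most mechanical to compute. Here is a minimal sketch, assuming outages are recorded as non-overlapping start and end timestamps. The function name, the 30-day window, and the sample outages are illustrative assumptions.

```python
from datetime import datetime, timedelta


def rolling_uptime(outages: list[tuple[datetime, datetime]],
                   window_end: datetime,
                   window: timedelta) -> float:
    """Return the fraction of the window during which the service was up.

    Assumes the outage intervals do not overlap one another. Outages that
    straddle the window boundary are counted only for the part inside it.
    """
    if window <= timedelta(0):
        raise ValueError("window must be a positive duration")
    window_start = window_end - window
    downtime = timedelta(0)
    for start, end in outages:
        # Clip each outage to the window before measuring it.
        overlap_start = max(start, window_start)
        overlap_end = min(end, window_end)
        if overlap_end > overlap_start:
            downtime += overlap_end - overlap_start
    return 1.0 - downtime / window


# Illustrative data: one outage straddles the start of the window
# (2 hours inside it), another lasts 3 hours within the window.
sample_outages = [
    (datetime(2023, 12, 31, 22, 0), datetime(2024, 1, 1, 2, 0)),
    (datetime(2024, 1, 10, 0, 0), datetime(2024, 1, 10, 3, 0)),
]
uptime = rolling_uptime(
    sample_outages,
    window_end=datetime(2024, 1, 31),
    window=timedelta(days=30),
)
```

In this example, 5 hours of downtime fall inside a 720-hour window, so uptime is a little over 99.3%. Tracking the same number window after window is what turns it into a trend.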
Avoiding doing harm
While efforts to make systemic improvements are always well-meant, sometimes they can have unintended consequences. Dan notes that, “Upstream interventions tinker with complex systems, and as such, we should expect reactions and consequences beyond the immediate scope of our work” (174).
Sometimes our improvements will break things in the system unintentionally. Sometimes they may actually exacerbate the problem they attempt to solve. We might not even notice when this happens, and if our short-term measurements look to be in good order, we might overlook the consequences.
This is why feedback is so critical. We need to be actively seeking out feedback at every opportunity and making sure there is room for the feedback to be open-ended and qualitative. Systemic improvements are highly complex. We need to be on the lookout for additional tangles, which is why well-defined change management, in the context of how your organization operates, is key.
Dan reminds us of this. “We can’t foresee everything; we will inevitably be mistaken about some of the consequences of our work. And if we aren’t collecting feedback, we won’t know how we’re wrong and we won’t have the ability to change course” (180).
Systemic change is difficult and hairy, but as change is the only constant, an organization’s ability to adapt to change will make or break its success. If we aren’t able to evolve, we will fail. Because of this, it’s truly more important than ever for us to look to upstream methods of problem solving to ensure we aren’t swept away by the current.