Enabling the Stripe and Lyft Platforms Through Modern Safety Science
Originally published on Failure is Inevitable.
Jacob Scott is an experienced engineer and enthusiastic participant in the resilience engineering community, having spent time caring for the technology systems powering high-growth startups as well as unicorns like Lyft and Stripe. He is deeply passionate about how to apply learnings from modern safety science to real, complex socio-technical systems.
Blameless SRE Darrell Pappa recently interviewed Jacob to delve into how his research has informed his career journey and experiences to date, especially in his latest role at Stripe where he helps operate the economic infrastructure of the Internet.
The following transcript has been lightly edited for clarity.
Darrell Pappa: Jacob, it’s great to connect. How's it going?
Jacob Scott: There's actually some interesting Adaptive Capacity Labs and Learning from Incidents stuff on this topic, of “How's it going” during the coronavirus. When you meet someone, it's usually like, "How are you?" And, "I'm good." I think it's helpful to be like, "Well, I'm good, given the circumstances." Right? The world is in an odd and unfortunate place, but given where it is, I'm doing pretty well. How about yourself?
Darrell Pappa: Yeah, especially with the circumstances, it's been challenging for sure, but I'm happy we can connect. I wanted to first kick us off by diving into your background and some highlights from your journey to date.
Jacob Scott: A long time ago I was actually a theoretical computer scientist: approximation algorithms, combinatorial optimization. I did research in that domain in undergrad. I went to grad school. I dropped out with a master's. And then in 2008, when I left grad school and started my career in industry, I joined Palantir. This was an interesting company to join. It was a different president, different time, but with everything that's happening in the intersection of society and technology today, it’s really interesting to reflect on working in a company at the center of that.
I was there from 2008 to 2013, and saw it go from 100-200 people to over 1,000. I was a backend generalist in Java, frequently a tech lead or lead engineer on a variety of projects. Then after Palantir, a friend from grad school convinced me to join as a very early engineer at a startup doing productized machine learning on sales and marketing data, solving the same problem for many SMBs. That was pretty interesting. I had a horizontal portfolio as there weren't that many engineers—maybe there were 10 or 20 engineers max—and so there I did a lot more. It was Python instead of Java. I did a lot of Postgres stuff, data pipelines, external data integrations, all sorts of random things.
Darrell Pappa: Looks like a pretty big change from your previous role.
Jacob Scott: It was, with the intent of optimizing for learning new experiences. It's an approach that I have. Who knows what will make me the most energized and fulfilled? But if I triangulate, I'll learn more by doing something sort of different. You learn a lot, not just technically, but about how the industry works, how Silicon Valley works.
An interesting thing throughout my career, both at Palantir, and then at Infer, Lyft and Stripe, has been the relationship between the business and the technology. There’s always someone on Hacker News saying, "Well, why didn't you use this?" Or, "I could build this thing in a weekend with my buddy." But you can't necessarily build the support arm, the data science arm, the regulatory compliance arm, everything you need to actually have a functioning business. When you're growing and you're the darling of Silicon Valley and you're getting lots of money, you can attract all the right people. This lifts all your boats.
I was at Palantir for about five years, Infer for four years. I left Infer and ended up going to Lyft. At Infer, if you had the AWS root [access], and the log-in, if you were one of the early people setting up the TSDB, basic observability stack, you just kind of did it and hopefully the site wouldn't crash. I went to Lyft with that context.
Lyft obviously has a very sophisticated observability service mesh set up. Matt Klein wrote Envoy at Lyft having seen a lot of stuff at Twitter and AWS that helped inform that. He wrote it with the team at Lyft, and it was interesting to see that technology. Palantir had been early, maybe early 2008 to 2013. People were doing whatever made sense. Then Infer was small. Lyft had an incident program. But despite the sophisticated measures in place, incidents would still arise. Whether they're related to your project or not, they overlap your sphere of ownership and things derailed a bit.
So I got curious about the fact that there were so many smart people — the number two ride sharing company, raised lots of money, very sophisticated technology — but the incidents would sometimes be kind of like, “We didn't have this graph, this detector, we didn't notice this thing."
It went on. Not like a hard failure, not like 500s and everything smashed. But especially if you're dealing with machine learning, something can be wrong and have a financial or user impact failure. You can notice these sorts of grey failures later than you'd like, and you end up with an impactful incident.
You wish it wouldn't have happened, and so it's like, "How did we get here?" How do you make sense out of the fact that there's so much success, and so much on fire? That led me at Lyft to move teams. I joined initially to work on problems in the mapping space, which is interesting for Lyft because time and distance play an important role in which driver you dispatch to what passenger, and they also play in pricing. They're low-level Lego bricks. I moved from there to a team focused on chaos engineering and cross-cutting reliability, working on bot-based load testing that was actually a very successful source of reliability for Lyft. Eventually in the summer of 2019, I ended up leaving Lyft, and was lucky enough to spend about six months exploring resilience engineering.
Amy Tobey is at Blameless now, and she's obviously in that community and amazing. Something I got really interested in was: Whether it's socio-technical systems or cognitive systems engineering or the intersection of modern safety science and software systems, there’s this sort of explanatory power or interesting perspective that could say, “Well... If you look at this dynamic system in a much larger sense, maybe you could see how you could get such a bad break. How could Google go down for six hours last summer? How could Cloudflare ship a regex bomb in their WAF rules?” People want functionality, right? You can't test things perfectly.
And so after being lucky enough to spend some time at the South Park Commons, which is a group in San Francisco for people in between, I decided it was time to go and reapply all this. And so I joined Stripe in February just in time, just before Coronavirus.
Darrell Pappa: Nice. What is it that fascinates you, draws you to resilience engineering? It seems to me it expands outside of software. Amy just did a relevant panel discussion with Ward Cunningham and Tim Tischler from New Relic and Jessica Kerr.
Jacob Scott: Yeah. I need to watch that. I try to keep up with talks online with various folks in the wider Learning from Incidents community, and ground myself in thinking about modern safety science as applied to tech industry startups, unicorns, etc. Resilience engineering actually is one flavor of modern safety science. There's also, for example, high reliability organizations.
Resilience engineering is this sort of academic practitioner core of scientists and researchers such as Hollnagel, Woods, Dekker who write about it and have studied it for a couple of decades. There's an evolving view of what safety means. Leveson, too. It’s all there in factories, cars, medicine, power plants, software systems. We want systems to do some things and not others, sometimes implicitly. That's where edge cases come from. "I would like A plus B to equal C." It's like, "Well, what about overflow, underflow, floats, conversion?" I don't want to think about that; infer what I want. But computers aren't that smart.
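Editor's illustration (not part of the interview): a minimal Python sketch of the "A plus B equals C" edge cases mentioned above. The helper `add_int32` and all variable names are hypothetical, chosen only to show how floats, overflow, and underflow break the implicit expectation.

```python
import math
import sys

# Floats: 0.1 + 0.2 is not exactly 0.3 in binary floating point.
float_sum = 0.1 + 0.2
exact_match = float_sum == 0.3               # exact comparison fails
close_enough = math.isclose(float_sum, 0.3)  # tolerant comparison succeeds

# Overflow: doubling the largest finite float gives infinity.
overflowed = sys.float_info.max * 2

# Underflow: halving the smallest positive float rounds to zero.
underflowed = 5e-324 / 2


def add_int32(a: int, b: int) -> int:
    """Add two integers the way a signed 32-bit register would (wrapping)."""
    total = (a + b) & 0xFFFFFFFF
    # Reinterpret the top bit as the sign bit.
    return total - 0x100000000 if total >= 0x80000000 else total


# Fixed-width integers: the largest int32 plus one wraps to the smallest.
wrapped = add_int32(2**31 - 1, 1)
```

Python's own integers do not overflow, which is why the wrapping behavior has to be simulated here; in languages with fixed-width integers the same addition can wrap silently.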
When I think about resilience engineering or modern safety science in general, it sprawls very quickly. It’s first really embracing the totality, a larger view of a system.
So rather than saying, "I want a service mesh, so I should use Envoy. And if I have my retries and circuit breakers and all these things set correctly, and I have the right dashboards and all of these things, then I'll have a reliable system"...
Resilience engineering would say, "Okay, but who are the people? Who's going to get paged? What's going to happen when they get paged? How ergonomic will the dashboards be? Do you understand these systems? If there's a high-priority feature request from the business, will that draw resources away? You plan to deploy Envoy, does it sort of pause well? Can you go partway into this migration? Are you setting yourself up for a fragile, brittle, all-or-nothing sort of situation?”
Specifically, questions in resilience engineering that I find myself relying on a lot...one is, “How did it make sense at the time?” Which is maybe cognitive systems engineering. People generally don't show up to work to mess up and ruin things for customers or their coworkers. If something bad happened then it's because someone tried to do the right thing, or took a series of actions that they thought would have a positive outcome, but it did not. How did it make sense to them? This is the cognitive perspective.
Another is this Safety-II idea that the work is the work. Based on what leads to promotions, what we see leadership not only say but what they do...people decide every day which corners to cut. If there's an aggressive deadline, what tradeoffs to make. A lot of the time that leads to success, some of the time it leads to failure, but it's not that someone flipped a coin at the start of their day and said, "I'm going to do success work or failure work." They did work, and all those sort of latent variables ended up one way or the other.
I think it's from Ryan Kitchens (and the overall Learning from Incidents community) where we get the idea of a perfect storm, or the “nines are useful” line. You think about what sort of incidents you get into, or what you want to improve or avoid from a reliability perspective. There’s this idea that, with some of the highest-profile incidents, 50 things went wrong at once and they won't go wrong the same way again. It's about contributing factors, and not root causes.
And I don't know when this shifted...obviously people like John Allspaw have been working on this for a long time, and there’s been an increasing presence of folks giving talks on these sorts of things at SREcons and various other places. The interesting challenge now, and one that I'm interested to see manifest in my job at Stripe, is how to take that perspective and map it to success in how to evaluate a process or outcome, in actual systems. Safety is not as interesting to me in the abstract. If you're talking about safety of a hospital, theoretical hospital safety is interesting, but if you can actually figure out how to get better outcomes, even better.
Darrell Pappa: Absolutely. I feel that SRE is kind of opening the door to some of this conversation around understanding the human aspect, and it seems it could be a nice gateway into broadening into this huge field like resilience engineering. I am trying to grapple with where you define the boundaries. It seems like it can continue to push out and grow based on the broad field of psychology, and extend into how we can drill into distributed causes and effects. Can you share how to apply this specifically in the field?
Jacob Scott: Yeah. That's a great question. There’s the fallacy of best practices, which is to say that what successful organizations have adopted and what works is highly context-specific. In terms of applications and where you draw these boundaries, it's going to depend on many factors: where you are in your company, how your leadership is thinking about reliability, for example.
That's a good challenge, to take this down from the abstract. The advice that Richard Cook and John Allspaw gave, which I’ve come to embrace, is “catch a wave” or “start small”. A clear thing that you can do, which intersects with resilience engineering, is to learn from incidents. The Etsy debriefing facilitation guide is a great place to start. The Adaptive Capacity Labs blog has a lot of great resources on what it means to learn from an incident, as opposed to the many other ways that a blameless postmortem could be used in an organization.
So rather than stopping something or telling people what they're doing is wrong, try to find a cohort of people in your organization who are curious and then do the leading work of exploring perspectives in discussion.
This reminds me of something I thought was really interesting from Subbu Allamaraju at Expedia. He did a review of all the incidents that Expedia had had, and classified them. An interesting thing for resilience engineering to reflect on is chaos engineering and learning from incidents. It helps you bootstrap your ability to learn, in this cognitive way: how to succeed at ensuring this incident isn’t worse.
If a new hire was involved in this incident, instead of the senior seasoned person who stepped in, how would that have been different? The fact that you're looking at a real incident, makes it context-specific to your organization. And it's actually a real thing that happened, which is such a rich source. If you look at chaos engineering or other approaches like continuous verification, there's an advanced mode where you're trying to have things fail a little bit all the time so that you can better understand what those modes of work are. But if you think about chaos engineering game days, you're stressing the system, you make a hypothesis, you're trying to see what happens when you have it failing a certain way. That's a hypothetical failure, in QA or whatnot.
Of course, you can actually see it play out live. In your incidents, you have such a richness of data that is concrete. This is no longer abstract. Your customers are impacted, someone was paged at two in the morning. That's the place to start. I think a lot about high alignment and loose coupling and back pressure. A place where I think resilience engineering is interesting is where people make local decisions. “I increased the 9s of this system. I invested in reliability.” But what gives you confidence that actually improves the reliability that your customers see, or your overall goal?
Mickey Dickerson has an essay in Seeking SRE, where the storage team cranks for a quarter and really improves the reliability of the backend. And then the application teams say, "We'll improve our latency; we no longer have to make this many retries because the storage is more reliable." You can consume this safety margin generated somewhere in some other part of this complex system, where no one can fully comprehend it all.
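Editor's illustration (not part of the interview): a rough Python sketch of how a safety margin can be consumed. The numbers are invented, and the model assumes each attempt fails independently, which real systems often violate. The function name `request_success_probability` is hypothetical.

```python
def request_success_probability(per_attempt_success: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - per_attempt_success) ** attempts


# Before: storage succeeds 99% of the time, and the application tries up to 3 times.
before = request_success_probability(0.99, attempts=3)

# After: storage improves to 99.9%, so the application drops to a single
# attempt in order to cut latency.
after = request_success_probability(0.999, attempts=1)

# The local reliability gain did not reach the end user.
margin_consumed = after < before
```

Under these assumed numbers, the end-to-end success rate falls from roughly 99.9999% to 99.9% even though the storage layer became ten times less likely to fail on any single attempt.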
That to me feels like one of the ultimate goals: to help understand how a local change can track to an end-to-end result. But this runs through how an entire organization operates. So it becomes both very complicated on a social or a political level, and also very context specific. If you're some executive who's eager or open in a certain way, then it may be easier. If you're crunching to hit some crazy deadline, you may not get much out of it.
It's leading and lagging indicators. You train and do the marathon and hopefully your time improves. Find like-minded people, and learn from incidents in the time that you can budget or free up for leading work. In terms of the outcome, there it becomes much harder to give generic best practices. You're learning a bunch, so keep an eye on what's happening in the organization or when you can track a path to leverage those learnings. That's the “catch a wave” suggestion.
Darrell Pappa: I definitely agree; I feel like incidents are the entry point. It’s the area where most people can kind of come together in a sense and really feel the same pain, that gets the conversation started. Now I wanted to bring us back to your time at Lyft, where the team developed some really cutting edge solutions to highly complex problems that come from operating distributed systems. What are some of the most interesting technology challenges you encountered there?
Jacob Scott: A really interesting technology challenge, especially for larger organizations, is this bifurcation that happens between infrastructure teams and product teams, for example what it looks like to migrate to Kubernetes. Should you do it, if you have a group that's responsible for providing strong technical primitives around networking and compute and storage and other resources?
A pattern that I would hypothesize is frequent for high growth companies, any Decacorn in the Precambrian era, is that your business is exploding and you have five people on the infrastructure team, and then it’s just “hook or crook” it. That is very mutable infrastructure, many pets not cattle, because that works during that time. But then you may get to a point where the properties of this configuration management or service discovery are not as safe or it's kind of clunky. It's probably not doing enough for the customers in sort of product engineering, but also the people who own and operate it are not that happy with it anymore.
But then how do you build the new thing and keep the old thing running? Because that's where all product is running. And how do you simultaneously deliver stuff that is actually what the customers want? From the outside, it's the hype cycle. It's like, "Okay, I'm an infrastructure engineer. So now VMs are old and busted, I'm onto containers, let's do Kubernetes. Kubernetes is going to be awesome. It has its own inertia." But what is the platform that you're providing to product engineers? Is it going to solve their business problems? The value of the infrastructure team is actually observed at the end-to-end impact it has via product teams delivering features to customers.
While I was at Lyft, I saw startups building service mesh products like control planes built on top of Envoy. However, given Envoy’s roots at Lyft, our control plane was one of those first ones. I think about it the same way you think about cell phones and landlines. If you get cell phones before you have a landline infrastructure, you just put towers up everywhere and then you have cell phones. If you have landline infrastructure, it's like, "Well, when do I get a cell phone? The landline is probably working okay. Do I want to really pay for a cell phone and a landline? The landline does half the stuff the cell phone does."
There's a complexity shell game. You have all these stressors, you have all this growth. Well, how do you handle storage? It's probably not the right moment to migrate from Mongo to DynamoDB. But at some point, your use of Mongo may be getting very long in the tooth. And how do you make those calls?
When I think about technology challenges or problems, it can feel like a frog in the pot. You come into a company that's been operating for a decade that has hundreds or thousands of engineers, and all this stuff has been built up. And so it's like, "Is the technical challenge to build a new system? Is the technical challenge to migrate to that system? Is the technical challenge to build a new primitive?"
It may be my own bias based on where I've worked, but frequently, most companies don't want to build their database; they want to use a database. So the technical challenge is, how do you rapidly adapt technology to let people focus on the business logic as much as they can?
And now we move back to resilience engineering. It's the triangle of safety, efficiency, workload. People jump on new gadgets to try and get their work done. And then this pulls you towards un-sustainability. You're like, "Oh no, don't use it this way." But if you make it possible for someone to make a query without an index and it works for them today, then they will do that. And then in three months, you'll be like, "My costs are up." Or "I can't chart in this way." And they'll say, "Well, but I had to ship this thing."
When I think of technology challenges, I think the industry is good at framing problems like: where should we place our Kubernetes clusters so that we tolerate some AZ or region failure? Or, how should we build a CRD? Or, how should our auto-scaler work based on latency or CPU? I think an underappreciated challenge is: how well do new hires understand this? Does this actually have a net positive impact for people trying to deliver value to customers?
Darrell Pappa: That's a good point. Amy has said, "If the short answer isn't some version of 'it depends', it's not really the right answer. Or it's not really speaking the truth." That completely embodies what you've been saying here; it really depends on your goals and the situation that your team is trying to work towards collectively.
Jacob Scott: The other thing that's interesting is the role. If you have an incident, there's a default to not talk about it that much. If it's an unimportant incident, just sort of put a blog post out for something small. If it actually impacted customers, what gets written is something that is for consumption primarily by people who are paying you money. It has a lot more considerations than learning.
There's a challenge because you're thinking about all this stuff and you're writing code and you're shipping software, building a product. It is challenging. Avoiding an incident is challenging, but you can open source a library and put it on GitHub. But when you're talking about the concrete application of these things, how we go from “It depends” to “I wrote this thing last week and it made something faster or better”, it's challenging to take those learnings and make them legible globally.
It might be the quality of a dashboard, or you're changing some property of a system like splitting up a service into two sub-services that have different traffic patterns. Your choice there, what it depended on, and why it was the right choice just pull on threads. Those threads go to this deep knot that's existed for the past decade or a few years, or maybe they live in the intuition of senior architects in the company. If you think about the concrete path that connects this philosophy or perspective to clear outcomes, in the same way that shipping a new feature is a clear outcome, it is very context-specific. So it can be hard to talk about outside the organization in which it exists.
Darrell Pappa: You’ve also spent some time between Lyft and Stripe working on research. Can you tell me a little bit about that?
Jacob Scott: Between Lyft and Stripe, I went to this place called South Park Commons, which is a community in San Francisco. I was there for about six months. It's for people who are sort of figuring out what they want to spend the next decade doing. That community is great, and I had a lot of fun. While I was there, I had the freedom for self-directed exploration, soaking in material from Twitter, Slack, academic papers, books, talks, stuff outside of software, like Friendly Fire by Snook. It's one of these great examples of complex system failures: a friendly fire shootdown of helicopters in Iraq by fighter jets.
This is a thing that really should not happen, right? The military has many protocols surrounding how to know who should be flying, where, when, how you coordinate between different groups of people. How you do visual detection of the relay, how to determine what you're seeing before you fire a missile. And yet, this actually happened. When you dig in, these local pieces are not perfectly coordinated. The helicopters, the people who approve the helicopters, the fighter pilots, people who approve the fighter pilots, the AWACS people who are supposed to be discovering everything. So it's a perfect storm.
What many people work on with software is really important. I'm happy that I'm not working in a place where mistakes cause people to die. But it's really powerful to notice and learn from these commonalities or parallels from other domains.
That's another big learning of resilience engineering: the human component. Everyone is making these trade-offs. Everyone is cutting corners. Everyone is looking at graphs and doesn't understand all of them. So there is this tremendous amount of knowledge, this history of incidents that we can learn from.
My time at South Park Commons was a really great opportunity to sort of take a pause from being at a high growth, high velocity software company, to just be curious and swim in the ocean and mull over these big topics in a more relaxed environment. And then, again, to be lucky enough to take that perspective back into industry. How do I take this high-level perspective and what does it take for it to become legible, and how does it really improve things for engineers and customers at a real company?
Darrell Pappa: The South Park Commons looks like a really cool place to be.
Jacob Scott: Yes, it's great. I recommend it to anyone who's at that point in their career. It's quite interesting as a community, especially in times like COVID, because historically the physical space has played an important role. But that's resilience engineering. Adaptation is fundamental.
Darrell Pappa: If people reading want to get started with Resilience Engineering, what’s the best way to do that at their companies? Do you have any books or other resources you’d recommend?
Jacob Scott: My favorite resources are the Adaptive Capacity Labs blog, and the Learning from Incidents community blog. Twitter is good; there's a few dozen people to follow who are at the center of this community of resilience engineering intersecting with software. Friendly Fire by Snook is also interesting because it's a story. Which resources help most may depend on where you are in your resilience engineering journey.
Are you someone who's like, "Yes, I understand." Or someone who’s like, "Through my lived experience, I understand blame and sanction and the ways in which this can go wrong." In which case, John Allspaw has many videos on YouTube: pick your favorite one.
If you’re someone who's like, "I've heard about this, I'm curious. It doesn't quite click for me", then try the blogs, or a book like Friendly Fire or the Three Mile Island Report, which walk you through real case studies. The places where this is most stark are environments where it’s like, “How could this happen? We tried so hard. The people here are so smart. We followed all the best practices. How are we still having this incident again? How come what we did last time didn’t work?” Catastrophes happen, cascading failures happen. So the stories where people have gone very deep and which can be teased apart can bring valuable learnings to light.
If you’re interested in joining Jacob and the amazing infrastructure team at Stripe, check out their careers here.
Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.