
Leadership and Innovation with Instacart's VP of Infrastructure


    Originally published on Failure is Inevitable.

    Blameless CEO Ashar Rizqi recently had the pleasure of interviewing Dustin Pearce in a virtual executive fireside chat and AMA. Dustin is an experienced leader in scaling hyper-growth, cloud-native companies. He is the VP of Infrastructure at Instacart and previously served as Head of Service Engineering at Slack.

    Key takeaways

    Here’s the TL;DR version we pulled from Dustin’s wealth of insights.

    • Behaviors and patterns Dustin set up within his teams that are signals of good service ownership. Dustin believes this is twofold; the cultural factor is even more important than making sure the metrics and data are giving great visibility. Without the culture aspect, nothing else will succeed. Dustin also spoke about getting cross-functional team alignment through SLOs.
    • How to orchestrate alignment to prioritize innovation. Leadership needs to determine non-siloed incentives that help with organization-wide initiatives. Additionally, teams need to remain hungry to avoid resting on their laurels. If they get complacent, it’s likely that their competition will pass them by.
    • What leadership should be doing to help teams in a post-COVID-19 world. This includes compassion, listening, and patience, as we can’t know what our team members’ lives look like during this difficult time. This also requires leadership acknowledging the sacrifices that teams make to keep systems up and running, as well as helping with goal setting that helps people feel effective in their roles. Trust plays a key role in this.
    • Giving visibility into your team's accomplishments, especially when it’s “keeping the lights on.” One way this can be done is by establishing transparency through a catalog of products and services for infrastructure, which includes each service’s towers or stores, parameters, uses, and who staffs that service. This helps manage expectations. Dustin also recommends documenting work systems and requests in order to analyze the time spent and determine what’s worth prioritizing and what isn’t.
    • What service ownership really is and how it differs from traditional ops teams. It’s about creating incentives for innovation and agreeing that delivering delightful customer experiences, not just shipping software, is the best way forward. It’s the opposite of throwing things over the wall. Fundamentally, it’s the concept of you build it, you run it.

    The lightly edited transcript of their conversation is below.

    Ashar Rizqi: One of the reasons why I wanted to have this AMA is because really, the journey that folks have to go through in hyper growth has significance both from a personal and a technology perspective. A lot of attention gets paid to the more outwardly-looking metrics like the number of users that are growing, but what we really miss is that it's about complex systems management. And it's not always about the systems; it's also about bringing together the people, process, and technology to make this smooth-running engine.

    Dustin Pearce: Thanks for having me. A little on my background; I studied science. Microbiology, in fact. My family was into computers. Early to mid '90s, I was introduced to internet programming quite by accident at Genentech, when I was working there and I just fell in love.

    I went to an IT manager in Genentech and said, "Hey, I want to do web pages for the intranet." And the IT manager launched my career because he said, "No, no, no. You don't want to do that here." He's like, "This is going to be huge. You go join a consulting firm."

    So I started off at Lotus Notes Consultant as a web expert even though I didn't know that much about the web. A lot of copy paste later— and I was self taught, just drawing on my childhood experience in programming with my dad, I didn't have a CS degree— I started to develop my career as a professional in internet development.

    The other passion of mine is coaching sports. I think my first head coaching gig ever was when I was 16. I was given a swim team to run. The combination of my coaching experience and my fascination with fixing things and loving the internet drew me to leadership in this space. A lot of my earlier career was spent either doing consulting or healthcare, so it was adjacent to biotech. The stakes got higher and higher, and I started to work on data management systems for pivotal clinical trials for Genentech's biooncology drugs. Consumer internet was a little bit different, and I pivoted towards consumer internet a little more recently. I did the dot com thing—I think everybody did in the '90s—and returned to healthcare. Healthcare was great for me because I was working really close to home, spending a lot of time with my kids, taking them to school. It wasn't the same vibe that you have, where you just kind of live, eat, breathe, and work.

    I was invited to run engineering at babycenter.com, which is a really old website, like 15 years old. There are developers that have been working on the same website for 15 years. It's really remarkable. I was a big proponent of Agile at the time, way back when. I had a lot of really positive experiences building teams and organizations that way.

    I joined Life360, which was about a 100-person startup at the time with maybe a million users. They'd gotten some funding, but definitely needed to get organized and fix the software so it would work.

    I think that was one of the very first times, in the current incarnation of my career, where I found a startup that was growing fast with great potential, but either their software or their organization needed to be more robust to fulfill the promise of a high quality of experience for the customers and build a sustainable approach going forward.

    I tried a lot of things and had some success at Life360, probably made a ton of mistakes, but I loved it. We built that team and that product. To this day, it is still one of my favorite products.

    I loved the people, I loved the product. It was just great. But then when Slack calls to invite you to help them, it’s like the chance of a lifetime. The product that I loved, the culture felt like tailwinds. I knew it was going to be hard. I didn't know how hard, but I knew it was going to be hard. I let the Life360 folks know that I'm going to chase the dream, and I went to Slack.

    And from the minute I hit the ground running... I think my second day was one of the biggest outages ever of Slack, which is a familiar refrain. I don't know if I'm cursed, but when I come to companies, they have a lot of organizational debt, a lot of technical debt, and move super fast. There’s no kind of, “Hey, let's slow down and catch our breath so that we can catch up,” because you've got to move.

    When I was at Slack, I said, "One of the things that really worries me is that the pace that we're moving right now is uncomfortable for humans."

    Human beings, I think, have a finite capacity for change. They need to understand their universe, what success looks like, and what's expected of them when they wake up in the morning, and it's always changing. It's just very unsettling. And I said, "I wonder if this is the new norm."

    This is not just an elaboration of the fact that Slack's moving very fast; this is actually just how business is going to be, and it's going to get faster. I started thinking about this when I was there: “What do we do in that space and how do we adapt our organizational thinking to the speed with which we have to move?”

    I think a lot of the traditional questions that come up in that space are: do you buy or do you build, how much technical debt do you accrue, and when is it too soon to really focus on organizational things, organizational systems, and work systems versus just getting stuff done as fast as you can?

    The concept of service ownership emerged for me there, and it became much more of a passion of mine. Having teams that are federated and move very quickly and somewhat independently which have ownership of a customer experience is the software organizational strategy du jour, if you will. But I don't know that the overall organization knows how to cope with that. How do you have this ability, how do you coordinate, and how do you drive efficiency and operational excellence in an environment that is so federated? Service ownership was a big answer, and focusing in that infrastructure space gave me a lot of experience there that I was excited about.

    Instacart was essentially a replay. I really believed in the online grocery business long before the COVID crisis. I joined Instacart, and as we know, within a couple of weeks, all hell broke loose. We went through a pretty remarkable time.

    Ashar Rizqi: That's a phenomenal story, Dustin. You didn't have a degree in engineering, but in microbiology. You went from that into technology very quickly and worked to understand the deep pain points associated with scaling technology to meet the needs of users and providing good experiences. Now you’ve applied that to a very different context at Slack and now at Instacart, where literally the world is changing in front of your eyes. It's a completely unprecedented scenario.

    You've talked about service ownership and I would love to hear a little bit more about your passion for that. You talked about how you think about reducing the cost of service ownership and you mentioned the word customer experience—teams owning customer experience. I think that's really unique. Could you share a little bit more about what behaviors and patterns you set up within your teams that are signals of good service ownership, or conversely, what you have seen in terms of poor service ownership?

    Dustin Pearce: There are organizations that have a strong sense of culture around service ownership and building on it, and there are organizations that are optimized for zero-to-one shipping. So they're good at making things, they're not as good at owning things. Or they build different organizations to own things.

    When we talk about service ownership, it's rooted in the theory that that division of ownership and making is one that diminishes the quality of the product, creates perverse incentives, and ultimately contributes to the technical debt and drags the overall velocity of the company down.

    At Slack, I used to walk around with my own pithy phrase: “We don't ship software, we craft and care for customer experiences.” We are not shipping boxes here. We create the software and we own it for you. You are dependent upon our ability to provide the service 24/7, 365, especially something like a communications platform. It's like the dial tone of the internet. You really need this visceral sense of ownership, and that means that you're on call and you have to have strategies for managing that. It's not free.

    I still try to figure out how to deal with the struggles I have seen in service ownership. Let's say a typical SRE organization will have a production readiness checklist. It's this 70-page tome of things that you're supposed to think about to write resilient, effective software. Then you'll have a director or a VP of quality, and he or she is banging a fist on the table saying, “Work your backlog and bugs down.” You have the customer service VP, and he or she is saying, “We have all these customer complaints we need to fix.” And then the CSO comes in, and she or he is saying, “Hey, these security vulnerabilities need to be done with an SLA.”

    What I find is that, in these models, especially models that do embrace service ownership, a lot of this collides with the engineering manager. Engineering manager is becoming a significantly difficult role in this world. It is no longer necessarily just supporting engineers and managing your Agile backlog. You have all of these stakeholders.

    Unfortunately, a lot of the stakeholders, like senior leadership, really only have very coarse instruments. They can bang their gavel and say, “We're going to focus on security for the next three months,” or “We're going to focus on technical debt for the next three months.” The company mobilizes this kind of response to stimulus. But it's not sustainable, and it's not even. Not every team, not every service area of your software requires the same investment at the same time. Quite the opposite, especially when you use models where you have federated teams that are making decisions every day about their software, about their users.

    The model that I've seen work, and the one that I'm really chasing now like the golden goose, is how do you provide feedback to those teams to make decisions on a daily basis that are smart? So, how does infrastructure really become a decision support team, as much as it is a platform team? That is probably the thing that people underestimate the most.

    There is no canonical, correct answer. I've seen attempts to do that globally, and it just never works. What’s important is empowering EMs to make great decisions, rewarding your frontline engineering managers for the courage to make a decision, even when it's wrong, and celebrating that.

    Modern service ownership is a lot about observability. Charity Majors does a lot of great work there. If you don't know who Charity Majors is, she’s just so much fun. Every tweet, everything she writes is just a blast to read and very informative. P90 is not a thing anymore. I can't just write off 10% of my users' experience and say well, P90 is good. So I have to have almost individual, even across tens of millions of users, visibility into those experiences. That's really where the future and the direction service ownership is taking now.
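
    Editorial illustration of the P90 point above (not part of the conversation): a minimal Python sketch, assuming a nearest-rank percentile and made-up latency numbers. The user ids, the latency values, and the 1000 ms "slow" threshold are all hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds, keyed by a made-up user id.
latencies_ms = {
    "user-01": 110, "user-02": 120, "user-03": 95, "user-04": 130, "user-05": 105,
    "user-06": 115, "user-07": 125, "user-08": 100, "user-09": 140, "user-10": 4200,
}

# The aggregate looks healthy because the slowest 10% is ignored by definition.
p90 = percentile(latencies_ms.values(), 90)

# Per-user visibility surfaces the experience that the aggregate hides.
SLOW_THRESHOLD_MS = 1000  # assumed threshold, for illustration only
slow_users = sorted(u for u, ms in latencies_ms.items() if ms > SLOW_THRESHOLD_MS)

print(f"P90 latency: {p90} ms")
print(f"Users above {SLOW_THRESHOLD_MS} ms: {slow_users}")
```

    In this toy data set the P90 sits in the normal range while one user waits more than four seconds, which is the kind of experience a percentile summary alone would write off.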

    Ashar Rizqi: You used the example of the infrastructure team at Instacart. It's a decision support team where instrumentation and observability is really key. Can you dive a little bit deeper into what signals or signs a team looks at, to help make those decisions? Of course, that assumes that you have a culture within the company or team that you've created that gives them the freedom or safety to make those decisions. So if you tell us about both of those? What are specific examples of data that shows how those things should be set up and hooked together, and then how do you set that culture?

    Dustin Pearce: I'll start with the culture piece because without that, it really doesn't matter what tools I provide the people on the front lines to make decisions. So I selected Instacart because of their kind of really intense culture of ownership. They were ready. They have this sense of, “Hey, we own this. This isn't ops’ problem. We have to solve these problems.” And we'll talk about how to reduce the cost of service ownership.

    Intrinsically, companies reflect their founders. The leaders are very much instrumental in shaping that culture, and there are all kinds of great phrases that I've heard over the years. One is that your culture's defined by the worst behavior you'll tolerate. That's the advice they give leaders to give them the courage to take a stand for the things they believe in. And yes, please do that. Maybe that sounds obvious, but maybe it's not for everyone.

    If I said the perfect culture is some place that celebrates courageous decision making on the front lines by engineering managers, then you have to invest your time as a founder, as the CEO, as a VP, as a director, as the senior manager, to celebrate the fact that the senior manager, hopefully with the support and alignment of their team and not autocratically, is making courageous decisions. And even when they're wrong, you're celebrating the fact that they're doing it right. There's no shortcuts there, and you have to be consistent. And guess what? You'll lose sometimes.

    I'm really competitive. I'm never a big believer in celebrating failure. It's expected, it's part of the program, you can't expect to always win. But I think that as the company, if you just invest your time and energy and reward and incentivize the behavior that you're looking for, you'll get the culture that you want.

    On the flip side, there are lots of different signals that each individual area is already using. If you're in the SRE world, I'm sure you've heard the term SLO. Looking for those reliability or service level indicators, and defining some objectives for how you measure reliability or availability or correctness have long been around. This is not like we're inventing signals here. I think the key thing here is how do we pull those signals together in a cohesive look.

    For example, these are all the vulnerabilities that I'm responsible for fixing. This is my bug backlog that's pretty well curated, and I have a good sense of what are the critical bugs that are outstanding. Here is my SLO performance, and here is this dashboard.

    Now I'm asking people to make backlog and prioritization decisions based on balancing those things. When they're out of balance, those metrics should tell you that. The nice thing about this is that revealing those signals from quality, reliability, performance, and security to the team not only supports decision-making, but also gives leadership an overview of where the investments need to be made. That was the promise of SLOs in the beginning. You can't just measure your top line availability, your nines that you publish on your website. You have to understand the systems and the teams that own those systems, and the SLO is the promise. If I commit to making that transparent, and we all agree that that's an accurate measure, then the team can now optimize against that and make decisions against it.

    So it's taking that same SLO concept from the SRE book and really just applying it across the entire infrastructure portfolio, whether that's quality, security, reliability, or performance.
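
    Editorial illustration (not part of the conversation): one way to picture "pulling those signals together" is a small per-service scorecard that puts SLO error budget, critical bugs, and vulnerabilities side by side. This is a minimal Python sketch; the field names, the example service, and every threshold are assumptions made up for the example, not Instacart's actual tooling.

```python
from dataclasses import dataclass

@dataclass
class ServiceSignals:
    name: str
    slo_target: float          # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int
    open_critical_bugs: int
    open_vulnerabilities: int

def error_budget_remaining(s: ServiceSignals) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_failures = (1 - s.slo_target) * s.total_requests
    if allowed_failures == 0:
        return 0.0 if s.failed_requests == 0 else float("-inf")
    return 1 - s.failed_requests / allowed_failures

def out_of_balance(s: ServiceSignals, max_bugs=5, max_vulns=3, min_budget=0.25):
    """Return the areas that look under-invested, using hypothetical thresholds."""
    flags = []
    if error_budget_remaining(s) < min_budget:
        flags.append("reliability")
    if s.open_critical_bugs > max_bugs:
        flags.append("quality")
    if s.open_vulnerabilities > max_vulns:
        flags.append("security")
    return flags

# Hypothetical service: 900 failures against a budget of about 1000.
checkout = ServiceSignals("checkout", 0.999, 1_000_000, 900, 2, 7)
print(checkout.name, round(error_budget_remaining(checkout), 2), out_of_balance(checkout))
```

    The point of the sketch is the shape rather than the numbers: the same view that helps a team decide what to put on its backlog also gives leadership a per-service picture of where investment is needed.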

    Ashar Rizqi: I love what you said about needing top-level KPIs or metrics for your particular domain or area, whether they're SLOs or vulnerabilities, and then trusting your team, establishing that sense of safety for the front line managers to make those decisions.

    That's a really powerful framework because many folks that we've talked to are unable to connect the dots, especially in very large enterprise companies that have been operating in a certain way for decades. It becomes challenging; there's just so many moving pieces that it becomes really hard to bring it all together.

    Dustin Pearce: It requires teamwork from leadership. You have to have systems thinking and teamwork, and not local optimization for your metric. As a CTO, how are you incentivizing your senior leadership to operate as a team? That's key to this concept as well.

    Ashar Rizqi: What are examples where you've seen getting that level of alignment working? Let's just use SLOs for the sake of this example. It's really easy to get a dashboard and a particular tool. That's not the most challenging part. One of the biggest challenges is orchestrating the alignment, and that also goes down to the common language that people are going to speak. Can you share with us, how do you have that conversation? How do you set that context amongst the leadership team?

    Dustin Pearce: Fundamentally, I think most companies now agree that you've got to move. You've got to innovate, and you've got to capture marketing growth fast, because if you don't, somebody's going to blow by you. It's just the way it is. The machinery and the opportunity are there to start a company and go really far with three human beings and a laptop. You can get really far with almost nothing. You can just sign up for SaaS everything. There are way more competitors, and they're all moving faster, and they're all hungrier than you, so you have to always remember to move quickly and be efficient.

    I always hated “ruthless prioritization”, but I never found a better term for it. You shouldn't be ruthless in anything. The idea that you are vigorously prioritizing and always challenging your assumptions and never resting on your laurels is really important. I think that most leadership teams agree, especially in this kind of startup software space, that this is the kind of thing that connects us.

    So how do you take that commonality and apply that to all of these elements? Because a lot of the elements potentially are hidden costs. I can't tell you how many stories there are of companies that have accumulated significant technical debt, only to spend 18 months trying to dig out, to watch a competitor go right by.

    It's like, product/market fit, product/market fit, product/market fit, product/market fit, product/market fit. Then all of a sudden, it's like we have product/market fit, and because of cell phones and the way the world works, you are instantly at scale, and you have no sense of ownership or organization, and now you have all of this debt you have to pay off, and product innovation goes to zero. I've even seen shipping literally go to zero, where it's like, stop the press, stop everything, we cannot ship anything until we fix X, Y, Z.

    That's catastrophic. Most people that I run into in this kind of space understand that's out there. What I try to do is try to find ways to give teams much more visibility, long before they reach catastrophic levels. What is the actual hidden vulnerability in security? What is the actual cost of a data warehouse? Everyone's got this data warehouse that's a total wreck and it's like, oh, I'll just outsource it to Snowflake, but that doesn't really fix the problem.

    Helping people see where these things are, by using data to reveal them, aligns the leadership with the problem. You should invest incrementally in it without completely sacrificing our product innovation pipeline, which is the ultimate balance. As infrastructure, it'd be very easy for me because what does the board want from me? They want five nines, and they want ability.

    But if I'm saying that I care deeply about product innovation because that's what's good for the company, I'm already demonstrating that non-local optimization for the greater good. And I think that helps build the bridges that are required to get aligned.
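
    Editorial illustration (not part of the conversation): "five nines" means 99.999% availability. The arithmetic below is a small Python sketch that converts a number of nines into allowed downtime, assuming a 365-day year; the function name is made up for the example.

```python
def allowed_downtime_minutes(nines: int, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted per period at an availability of `nines` nines."""
    unavailability = 10 ** -nines          # e.g. 5 nines -> 0.00001
    return period_days * 24 * 60 * unavailability

for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

    At five nines the yearly allowance is only a few minutes, which helps explain why optimizing for that number alone can crowd out everything else.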

    Ashar Rizqi: I think what you're trying to get to is that ownership aspect. It would be easy to optimize for five nines. That doesn't necessarily mean that there's a sense of ownership because if that's all the company thinks about, then you're right. Things like innovation actually suffer.

    Let's switch gears to one other topic that I want to talk about, which is obviously on everybody's mind: COVID-19. What are some of the behaviors that are changing due to this? What are you doing, or what do you think should be done differently in this post-COVID-19 world to really bring the team together? What does that team cohesion actually look like? What does that mean?

    Dustin Pearce: It's hard. On some level, that's the question of how do suddenly remote teams create human connection and retain some amount of alignment that was coming naturally with co-locating. I think people are learning that perhaps co-locating is not providing the amount of benefit that they thought it was. I've worked in infrastructure for a little while now, and I have contended that I would put tons of engineers in a room and they would just talk to each other via Slack.

    I've also been a member of software development teams that were high bandwidth and we had a team space where we could turn our chairs around and talk, and it was awesome. There is that kind of benefit, that promise of interaction. Those connections are missing right now. Instacart faces the same challenge as everyone else where we have employees who have children at home. They had a global pandemic staring them down. They are personally at risk, they have parents at risk, they have friends who are at risk that they're very concerned about. We've always had this long standing sense of leadership combination of setting a very high bar and striving for excellence, and then really compassionate coaching and walking alongside team members towards that bar. I don't just stand at the top and scream at the cadets to get up over the obstacle. I'm lifting them up and looking for ways around or how to figure out something else for them to do in order for them to feel effective and successful in this environment.

    Leadership has to be in a support mode always, but now more than ever. You should be taking that “leaders eat last” attitude to support your staff and help them find ways to feel effective. Metrics-driven management can be tyrannical, but it can also be empowering because now, instead of coming into work and hoping the boss gives you a Scooby snack in order to define whether you're successful, you have a meaningful metric that you can measure yourself.

    In this world, those types of tactics are very meaningful; giving people missions, metrics, and 100% the flexibility to execute the way that they can— not even the way they want to, just the way that they are capable of. We need a lot of patience, compassion, and listening. I need to assume that I just have no idea what's going on inside your household.

    That has been true through the entire crisis, even from the very beginning. I think what was really hard for Instacart was that at a point, we were doubling as a company every four-ish days. It put tremendous pressure on the overall system. Bringing on 500,000 shoppers in a matter of weeks puts on tremendous pressure. It isn't just an infrastructure thing; every human at Instacart was under a lot of strain and pressure.

    Culturally, we really fed on a sense of responsibility to the shoppers and to the customers. During this crisis, we would get notes and encouragement from people who said, “This is critical. I wouldn't be able to feed my family without you.” We felt this tremendous sense of responsibility. A lot of us made some significant sacrifices, both in time and energy, to try to keep this thing going through that peak around mid-April.

    Being tethered to a computer seven days a week for two months is not free for anyone. Especially the person who's doing it, but also for the family behind them. As a company, we're doing a lot to try to make sure that people have space and time to kind of recoup, to do what they feel is necessary. I'm proud of what Instacart did and what it continues to do. I don't pretend like we've got it figured out. In many ways, I feel like I'm always trying to find ways to prevent burnout and frustration, and help people feel like we do care about them as a person and as a leader. It is a very, very challenging time to do work and just live. So kudos to everyone out there who's trying to sort it out. I look forward to hopefully, someday, someone will be like, "Dusty, I've got it figured out. You're doing it wrong. Let me tell you how to do it right."

    Ashar Rizqi: One thing you touched on that I think you're absolutely doing right is this notion of patience and compassion. I personally haven't seen enough of that. For leaders, particularly folks in technology leadership positions or who are thinking about building a career in technology leadership positions, patience and compassion are two absolute must-have characteristics. Like you said, seven days a week, you're on your computer all the time, and there's this global pandemic going on, there's this heightened sense of fear. Not just the virus, but all the other events around racial injustice happening out there. It's a lot to take in.

    Hearing that the leaders are out there practicing compassion, compassionate listening, and patience with their teams will go a really, really long way once these crises are over. When the next crisis comes around, people have memories. They will remember that these were the leaders that were there and stood by our side. It can be powerful.

    Dustin Pearce: Hyper growth is not that different than any other software, it's just you've got to be a little bit more focused. What's counterintuitive for some people, at least I have found, is you would imagine that a hyper growth company must be tight; you've got to squeeze, you've got to focus and deliver, and everything's got to be firing on all cylinders. Oftentimes it’s the opposite. You need to accept a lot of leaky oil, and you've got to make peace with it and build trust. It's all about trust. I tell people all the time, hyper growth has growth right in the name. We have to grow. And the only way that we can grow is expanding, adding people, and giving our trust to them on day one.

    It's not, “Hey, welcome to Instacart, prove you belong here.” No way. Day one, it's like, “Welcome to Instacart, we can't wait to see what you can do. What would you like to try, and how can we trust you more?” You have to err on that side. As a leader in these hyper growth situations, if you can show some vulnerability, some humanity, like “Hey, I'm just another member of the team that's just doing a different job than y'all, and I'm doing my best. I trust you all and we're in this together,” you will probably have much better results than someone who is vigorously trying to execute by the book. It's not about getting an A on execution. It's really getting an A on humanity.

    Ashar Rizqi: It really sounds like you're building something very special at Instacart. I'm really happy to see that. Now for the audience questions. The first is, how do you make sure the leadership team sees your team's contribution in maintaining technical debt long below the level of a catastrophe? Many people in leadership won't understand this importance until a real catastrophe happens because, “Hey, the lights are on.” However, we don't get rewarded for keeping the lights on.

    Dustin Pearce: Teams who really struggle to feel appreciated because of the nature of the work that they do often experience very high levels of context-switching as they're supporting a lot of different things. Oftentimes, it's invisible.

    One interesting side note about that is there's the idea of, “We keep the lights on, but it's a mystery in how we do it.” This has some advantages from an engineering perspective: people leave you alone. They're not going to be in your grill about your burndown chart. I often tell infrastructure engineers who've never had experience in product engineering, “You would really, really not like the micromanaging components of Agile.”

    I guess the problem with that is that at some level, we say things like, “We keep the lights on, and it's a mystery,” but we can't keep the lights on. It's not possible in this world. We can't ensure 100% uptime. We cannot, as an infrastructure team, guarantee that nothing will break the website. Failure's built in. Of course, when things break, everyone's like, “Well, what's going on? Where are you? I thought you kept the lights on.”

    Our response is very reasonable, which is we can't own everything. I can't review every PR. I don't write the theories.

    To unbreak that cycle that I've seen so many times, we focus on two things. One is a catalog of products and services for infrastructure. Imagine, if you will, a Sears catalog, with the corny pictures and each page is a product. The product could be an availability measurement tool, like Dead Man’s Snitch, that we offer and that teams can use to measure their own availability. Or it could even be something as robust as the abstraction tooling that you use for your deployment chain.

    A service is something where a human needs to provide something. In the catalog, the service has towers or stores, it has parameters, this is how we use the service, this is who staffs that service. And so the catalog defines what you are capable of, and it's all about managing expectations. That's the first thing that you have to have.
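
    Editorial illustration (not part of the conversation): a catalog like the one described could be as simple as structured data. The Python sketch below assumes a handful of fields (parameters, how to use it, who staffs it, a KPI); every entry name, field name, and value is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    kind: str                                       # "product" (self-serve) or "service" (needs a human)
    how_to_use: str
    parameters: dict = field(default_factory=dict)
    staffed_by: list = field(default_factory=list)  # empty for self-serve products
    kpi: str = ""

# Hypothetical catalog contents, for illustration only.
CATALOG = {
    entry.name: entry
    for entry in [
        CatalogEntry("availability-check", "product",
                     "Register an endpoint and read your own availability numbers.",
                     parameters={"interval_seconds": 60},
                     kpi="percentage of services with a registered check"),
        CatalogEntry("deploy-pipeline", "product",
                     "Add the shared deploy configuration to your repository.",
                     kpi="median time from merge to production"),
        CatalogEntry("capacity-review", "service",
                     "File a ticket at least two weeks before launch.",
                     parameters={"lead_time_days": 14},
                     staffed_by=["infra-oncall"],
                     kpi="reviews completed before launch date"),
    ]
}

def classify_request(requested: str) -> str:
    """Say whether a request maps to a catalog entry or is off-book work."""
    return "in-catalog" if requested in CATALOG else "off-book"

print(classify_request("capacity-review"))      # in-catalog
print(classify_request("debug-my-cron-job"))    # off-book
```

    Writing the catalog down this way makes the expectation-setting explicit: anything that does not map to an entry is, by definition, off-book work.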

    Now in infra, in an ideal world, we could just operate off that catalog and if it isn't in our catalog, we don't do it; find someone else. But there is no one else. So we have to operate off-book.
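
    As an illustration only (the interview does not specify any format), a catalog like the one described above could be modeled as a list of small records. This Python sketch is an assumption throughout: `Kind`, `CatalogEntry`, `off_book_requests`, and every field name are hypothetical and do not describe Instacart's actual tooling. It shows one way to separate self-serve products from human-staffed services and to flag requests that fall outside the catalog.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Kind(Enum):
    PRODUCT = "product"  # self-serve tooling that teams use on their own
    SERVICE = "service"  # something a human on the infra team has to provide


@dataclass(frozen=True)
class CatalogEntry:
    """One page of the hypothetical infrastructure catalog."""
    name: str
    kind: Kind
    how_to_use: str                    # how a team consumes this entry
    kpi: str = ""                      # the measure reported for this entry
    staffed_by: Tuple[str, ...] = ()   # who staffs it (relevant for services)


def off_book_requests(catalog: List[CatalogEntry], requests: List[str]) -> List[str]:
    """Return the requested items that are not listed in the catalog."""
    known = {entry.name for entry in catalog}
    return [request for request in requests if request not in known]
```

    Anything returned by `off_book_requests` is work the team is doing outside its stated offerings, which is the work that most needs to be tracked.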

    You have to couple your catalog of products and services—which honestly could just be really well-written OKRs—with work systems and work systems information. If you don't have a work system in place that is helping you manage and track interrupts, if your engineering team is not committed to investing the time it takes to turn a DM into a Jira ticket, even if that takes longer than actually fixing the bug, you won’t have visibility.

    The reason that you're investing that time is that the work system is protecting the team, because you can accumulate information about all the interrupts and contacts, analyze 30 days' worth of that data, and make a decision as to whether we should or shouldn't be investing that kind of time. I don't think that you can do it up front because the business needs what the business needs.
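
    To make the 30-day analysis concrete, here is a minimal Python sketch. It assumes each interrupt has already been logged as a ticket with a date, a requesting team, a category, and the minutes spent; the `Interrupt` record, its fields, and `summarize_interrupts` are hypothetical names, not part of any real ticketing API.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Tuple


@dataclass
class Interrupt:
    """One logged interrupt (hypothetical shape of a ticket export)."""
    opened: date
    requester: str      # team that asked for help
    category: str       # e.g. "access request", "design review"
    minutes_spent: int


def summarize_interrupts(
    interrupts: List[Interrupt], today: date, window_days: int = 30
) -> List[Tuple[str, int, int]]:
    """Total count and minutes per category over the trailing window.

    Returns (category, count, minutes) tuples, most time-consuming first.
    """
    cutoff = today - timedelta(days=window_days)
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for item in interrupts:
        if cutoff <= item.opened <= today:
            totals[item.category][0] += 1
            totals[item.category][1] += item.minutes_spent
    rows = [(category, count, minutes) for category, (count, minutes) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

    Sorting by total minutes puts the categories that cost the team the most at the top, which is the input for deciding what is worth continuing to invest in.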

    You can look at repetitive requests, and at people who are consistently under-planning and coming to you at the last minute with their designs when it's well-stated that they should be including us early on, and have a conversation with them.

    The way that we give visibility into our team’s contributions is the products and services catalog, in which every product has a KPI, and the work systems. We do operational reviews, like what is the nature of the interrupts that you're servicing, what's the kind of work that you're doing off-book, and we present the two together.

    And I think the combination of those two gives enough visibility to the org that they have reasonable expectations of how we can help them, which is more than just keeping the lights on. Because that's unreasonable. Nobody can do that alone.

    Ashar Rizqi: We have another question here. What exactly does service ownership mean? Is it deciding what to ship, how to test, who's on PagerDuty for the platform? Just like cloud washing, I'm seeing a lot of SRE washing, where what looks like a traditional ops team is calling themselves an SRE team, and that's creating a view for developers that the SRE team is just someone there to run their software.

    Dustin Pearce: Seeking SRE is a good counterpoint to Google’s SRE manual. The Google SRE manual, where I've got a piece of software that's 10 years old that supports a billion users, is the unicorn of a problem. This idea that I have a specialized team who only focuses on ownership is very unusual. And most of us are not in that situation. Most of us are in a very different situation where we are very, very under-resourced and the software's changing very rapidly. Gmail is not changing dramatically from quarter to quarter. There's a misconception that one of the ways I can get more product velocity is by taking a lot of these ownership ideas or activities and outsourcing them to ops folks, SREs, whatever you want to call them, people who are good at ownership.

    Inevitably, the decoupling of the people who are creating and the people who are owning creates problems, both in the software and in the process. The quality goes down. We saw this a long time ago with QA. QA has gone through a revolution, and part of it was this idea that if the development team had no responsibility for quality or testing, the developers learned how to use their QA team to figure out if their software was done enough. They'd just throw it over like, “Is it done? No, okay, cool. I'll work on it some more.” That is not only a waste of the QA team's time, but it's just bad behavior. We see that over and over when you separate the concerns.

    But when you merge the concerns, and when we talk about what service ownership means, we talk about all the things the teams have to worry about. That's really critical because a team can get so mired in owning their infrastructure that they're not innovating the product, and that would be a very, very bad situation. If we all agree that the most important thing is moving fast and innovating for our customers and delivering delight at a reasonable cost, then you can't have your product teams just debugging all day. That's just not adding to their product.

    But fundamentally, it is this concept of you build it, you run it. If there was no infrastructure team, if there was nobody who was doing Terraform, if there was nobody responding when the database just seems to disappear, what would you do? How would you manage that? That is the totality of ownership. It is security, performance, reliability, and quality. Sometimes I forget all four, but I try to focus on those four characteristics.

    You know good ownership when you see it, and you can relate that to just about anything: home or tools or software. When someone is an attentive owner, it's just a better experience. The software is better. Everything is just easier. The development team's life is better. Because yes, they are responsible for defining their on-call rotation.

    The classic example I've seen is “Hey, we need to put our devs on call because we're doing service ownership now.” And your ops team, who was the only one carrying the pagers, is like “Oh, thank god, that's great.” And the developers say, “Whoa, whoa, whoa, I don't know what to do. What do I do when the pager goes off?” And the ops team just loses their mind because it's just years of frustration of carrying the pager.

    But more importantly, the managers don't know how to manage on-call rotations. So they don't know that they need to pay attention to people who are on call, and that just because you didn't get called doesn't mean that it didn't impact your life. You planned around it, your family planned around it, you're carrying a laptop around. That costs you. And every manager needs to be trained in that.

    So I think that service ownership is that full vertical, as well as all the ways that we can support the teams in that space and give them either training, practices, tools, or platforms, and sometimes embedded experts. Embedding SREs is a classic, favorite trick of mine when teams really struggle with ownership. They're there to help the teams understand the areas that we need to invest in to have a better experience, either for our customers or for our employees, because that's what we're always focused on.

    Ashar Rizqi: The key thing here around service ownership, like you said, is it's not just throwing someone into the deep end. It is, “Hey, here's some equipment to help you swim,” or “We're going to jump in with you for a little bit to teach you how it's going to get done.” I think that's absolutely critical. There is very much that failure mode. If you don't address that with the right culture, mindset, compassion, and patience, it is actually throwing it back over the wall. You're just throwing it back and saying, “Hey, not my problem. Now this is your problem, thanks again.”

    Dustin Pearce: That's what it feels like to dev teams. I've seen it way too many times: SRE teams, ops teams, or infrastructure teams get excited about service ownership, and they're like, “Hey man, I don't want all the stuff anymore. It's your problem.” What they don't understand is that I've got a product manager, a CEO, a designer, anyone with an opinion on whether it should have square or round rectangles, all trying to tell me what to do at the same time, and I still have to deliver it in the sprint. That is a very tough existence. There are a lot of problems in that space. And now you're just throwing a bunch of eggs at them and expecting them to just say, “Yes, please, more.”

    I know you have 70 things you want the development teams to do, but they just have 99 problems and they cannot absorb another one. So how do you boil this down? The challenge I have with the security team right now is if you built a software development team in a lab that was perfectly security-aware and conscious, what would be the behaviors that you would expect to see from that team? Is it how they would do PRs? What is it about that team? Is it how they approach documentation, or what is it?

    And then once you can define those security-minded behaviors, how do you instrument those behaviors on the teams to see if they're there or not? That goes back to this idea of focus; you can't fix everything. It's back to this kind of vigorous prioritization of what matters. And trust your security team. If they tell you that this is the behavior that matters, then invest in it and see what happens. Maybe they strike out and it's zero, but I still think that having a target and missing it is better than just trying to paper over losses and make everything look like a win.

    At Instacart, the data security, quality, and reliability teams all report to me. I have the opportunity right now to experiment in connecting all those things. So it's fertile ground for me to learn. When I talk about local decision-making and the KPIs, that's all aspirational. I've never seen it work this way. This is just my vision of what I think it should look like. And if somebody out there does have it working, I want to talk to you.


    Blameless

    Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.
