Modern Operations Best Practices from Engineering Leaders at New Relic and Tenable
In a recent industry leaders’ roundtable hosted by Blameless, top experts discussed how teams can embrace SRE best practices and make cultural shifts towards blamelessness. Originally published on Failure is Inevitable.
As reliability shifts left, more companies are adopting SRE best practices. These best practices don’t only include conducting incident retrospectives. The heart and soul of these best practices are a blameless culture and a desire to grow from each incident.
In a recent industry leaders’ roundtable hosted by Blameless, top experts discussed how teams can embrace SRE best practices and make cultural shifts towards blamelessness. The Executive Fireside Chat members included:
- Jon Sakoda, Founder at Decibel Partners
- Nic Benders, GM & VP, Telemetry Data Platform at New Relic
- Dheeraj Khanna, VP, Cloud Engineering and Product Security at Tenable
Below are a few key insights from their conversation:
- Finding the sweet spot between speed and reliability: To encourage teams to find a happy medium, leaders need to take a bottom-up approach and adopt a perspective of servant leadership.
- Hiring for growth mindset: Company size is not an indicator of future performance. Growth mindset is a much more important qualification.
- DevOps and observability have shifted ownership: With the service ownership model becoming more common, teams who carry pagers for their services must also understand observability.
- Top challenges for RCAs or contributing factor analysis: With complex systems, pinning down the root cause of an incident is very difficult. Communicating the findings in a way customers will find helpful is another challenge.
- Shake up the retrospective routine: To maximize participation in incident retrospectives, change up the format.
- Learn how to communicate retrospective findings: Determine your “elevator pitch” and be able to explain the key points from your retrospectives in 60-90 seconds.
- Blame should be labeled as a non-valuable concept: It’s important to acknowledge that blame is taking the easy way out. It’s much harder to admit that you don’t know why something broke.
How do engineering cultures change over time? How does company size factor into DevOps and SRE transformation?
Finding the sweet spot
Smaller startup organizations are known for their exceptional velocity. They can ship features in short order to meet the market demand. But sometimes this can lead to a more unreliable product. Teams often build up large amounts of technical debt to achieve product-market fit.
On the other hand, larger organizations are characterized by slower feature velocity and higher levels of bureaucracy, yet often a more reliable product. They’re the steadfast giants who adopt new technologies less often than startups. Yet these organizations can’t afford to rest on their laurels. They need to modernize as well.
Jon noted how difficult this can be, especially in the case of Dheeraj’s organization, Tenable. “Tenable might have been one of the hardest places to move from a more traditional model to a faster moving cloud agile model. Tenable has so much history: it was on premise, it was open source. To change that to something so dynamic and agile is a big achievement.”
Dheeraj spoke about the difficulty of this situation. “Leaders, or individuals who aspire to be leaders, need to learn how to bring about change in an organization that is not moving fast. It becomes your responsibility to see latencies, to see inefficiencies, and drive change. It is never easy. And in larger organizations, it is 10X harder than what you anticipate.”
The increased impact of each incident exacerbates this issue. As Nic notes, “It's a real struggle to find that sweet spot. You want to go back to that youthful vigor of a small startup where you can do everything you want immediately. But your blast radius on a mistake is so much greater now. You have to balance between both and improve constantly.”
Walking this tightrope requires a transformation in the leadership approach. As Dheeraj said, “It requires leaders to bring about a mindset shift. And it has to be championed in a way where you're doing servant leadership. It cannot be top down, it has to be bottoms up. You have to win the hearts of engineers, DevOps, SRE, and product managers.”
Building a team with growth mindset
With great talent come great ideas. But sometimes it’s less about the talent than about having a growth mindset. Whether an engineer has come from a large enterprise organization or a small startup, their hunger to learn is a stronger indicator of success.
Nic noted that this matches his own findings. “I've often heard people being concerned about former company size when hiring. If a person is from a big company, are they going to be able to thrive in this fast environment? Or if they only have startup experience, can they work with serious workloads? I've found the prediction ability on this to be very poor. I always look for people who are interested in constantly challenging themselves and learning new things.”
Part of encouraging teams to learn new things can come from a service ownership approach.
Making DevOps and observability everyone’s job
Nic’s significant tenure at New Relic has given him a unique perspective on how observability within companies has changed. He breaks it down into three phases.
- Observability isn't IT: Team members label observability as a production specialty, something that only QA or NOCs handle.
- Bringing in the developer skill set: With the adoption of DevOps, teams are no longer only buying tools; they’re also taking a more hands-on approach. Observability is moving into the DevOps domain.
- Observability is core to your job: Teams run production software. That means they need to be able to observe it. Teams adopt DevOps not only as a way to increase iteration speed, but also because they carry the pager.
Nic also spoke about how working for an observability company increases the pressure after an incident. “We need to be in front of the world and saying, ‘You can use our tools to run your own systems better.’ That’s a lot of pressure, but it's also part of what attracts a lot of great people. So you have big responsibilities, but it's also an enjoyable challenge.”
Why is root cause analysis, or contributing factor analysis, so hard?
Finding the needle in the haystack
Microservices continue to increase the complexity of how systems interact. This makes finding a single root cause or contributing factors increasingly challenging. Yet, Dheeraj believes that a shift in how the industry approaches failure is upon us.
“If you look at the last 10 years, the tolerance to production failures has completely shifted. It is not that production outages do not happen today, they do happen all day long. We're more tolerant. We have more resilience to do all our production changes. That's the biggest fundamental shift in the industry.”
But RCAs and contributing factor analyses are still helpful. As Dheeraj said, “The outcome is that we build better quality processes, go very slow roads, and build models where we are not putting customers through pain.”
He also noted that tooling can help. “Companies like Blameless have made it easy. It used to be a needle-in-a-haystack problem.”
Considering the human impact of RCAs
Beyond the technical challenges of creating RCAs, there is a human layer as well. Many organizations use these documents to communicate about incidents to the customers affected. However, this may require adding a layer of obfuscation.
Nic shares, “The RCA process is a little bit of a bad word inside of New Relic. We see those letters most often accompanied by ‘Customer X wants an RCA.’ Engineers hate it because they are already embarrassed about the failure and now they need to write about it in a way that can pass Legal review.”
Dheeraj agrees, and believes that RCAs should have value to customers reading them.
“Today, the industry has become more tolerant to accepting the fact that if you have a vendor, either a SaaS shop or otherwise, it is okay for them to have technical failures. The one caveat is that you are being very transparent to the customer. That means that you are publishing your community pages, and you have enough meat in your status page or updates.”
Even if Legal has strict rules about what is publishable, RCAs can still be valuable.
“We try to run a meaningful process internally. I use those customer requests as leverage to get engineering teams to really think through what's happened. Despite microservices making every incident a murder mystery, today's observability means that it is still way better than life was 20 years ago,” Nic noted.
According to Nic, RCAs require more than a deep technical analysis.
“It remains challenging for me to try and find a way to address those people skills and process issues. Technology is the one lever that we pull a lot, so we put a ton of technical fixes in place. But, there are three elements to those incidents. And I worry that we're not doing a good job approaching the other two: people skills and processes.”
What are best practices for a postmortem or incident retrospective? How do you know when you are conducting a great incident retrospective?
Creating an internal task force and policies
Internal processes are necessary for crafting incident retrospectives that convey the right information. But teams’ approaches to them can vary significantly. New Relic’s engineering team has created a group called NERF (New Relic Emergency Response Force). These are volunteer incident commanders who help teams during major incidents. These volunteers assist with the retrospective process as well.
However, even with an internal task force, retrospectives can be hit-or-miss. “We have a template, but results are mixed. Sometimes I feel like we're winning and sometimes I feel like we spent an hour together doing paperwork. That's fine, but, it's not really valuable. You're doing well when somebody learns something. I don't care if none of the fields on the form get filled at the end. If we came together as a group to a new understanding of something, then it was valuable,” Nic said.
As Jon succinctly differentiated, “There's data capture, which sometimes just feels like we're going through a routine. Then there's process, learning, and impact.”
Nic also believes that changing the format of retrospectives can invigorate teams. As he notes, “Humans, especially engineers, are very good at minimizing the effort that goes into a process they've done before. We have to change the process. We tend to see a lot of good results after we change. Novelty is important to keep people on their toes and engaging, not on autopilot.”
Nic also spoke about the importance of preparing ahead of the retrospective meeting. “The key is to try to get people to show up with the facts ahead of time, with things like capturing your timeline and your charts. That’s why tools like Blameless are so important, as fact finding takes a lot of time during the retrospective and the facts themselves also aren’t as valuable unless people have a chance to look at them and soak it in.”
Nic’s team also created a policy around mitigating repeat incidents. This policy dictates that teams should come up with follow-up action items that they can complete within the next two weeks. This helps prevent the same incidents from happening again.
Communicating incident retrospective or postmortem results
Incident retrospectives or postmortems are a way for teams to aggregate information about an incident and distill key learnings. Yet, there is more to them than meets the eye. Conducting a blameless incident retrospective requires the right context-setting, both from leadership and team members.
“Leaders should embrace failure as an impetus for change. Engineers should not be decimated for failing; instead, every failure should be used as a learning opportunity. A very, very important element is that failure is encouraged rather than discouraged in the organization. That's the only way somebody will raise their hand,” Dheeraj said. He also quoted Quincy Adams’ famous words: “Try and fail, but do not fail to try.”
He also noted the importance of being able to communicate postmortems to other internal stakeholders. “Postmortems should be able to be communicated with an elevator pitch. If I want to tell my CEO what broke and what we learned from it, I should be able to explain that in 60-90 seconds.”
For product or engineering stakeholders, a more technical analysis is important. Dheeraj spoke about the importance of using tooling to help with this information aggregation. “If I want to deep dive, even that information should be available in a fairly easy way. That’s what tools like Blameless have given us. Postmortems are not a one-time event, it's a long-term event, which shows you areas you can improve and then allows you to actually go fix that.”
How is blameless culture best implemented, not only in DevOps, but in an entire organization?
Eliminating blame as a valuable concept
Blamelessness is a core value of SRE and DevOps. As incidents are inevitable in distributed systems and each failure is an opportunity to learn, blame has no place at the table. Yet, removing blame from interactions can be very difficult.
Before becoming the GM/VP of New Relic’s Telemetry Data Platform, Nic was New Relic's Chief Architect, where he studied the connections between organizational systems, not just technical ones. He noted that there are common traps that senior leadership can find themselves falling into.
“More often than not, it's not the individual who is the problem. There are cases, for sure. But, by and large, people respond to incentives and they operate in the system that you have built.”
He also spoke about why teams often fall into the pattern of blaming: it’s easier than getting to the real answer.
“The important part of establishing a blameless culture is to eliminate the idea of blame as a valuable concept. Blame doesn't help us get to answers, it is just a shortcut to avoid digging deeper. This is true even when you are tempted to take blame yourself. You have to think differently and realize that this is not an individual problem. This is an incentive, economics, or information problem. It's very hard to do. It's the opposite of human nature and you have to push against it constantly.”
Jon notes that blamelessness doesn’t stop with engineering departments. This attitude should be adopted across entire organizations. “DevOps is just one part of the organization that has to embrace a blameless culture. Ultimately customer support, sales, marketing, and executives are a part of it.”
What advice would you give to your younger engineer self?
Hindsight is always 20/20. But it can be valuable to consider what advice you would give your younger self, as it might help other people. Before our panel concluded, Nic, Dheeraj, and Jon each gave their best piece of advice.
“I have to always remind myself that I didn't get to where I am today by doing things that I was qualified to do. In the same way, as a leader I need to be giving work to other people who might not be, or might not seem to be, qualified to do that work, because that is their path to growth,” Nic said.
Dheeraj also echoed the value of learning and using it to drive change. “Always challenge the status quo. Drive change by virtue of your own energy and initiative. It's never easy to drive that kind of change. But always have that energy and inspiration to challenge the status quo and encourage discussion. Innovation comes with failure. You cannot innovate before you fail.”
Jon also left us with sage advice. “When the world is changing fast, playing it safe is not safe.”