Twitter’s Reliability Journey
Originally published on Failure is Inevitable.
Twitter’s SRE team is one of the most advanced in the industry, managing the services that capture the pulse of the world every single day and throughout the moments that connect us all. We had the privilege of interviewing Brian Brophy, Sr. Staff SRE, Carrie Fernandez, Head of Site Reliability Engineering, JP Doherty, Engineering Manager, and Zachary Kiel, Sr. Staff SRE to learn about how SRE is practiced at Twitter.
As a company, Twitter is approximately 4,800 employees strong with offices around the world. SRE has been part of the engineering organization formally since 2012, though foundational practices around reliability and operations began emerging earlier. Today, SRE at Twitter features both embedded and core/central engagement models, with team members who hold the SRE title as well as those who do not but still perform SRE responsibilities. Regardless of their role or title, a key mantra among those who care about reliability is “let’s break things better the next time.”
To learn about how Twitter’s reliability practices have evolved to support their explosive growth, let’s dive into their SRE journey together.
SRE’s beginnings: Accelerating the growth of operational excellence
While the SRE team was formally established in 2012, its practices predate that year: operations and software engineers were already intensely focused on reliability and having real impact on it.
Prior to 2012, Twitter had a traditional ops team with sys admins who were responsible for data center operations, provisioning systems, DNS, package management, deploys, and other typical responsibilities. The release engineering team was responsible for deploying the Ruby on Rails monolith, and built testing and infrastructure around safely rolling it out. At the time, with the monolithic codebase, there were fewer service boundaries as well as limited service interfaces and APIs, making it difficult to detect problems.
These early reliability-oriented teams began working on developing signals to monitor resource utilization, rolling out alerting across their instances through a phased approach. Simultaneously, the engineering organization began decomposing the monorail (the Ruby on Rails monolith) into Scala microservices. This led to a restructuring from one engineering team working on a single monolith to multiple service-oriented teams for core product areas such as Tweets and Timelines.
SRE emerged as a practice to provide operational experience and readiness across these distributed core software teams, beginning with embedded SREs within teams. Over time, as the architecture went from one service to hundreds and thousands of services, a dedicated SRE team also formed: the 24x7 Twitter Command Center, also known as TCC.
The SRE team’s remit and the importance of blameless culture
Twitter SREs have pioneered key initiatives to drive consistency and repeatability, including the following:
- Deployment and testing: A single deploy process, redline testing, synthetic load testing
- Durability as well as instrumentation and metrics across the system to provide situational awareness
- Large event planning, incident management and postmortems
The TCC also monitors visualizations of infrastructure and product health in real time, and drives incident management when product or reliability impact occurs.
Such practices and resources allow the team to be prepared for the questions that they have trained themselves to ask, and to drive consistent response across engineers, such as:
- Will the system support the load expected?
- How much capacity is needed?
- What are the next bottlenecks?
Large-traffic events are particularly interesting: the spikes may not necessarily be larger than daily peaks, as Internet traffic on Twitter ebbs and flows throughout a global day. The key difference is that a daily peak might take several hours to gradually ascend and then come back down, whereas a large event can reach its peak inside of a minute, creating far greater risk of instability. Engineers use stress and failover testing to inform event preparedness. The team creates playbooks to prepare for large traffic events that begin with key summary details such as who’s on-call, the various ways to reach them, escalations, and top-line metrics around the product area to determine if the service is in a healthy or unhealthy state. The playbooks also dive into how to anticipate what could go wrong based on existing knowledge about the systems and what agreed-upon actions to take if any of those failures occur. As Carrie points out, “Twitter also has automated infrastructure that takes corrective actions based on various SLO performance degradation”.
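To make the difference between a daily peak and an event spike concrete, here is a minimal Python sketch that compares how fast traffic ramps in each case. The function name and all traffic numbers are hypothetical and chosen purely for illustration; they are not Twitter figures.

```python
def max_ramp_rate(samples):
    """Return the steepest increase in load, in requests/sec gained per second.

    samples: list of (timestamp_seconds, requests_per_second) tuples,
    sorted by timestamp.
    """
    steepest = 0.0
    for (t0, r0), (t1, r1) in zip(samples, samples[1:]):
        if t1 > t0:
            steepest = max(steepest, (r1 - r0) / (t1 - t0))
    return steepest


# Hypothetical traffic: both cases climb by the same 50,000 requests/sec.
daily_peak = [(0, 100_000), (3 * 3600, 150_000)]  # climb takes three hours
event_spike = [(0, 100_000), (60, 150_000)]       # climb takes one minute

daily_rate = max_ramp_rate(daily_peak)
event_rate = max_ramp_rate(event_spike)
print(f"The event ramps {event_rate / daily_rate:.0f}x faster than the daily peak")
```

Under these assumed numbers the peak load is identical in both cases, yet the system has roughly two orders of magnitude less time to absorb it during the event, which is one way to see why stress testing and playbooks matter for spikes specifically.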
In the aftermath of incidents, the team adheres to a blameless postmortem process. The primary objective of the postmortem is to surface contributing factors that can be intelligently tackled, and to generate follow-up action items (as well as consensus on prioritizing follow-up work). With better documentation and focused action to address gaps in rate limiting, alerting, or other processes, the team is aligned on how to prevent a similar incident from happening again.
An important feature of Twitter’s blameless culture is that accountability looks forward instead of backward: team members are not held accountable for past decisions that may have been mistakes or oversights. Team members assume positive intent, and that their colleagues made decisions based on the tooling and context they had available. The typical notion of accountability is flipped on its head, so that it focuses on what should be done to address the vulnerabilities that created room for the failure. In other words, did something about the tooling lead team members to a suboptimal result, and what can be done to improve that?
Carrie adds, “during the postmortem process, it’s often more straightforward to focus on a defect or bug as being the root cause of an incident. The consequence of that approach can mean similar incidents may continue to occur, but in other services or with a slightly different presentation. Twitter has evolved its postmortem process to also look at process failures and opportunities to automate those processes in order to systematically address potentially broader issues that are presented during a single incident.”
SRE structure and staffing
While some individuals are hired into Twitter specifically as SREs, others “branch out” laterally from internal teams into the SRE role. Furthermore, while some formally hold the SRE title, Twitter also has software engineers and platform teams who may not have reliability in their title, but for whom it is core to their skillset and to the foundational role they play in the infrastructure.
Regardless of title, Twitter is fortunate to have extraordinary service owners who understand the dependencies in a multilayered distributed service architecture. When anomalous patterns arise, they are the first ones paged because of their ability to triage complex situations.
Furthermore, as reliability has been such a core aspect of Twitter engineering’s evolution over time, it is hard to isolate the impact of SRE. As an example, decomposing the monorail into microservices was supported by Finagle, Twitter’s RPC library which implements many core reliability features. Maintained by Twitter’s Core Systems Libraries (CSL) team, Finagle delivers many of the reliability engineering features to Twitter’s service stack, including connection pools, load-balancing, failure detectors, failover strategies, circuit-breakers, request timeouts, retries, back pressure, and more. While Finagle wasn't built or delivered by SREs, it is a big part of Twitter’s reliability story that has gotten the SRE practice to where it is today.
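For readers unfamiliar with one of the features listed above, the following is a minimal, generic sketch of the circuit-breaker pattern in Python. It is not Finagle's API or implementation; the class name, parameters, and thresholds are assumptions made up for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical circuit breaker: fail fast after repeated failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: reject without touching the backend.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow a trial call. A single failure
            # from here reopens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

The idea is that a struggling dependency stops receiving traffic for a short period instead of being hit with more requests, which gives it room to recover and keeps callers from waiting on calls that are likely to fail.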
How SRE shapes Twitter’s biggest moments
Brian, Zac, and JP also point to leadership support as essential to making reliability a first-class priority, which helped catalyze initiatives and hiring to support reliability. Due to Twitter’s unique place in the public dialogue, crashes due to events with huge influxes of Tweets like New Year’s Eve and the airing of Castle in the Sky received outsize attention. While the Fail Whale was originally intended to showcase the fact that scalability fixes were underway, it also became a signal to leadership of reliability’s importance to business success and public perception. The organization as a whole became relentlessly focused on how to “fail gracefully” and minimize the impact of future large-scale events.
The preparations for the World Cup in 2014 were an especially proud moment for the team. The engineering organization implemented synthetic load testing frameworks to simulate spikes of traffic in events and their potential impact on core services. They went through a production readiness review process (which can be thought of as a pre-production launch) to audit their runbooks, dashboards, and alerts. As part of the exercise, the team reviewed their load shedding playbook, which dictated the order of steps to take should there be a need to shed load from the site during a catastrophic scenario.
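One way to picture an ordered load shedding playbook is as a list of steps applied in sequence until load falls back under capacity. The Python sketch below is an assumption-laden illustration: the step names, reduction percentages, and the `plan_shedding` helper are all hypothetical and do not describe Twitter's actual playbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShedStep:
    name: str
    load_reduction: float  # fraction of the current load removed, 0.0 to 1.0


def plan_shedding(current_load, capacity, playbook):
    """Apply playbook steps in order until load fits within capacity.

    Returns the names of the steps applied and the resulting load.
    """
    applied = []
    load = current_load
    for step in playbook:
        if load <= capacity:
            break  # stop as soon as the system is back within capacity
        load *= 1.0 - step.load_reduction
        applied.append(step.name)
    return applied, load


# Hypothetical playbook, ordered from least to most user-visible impact.
playbook = [
    ShedStep("pause non-critical batch jobs", 0.10),
    ShedStep("disable a secondary feature", 0.20),
    ShedStep("rate-limit anonymous traffic", 0.50),
]

steps, load = plan_shedding(current_load=120.0, capacity=100.0, playbook=playbook)
print(steps, round(load, 1))
```

Agreeing on the order ahead of time is the point: during an incident nobody has to debate which lever to pull first, and the most disruptive steps are only reached if the gentler ones were not enough.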
The game day prep helped the team prioritize what they cared about and the actions they were going to take, reducing risk and stress in the event of an actual incident. In a testament to their remarkable efforts, everything went off without a hitch on the big day.
Expected traffic is just that—expected—so in preparing for large traffic events, there is always an element of guesswork. But over time, the Twitter engineering team has become increasingly resilient and adept at weathering complex events, yielding a strong sense of purpose and validation of all the upfront investment in reliability.
This is the first article of a two-part series. Stay tuned for part 2 of the interview with Brian, JP, and Zac to learn about how Twitter is driving adoption of SLOs.