What is MTTR (Mean Time to Recovery), and How Do Elite Engineering Teams at Uber, DoorDash, Airbnb, & Slack Master It?
If you are here just for the definition of Mean Time to Recovery (or Restore), you will find it, along with the formula to measure it, MTTR benchmarks, and the role of real-time data, in the FAQ section at the end of this blog.
But for some serious real-life knowledge, read this first: how elite engineering teams at Uber, DoorDash, Airbnb, and Slack master MTTR.
Let’s start with Uber.
(Please note: the following details have been hand-picked from various trusted online sources, including the engineering blogs of the respective companies. The actual stats, practices, achievements, and tools may differ from what these companies currently have in place. This is a high-level overview; the real incident recovery practices at these giants surely have more depth and breadth.)
1) MTTR at Uber
Objective:
Achieve high system reliability and rapid incident recovery to realize Uber’s mission of offering seamless transportation services.
Key Initiatives:
• Formation of an SRE team: Uber formed a dedicated SRE team in 2014 with two structural approaches. Embedded SRE teams work alongside specific engineering teams (data, frontend/backend) to ensure high system reliability, while infrastructure-focused SRE teams manage shared infrastructure components, such as observability platforms. This holistic approach lets Uber give specialized attention to both individual services and infrastructure reliability.
• Ring0 Initiative: Uber refers to Ring0 as a group of engineers with elevated privileges to make critical decisions during incidents, including degrading or shutting down features or redistributing load to other data centers. This eliminates the time wasted on gated decision-making, which would otherwise delay both response and recovery.
Incident Response Process:
• Assess: Confirm the nature and scope of the issue.
• Mitigate: Implement immediate measures to minimize impact.
• Delegate: Assign specific tasks to relevant teams or individuals.
• Communicate: Maintain clear communication throughout the incident lifecycle.
Observability and Monitoring Tools:
• Jaeger: Developed by Uber, Jaeger is an open-source, end-to-end distributed tracing system to monitor and troubleshoot complex microservices environments. (Read More)
• M3: Another in-house tool built by Uber, M3 is a large-scale metrics platform designed to provide real-time insights into system performance. (Read More) A minimal tracing sketch follows this list.
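You don't need Uber's in-house stack to start collecting traces. As a rough illustration only (this is not Uber's actual setup), here is a minimal Python sketch that emits spans with the OpenTelemetry SDK, which Jaeger-compatible backends can ingest; the service and span names are hypothetical.

```python
# Minimal tracing sketch using the OpenTelemetry Python SDK (assumption:
# not Uber's internal setup). Spans are printed to the console here; in
# practice you would swap in an exporter that ships them to a
# Jaeger-compatible backend.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("ride-matching-service")  # hypothetical service name

with tracer.start_as_current_span("match_rider_to_driver") as span:
    span.set_attribute("region", "us-east")  # illustrative attribute
    with tracer.start_as_current_span("query_driver_locations"):
        pass  # a downstream call would be traced here
```

Once spans like these flow into a tracing backend, responders can follow a failing request across services instead of guessing, which is exactly what shortens time to recovery.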
Service Level Agreements (SLAs):
• Uber teams are required to maintain a 99.9% server-side API call success rate, which is equivalent to an allowable downtime of approximately 1.44 minutes per day (a quick calculation sketch follows this list).
• The team regularly assesses out-of-SLA endpoints to evaluate user impact.
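The arithmetic behind that number is simple: 0.1% of the 1,440 minutes in a day is about 1.44 minutes. Here is a tiny sketch (the function name is hypothetical) for turning any availability target into an allowable-downtime budget:

```python
# Convert an availability/success-rate target into allowable downtime
# (error budget) for a given period.
def error_budget_minutes(sla_percent: float, period_minutes: float) -> float:
    return (1 - sla_percent / 100) * period_minutes

print(error_budget_minutes(99.9, 24 * 60))       # ~1.44 minutes per day
print(error_budget_minutes(99.9, 30 * 24 * 60))  # ~43.2 minutes per 30-day month
```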
Proactive Failure Simulation:
Uber runs cybersecurity incident simulations to prepare teams for real-world incidents.
Its three-pronged simulation approach includes Tabletop exercises (focus on decision-making, leadership roles, and cross-team collaboration), Red Team Operations (simulations mirroring real-world adversaries to test detection and response), and Atomic Simulators (small, repeatable tests for specific detections, SOPs, and threat intel scenarios).
TIP: Not every company has the resources to build an in-house system like Uber's M3. That's where software engineering intelligence platforms like Hivel step in, helping you bring down MTTR without the heavy engineering lift. Here is how:
• By putting AI to real use and analyzing data from development tools like Git, Jira, and CI/CD platforms, Hivel pinpoints context-rich root causes.
• Hivel monitors hotfix cycle times and tags slowdowns, helping teams identify blockers and streamline the resolution process for a faster MTTR.
• Along with real-time data, Hivel analyzes historical patterns to unearth long-term trends. With this data, project managers and leadership teams can make strategic decisions aimed at bringing down the overall MTTR score.
You must read: Strategies to Reduce Mean Time to Restore (MTTR)
2) MTTR at DoorDash
Objective:
Strengthen system reliability and minimize downtime in order to offer seamless service to customers, merchants, and delivery personnel.
Key Initiatives:
• Microservice Architecture Adoption: To improve scalability and developer productivity, DoorDash transitioned from a monolithic architecture to microservices. This allowed their developers to develop, deploy, and scale services autonomously, resulting in higher system resilience.
• Fault Injection Testing: By leveraging tools like Filibuster to automatically discover microservice dependencies and inject faults, the team detects and mitigates issues early.
• Metric-Aware Rollouts: The engineering team at DoorDash introduced automated checks on standardized application quality metrics during rollouts. When degradation is detected, deployments are paused automatically and teams are alerted for quick action (see the sketch after this list).
• Enhanced Observability with Cloud-Native Monitoring: They adopted a cloud-native monitoring solution to handle the scale and complexity of their infrastructure, enabling faster incident acknowledgment and recovery.
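To make the metric-aware rollout idea concrete, here is a simplified Python sketch (an illustration, not DoorDash's actual tooling) of a rollout gate that compares post-deploy error rate and latency against a baseline and pauses the rollout when either degrades too far; the metric names and thresholds are hypothetical.

```python
# Hypothetical rollout gate: pause a deployment if key quality metrics
# degrade relative to the pre-deploy baseline. Not DoorDash's actual
# system; metric sources and thresholds are placeholders.
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float      # fraction of failed requests, e.g. 0.002
    p99_latency_ms: float  # 99th-percentile latency in milliseconds

def should_pause_rollout(baseline: Metrics, canary: Metrics,
                         max_error_increase: float = 0.005,
                         max_latency_ratio: float = 1.25) -> bool:
    """Return True if the canary's metrics have degraded beyond the thresholds."""
    error_degraded = canary.error_rate - baseline.error_rate > max_error_increase
    latency_degraded = canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio
    return error_degraded or latency_degraded

baseline = Metrics(error_rate=0.001, p99_latency_ms=180.0)
canary = Metrics(error_rate=0.009, p99_latency_ms=190.0)

if should_pause_rollout(baseline, canary):
    print("Degradation detected: pausing rollout and alerting the owning team.")
```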
Notable Incident and Root Causes at DoorDash
On May 13, 2022, DoorDash faced a system-wide outage for almost three and a half hours. It was caused by planned database maintenance that led to increased query latency.
The major root causes identified were an underestimation of the latency impact of the database maintenance, increased service latency resulting in timeouts, and exceeded connection limits in the Envoy infrastructure. (Read Detailed Report)
3) MTTR at Airbnb
Objective:
Ensure high system availability and swift incident resolution to maintain a seamless user experience for guests and hosts.
Key Initiatives:
• Automated Incident Management via Slack: They have developed an incident management bot integrated with Slack, which centralizes and streamlines incident detection, allocation, and resolution (a simplified sketch of such a bot follows this list).
• Transition to Service-Oriented Architecture (SOA): To improve scalability and fault isolation, they have migrated from a monolithic architecture to SOA. This transition allows services to be developed and deployed independently, which makes it quicker to identify and resolve issues.
• Enhanced Monitoring and Alerting: They have implemented Datadog and PagerDuty to standardize monitoring across the infrastructure. These tools raise alerts and offer insights into system performance, helping teams detect and resolve issues rapidly.
• Blameless Postmortems: They have built a culture of blameless postmortems, which include the identification of root causes and the implementation of preventive measures. This promotes a company-wide culture of continuous learning and improvement.
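As a rough illustration of what a Slack-integrated incident bot can do (this is a generic sketch, not Airbnb's internal tool), the snippet below creates a dedicated incident channel, invites responders, and posts a summary using the slack_sdk Python client; the function name, incident fields, and required token scopes are assumptions.

```python
# Simplified incident-bot sketch (not Airbnb's internal tool): on
# detection, create a dedicated Slack channel, invite responders, and
# post a summary. Requires a bot token with the relevant scopes.
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_incident(incident_id: str, summary: str, responders: list[str]) -> str:
    channel = client.conversations_create(name=f"incident-{incident_id}")
    channel_id = channel["channel"]["id"]
    client.conversations_invite(channel=channel_id, users=",".join(responders))
    client.chat_postMessage(
        channel=channel_id,
        text=f":rotating_light: Incident {incident_id} opened: {summary}",
    )
    return channel_id
```

Centralizing the conversation in one channel from the first minute is what cuts the coordination overhead that quietly inflates MTTR.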
Most Recent Incident (Feb 2025):
The company reported an increased number of 500 errors across multiple endpoints, mainly affecting pricing settings. The issue was resolved within a day. (Source)
4) MTTR at Slack
Objective:
Ensure rapid detection, coordinated response, and efficient recovery from service disruptions.
Key Initiatives:
• Decentralized On-Call Rotations: Initially, a centralized on-call team known as AppOps handled incidents at Slack. As Slack scaled, they introduced a decentralized approach under which individual development teams take responsibility for their respective services. (Read More)
• Adoption of FEMA's Incident Response Guidelines: In 2018, Slack's Reliability Engineering team revisited and revamped its incident response process, modeling it on FEMA's incident response guidelines. This resulted in the establishment of Incident Command, run by trained engineers who own the responsibility for incident resolution.
• Automated Security Bots: Slack deploys automated bots that scan for suspicious activity; if anomalies are detected, the bots escalate them to the security team. (Know more)
• Proactive Testing: Teams at Slack proactively engage in controlled failure exercises to test system resilience. By deliberately causing failures, teams can identify weaknesses and potential damage and improve their recovery strategies accordingly (see the sketch below).
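As a rough idea of what a controlled failure exercise can look like in code (a generic sketch, not Slack's actual tooling), the wrapper below randomly injects latency or errors into a dependency call so a team can watch how its service degrades and recovers; the probabilities and failure modes are illustrative.

```python
# Generic fault-injection wrapper for controlled failure exercises.
# Not Slack's tooling; probabilities and failure modes are illustrative.
import random
import time
from functools import wraps

def inject_faults(latency_s: float = 2.0, error_rate: float = 0.1,
                  latency_rate: float = 0.2):
    """Randomly add latency or raise an error around a dependency call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < error_rate:
                raise RuntimeError("Injected fault: dependency unavailable")
            if roll < error_rate + latency_rate:
                time.sleep(latency_s)  # simulate a slow dependency
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults()
def fetch_user_profile(user_id: str) -> dict:
    return {"user_id": user_id, "status": "ok"}  # stand-in for a real downstream call
```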
Notable Incident and Root Causes at Slack
On January 4, 2021, after the holiday season, a sudden surge of users led to network saturation issues as AWS Transit Gateways (TGWs) did not scale rapidly enough to handle the increased load.
To fix the issue, AWS engineers manually increased TGW capacity across all Availability Zones. To prevent this from happening again, Slack implemented measures to enhance autoscaling.
5 Top MTTR Lessons from Uber, DoorDash, Airbnb, and Slack + 1 Bonus From Hivel
MTTR is not only about fixing the issue. It is also about building a culture of ownership, fostering a recovery-first mindset, designing systems for graceful degradation, and enabling teams with the right tools.
Here are the five lessons you can learn from elite teams at Uber, DoorDash, Airbnb, and Slack, followed by a bonus from Hivel.
• Have and act on pre-defined recovery playbooks.
• Use automated rollbacks and safe deployment strategies to recover faster.
• Build fallback systems for critical services like pricing and availability.
• Use feature flags and decoupled architecture to recover specific services quickly (a minimal feature-flag sketch follows this list).
• Conduct regular incident drills to practice and refine recovery actions.
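Here is a minimal sketch of the feature-flag idea from the list above (a generic illustration, not any of these companies' systems): non-critical features are checked against a flag store, so during an incident they can be switched off to shed load while the core flow keeps working.

```python
# Minimal in-memory feature-flag sketch for graceful degradation.
# In production the flags would live in a shared store (e.g. a config
# service) so they can be flipped without a deploy.
FLAGS = {"personalized_recommendations": True}

def is_enabled(flag: str) -> bool:
    return FLAGS.get(flag, False)

def render_home_page(user_id: str) -> dict:
    page = {"user_id": user_id, "core_listings": ["..."]}
    if is_enabled("personalized_recommendations"):
        page["recommendations"] = ["..."]  # expensive path, disabled during incidents
    return page

# During an incident, an operator flips the flag to shed load:
FLAGS["personalized_recommendations"] = False
```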
Bonus: Track your team’s MTTR score. Because what you can’t measure, you can’t improve.
Hivel is a full-fledged, AI-powered software engineering intelligence platform built to give you real visibility into how your engineering team recovers from incidents.
From MTTR tracking to ongoing monitoring, Hivel brings incident, productivity, and well-being metrics together in immersive dashboards.
Hivel empowers SREs, project managers, and leaders with…
• Seamless integrations across the SDLC, CI/CD, project management, and communication tools for no data silos
• Faster root-cause correlation across distributed systems
• Real-time MTTR insights that help identify delays in recovery workflows
• Data-backed retrospectives for faster learning and improvement
FAQs
- What is MTTR (Mean Time to Recovery)?
Mean Time to Recovery or Mean Time to Restore is the average time it takes to recover from a system or product failure. It measures the effectiveness of the recovery process in place. A lower MTTR indicates quicker recovery times and higher operational efficiency.
- How to measure Mean Time to Recovery?
MTTR = Total Downtime / Number of Incidents
For instance, if your system faces a total of 15 hours of downtime over 5 incidents, the MTTR score would be 15 hours (total downtime) / 5 (number of incidents) = 3 hours.
It means that your team takes an average of 3 hours to restore or recover the service after failure.
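The same calculation in a few lines of Python (a sketch; the incident durations below are hypothetical):

```python
# Compute MTTR as total downtime divided by the number of incidents.
incident_downtime_hours = [4.0, 2.5, 3.0, 1.5, 4.0]  # hypothetical incidents

mttr = sum(incident_downtime_hours) / len(incident_downtime_hours)
print(f"MTTR: {mttr:.1f} hours")  # 3.0 hours for 15 hours across 5 incidents
```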
- What is a good MTTR score?
As per the DORA report, elite teams fix issues within one hour. Both high-performing and medium-performing teams fix issues within a day. Low-performing teams take between one week and one month to fix an issue.
- Why does real-time data matter for MTTR?
The fundamental nature of any failing system is that it starts slow and spirals fast. Real-time data from before and during an incident gives engineering teams the superpower to deal with this chaos, turning "Oops… what's happening?!" into "Well, this is the thing." Here is how:
- Empower teams with live metrics like memory usage, response time, server load, etc.
- Continuous validation of recovery steps
- Collaborative visibility with real-time dashboard, reducing silos and speeding up cross-team collaboration
- Real-time MTTR tracking reveals trends and gaps, helping you craft a strategy to handle the next mishap with more engineering value