The 5 Whys technique is one of the most powerful tools in root cause analysis, and one of the most commonly misused. Developed at Toyota as part of the Toyota Production System and closely associated with Taiichi Ohno, it's deceptively simple: keep asking "why" until you reach the underlying cause. But simple doesn't mean easy. Here's how to use 5 Whys effectively in DevOps contexts.
What is the 5 Whys Technique?
The 5 Whys is an iterative interrogation technique used in root cause analysis (RCA). You start with a problem statement and ask "why did this happen?" For each answer, you ask "why?" again until you reach a root cause that you can actually address.
The number five is a guideline, not a rule. Some problems are solved in three whys. Others need seven or eight. The goal is to go deep enough to find causes you can fix, not to hit an arbitrary count.
Key Principle
A good 5 Whys analysis should end at a cause you can control. If your analysis ends at "because users are stupid" or "because the vendor made a mistake," you haven't reached an actionable root cause. Keep asking: why did users make that choice? Why didn't we protect against the vendor issue?
History and Origins of the 5 Whys
The 5 Whys technique is most closely associated with Taiichi Ohno, the industrial engineer widely regarded as the architect of the Toyota Production System (TPS). Its roots at Toyota trace back to founder Sakichi Toyoda in the 1930s, but it was Ohno who made repeatedly asking "why" a standard problem-solving practice on Toyota's factory floor. Rather than treating symptoms, Ohno insisted on drilling past the obvious to find the true source of manufacturing defects and inefficiencies.
Ohno formalized and popularized the method in his book "Toyota Production System: Beyond Large-Scale Production" (published in Japanese in 1978 and in English in 1988), where he described how repeatedly asking "why" could peel back the layers of a problem to reveal its root cause. He famously demonstrated this with a simple example: a machine stopped working, and by asking "why" five times the team traced the failure from a blown fuse through insufficient bearing lubrication and a worn lubrication-pump shaft to a missing strainer that had let metal scrap into the pump.
For decades, the 5 Whys remained primarily a manufacturing discipline. That changed when the Lean and Agile movements adopted Toyota's principles for software development. Eric Ries further popularized the technique in the startup world through his book "The Lean Startup," where he advocated using 5 Whys for diagnosing technical and organizational problems. From there, it spread rapidly into DevOps and Site Reliability Engineering (SRE) practices, becoming one of the most widely adopted RCA techniques in the industry.
Today, many engineering organizations use 5 Whys as a standard part of their incident review processes. Its simplicity is both its greatest strength and its most dangerous weakness: anyone can facilitate a 5 Whys session with no special training, but that same accessibility means it's easy to do poorly. Without discipline and evidence-based answers, a 5 Whys analysis can devolve into speculation and finger-pointing rather than genuine root cause identification.
5 Whys Template for DevOps Teams
Use this template to structure your 5 Whys analysis. Fill in each level, and don't move to the next "why" until you've verified the current answer is accurate.
5 Whys Analysis Template
Incident
[Describe the incident in one sentence]
Impact
[Duration, affected users/services, business impact]
Why #1
[What was the immediate technical cause?]
Why #2
[Why did that happen?]
Why #3
[Why did that happen?]
Why #4
[Why did that happen?]
Why #5
[Why did that happen?]
Root Cause
[Summarize the root cause(s) identified]
Corrective Actions
[List specific actions with owners and due dates]
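If your team stores post-mortems as structured data, the same template maps cleanly onto a small schema. Below is a minimal sketch, assuming a Python tooling stack; the class and field names are illustrative, not a standard or any particular tool's data model.

```python
# Minimal sketch of the 5 Whys template as structured data. All names here are
# illustrative assumptions, not a standard schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class WhyStep:
    question: str                                      # e.g. "Why did that happen?"
    answer: str                                        # the verified answer
    evidence: List[str] = field(default_factory=list)  # links to logs, metrics, commits

@dataclass
class FiveWhysAnalysis:
    incident: str                                        # one-sentence incident description
    impact: str                                          # duration, affected users/services, business impact
    whys: List[WhyStep] = field(default_factory=list)    # typically three to eight steps, not always five
    root_causes: List[str] = field(default_factory=list)
    corrective_actions: List[str] = field(default_factory=list)  # each with an owner and a due date
```

Keeping the analysis in a structured form also makes it easy to check, before the review is closed, that every corrective action has an owner and a due date.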
Real-World DevOps Examples
Example 1: Production Database Outage
Incident: Production database became unresponsive for 45 minutes
Why #1: Why was the database unresponsive?
The database ran out of available connections.
Why #2: Why did it run out of connections?
A new API endpoint was creating connections but not releasing them.
Why #3: Why wasn't it releasing connections?
The developer used a manual connection instead of the connection pool, and didn't close it in the error path.
Why #4: Why wasn't this caught in code review?
The reviewer wasn't familiar with our database patterns, and we don't have automated checks for connection handling.
Why #5: Why don't we have automated checks?
We haven't established standard database patterns or linting rules for connection management.
Root Causes:
- No enforced standards for database connection handling
- No automated detection of connection leaks
- Knowledge gap in code review process
Actions:
- Create and document database connection standards
- Add lint rule to detect manual connection usage
- Add connection pool monitoring and alerting
- Include database patterns in onboarding
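To make Why #3 concrete, here is a hedged sketch of the leaky pattern and its fix. It assumes a Python service using SQLAlchemy's connection pool; the incident's actual stack, table, and query are not given in the example, so all names here are hypothetical.

```python
# Hypothetical illustration of the connection leak from Why #3, assuming a
# Python service built on SQLAlchemy. Table and parameter names are made up.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://app@db/prod", pool_size=10, max_overflow=5)

# Leaky pattern: the connection is checked out manually and only closed on the
# happy path, so any exception raised by execute() strands a connection.
def fetch_orders_leaky(user_id: int):
    conn = engine.connect()
    rows = conn.execute(text("SELECT * FROM orders WHERE user_id = :uid"),
                        {"uid": user_id}).fetchall()
    conn.close()  # never reached if execute() raises
    return rows

# Fixed pattern: the context manager returns the connection to the pool on both
# the success and the error path.
def fetch_orders(user_id: int):
    with engine.connect() as conn:
        return conn.execute(text("SELECT * FROM orders WHERE user_id = :uid"),
                            {"uid": user_id}).fetchall()
```

The "lint rule" action could then be as simple as flagging any direct engine.connect() call that isn't wrapped in a with block or an approved helper.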
Example 2: Deployment Caused Service Degradation
Incident: API latency increased 10x after deployment
Why #1: Why did latency increase after deployment?
The new version was making N+1 database queries.
Why #2: Why was it making N+1 queries?
A refactored function moved a batch query into a loop.
Why #3: Why wasn't this caught before production?
Our staging environment has much less data than production, so the query count wasn't noticeable.
Why #4: Why is staging data so different from production?
We don't have a process for keeping staging data realistic.
Why #5 (a second path from Why #3): Why don't we have canary deployments to catch this?
Our deployment pipeline doesn't support gradual rollouts.
Root Causes:
- Staging environment not representative of production scale
- No gradual rollout capability to catch performance regressions
- No query count monitoring in deployment validation
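As a hedged illustration of Why #2, here is the shape of the refactor that introduced the N+1 pattern, continuing the assumed SQLAlchemy stack from Example 1 (table and column names are hypothetical).

```python
# Hypothetical before/after of the refactor described in Why #2.
from sqlalchemy import text, bindparam

# Before: one batch query fetches orders for every user in the request.
def load_orders_batched(conn, user_ids):
    stmt = text(
        "SELECT user_id, id, total FROM orders WHERE user_id IN :ids"
    ).bindparams(bindparam("ids", expanding=True))
    return conn.execute(stmt, {"ids": list(user_ids)}).fetchall()

# After: one query per user. With staging's small dataset the extra round trips
# are invisible; at production scale they dominate request latency.
def load_orders_looped(conn, user_ids):
    orders = []
    for uid in user_ids:
        rows = conn.execute(text("SELECT id, total FROM orders WHERE user_id = :uid"),
                            {"uid": uid}).fetchall()
        orders.extend(rows)
    return orders
```

Counting queries per request during deployment validation (the third root cause above) catches this class of regression even when staging data stays small.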
Example 3: Alert Fatigue Led to Missed Incident
Incident: Customers reported the outage 30 minutes before the team noticed
Why #1: Why didn't the team notice for 30 minutes?
The critical alert was buried in a flood of other alerts.
Why #2: Why were there so many other alerts?
We have 200+ alerts, and many fire frequently without requiring action.
Why #3: Why do alerts fire without requiring action?
Alert thresholds were set too aggressively, and nobody has tuned them.
Why #4: Why hasn't anyone tuned the alerts?
There's no owner for alert hygiene, and no process for reviewing alert effectiveness.
Root Causes:
- No alert ownership or hygiene process
- Alerts not prioritized by severity
- No regular review of alert signal-to-noise ratio
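One way to act on the last root cause is a periodic signal-to-noise review. The sketch below assumes you can export recent alert history as simple records with a name and an "actioned" flag; that export format, and the thresholds, are assumptions rather than any real alerting API.

```python
# Minimal sketch of an alert signal-to-noise review. The history format and the
# "actioned" flag are assumptions; adapt to whatever your alerting tool exports.
from collections import Counter

def noisy_alerts(history, min_firings=10, max_action_rate=0.1):
    """Return (alert_name, firings, action_rate) for alerts that fire often but
    rarely lead to action: candidates for retuning, demotion, or deletion."""
    fired = Counter(a["name"] for a in history)
    actioned = Counter(a["name"] for a in history if a.get("actioned"))
    return [
        (name, count, round(actioned[name] / count, 2))
        for name, count in fired.most_common()
        if count >= min_firings and actioned[name] / count <= max_action_rate
    ]

# Example: flag anything that fired at least twice last month without action.
history = [
    {"name": "HighCPU", "actioned": False},
    {"name": "HighCPU", "actioned": False},
    {"name": "DiskFull", "actioned": True},
]
print(noisy_alerts(history, min_firings=2))  # -> [('HighCPU', 2, 0.0)]
```

Running a review like this monthly, with a named owner, addresses both the missing hygiene process and the missing signal-to-noise check.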
Common 5 Whys Mistakes
1. Stopping Too Early
The most common mistake. If your 5 Whys ends at "the config was wrong," you haven't found the root cause. Why was it wrong? Why wasn't it caught? What systemic issue allowed this?
2. Following Only One Path
Most incidents have multiple contributing factors. If you only follow one "why" chain, you miss the others. Consider branching your analysis when there are multiple valid answers to "why."
Tip: Multi-Path Analysis
When a single "why" has multiple valid answers, explore each path. This often reveals that what seemed like a single incident was actually the intersection of multiple systemic issues. OutageReview supports branching cause chains natively, making it easy to explore multiple paths and document them in one investigation.
3. Blaming People
"Because the developer made a mistake" is not a root cause. People always make mistakes. The question is: what allowed the mistake to reach production? What guardrails were missing?
4. Accepting Unverified Answers
Each "why" answer should be verified with evidence. Logs, metrics, code commits, chat transcripts. If you're guessing at answers, your root cause analysis is built on assumptions.
5. Ending at External Factors
"The third-party service went down" is not something you can fix. But you can fix: why didn't we have redundancy? Why wasn't there a fallback? Why didn't we detect it faster?
When NOT to Use 5 Whys
The 5 Whys is a powerful tool, but it's not the right tool for every situation. Knowing when to reach for a different technique is just as important as knowing how to use 5 Whys well. Here are the scenarios where you should consider alternatives:
- Complex, multi-factor incidents spanning multiple systems. When an outage involves cascading failures across databases, networking, application logic, and third-party services, the 5 Whys tends to oversimplify by forcing you down a single causal path. In these cases, a fishbone diagram template is more appropriate because it explores causes across multiple categories simultaneously.
- When there's no clear starting point or problem statement. 5 Whys requires a well-defined problem to anchor the first "why." If the team can't agree on what the actual problem was, do timeline reconstruction and impact assessment before starting any causal analysis.
- When the team is emotionally charged or blame dynamics are present. The 5 Whys demands honest, evidence-based answers. If people fear that their answers will be used against them, they'll give defensive or misleading responses. Cool down first. Establish psychological safety. Then run the analysis.
- When the incident involves political or organizational factors that can't be discussed openly. Sometimes the root cause is a leadership decision, a budget cut, or a staffing problem that people are unwilling to name in a group setting. The 5 Whys will stall at a surface-level answer because the real answer is socially unacceptable to state aloud.
- Large-scale outages affecting many services simultaneously. When dozens of services go down at once, there are usually multiple independent root causes interacting. A single 5 Whys chain cannot capture this complexity. Use a fishbone diagram or fault tree analysis to map the full picture, then use 5 Whys to drill into specific branches.
Rule of Thumb
5 Whys is best for incidents with a relatively linear causal chain, where one thing led to another in a clear sequence. If you find yourself wanting to give multiple answers at the same "why" level, or if the causes span entirely different domains, switch to a fishbone diagram or combine techniques.
Tips for Better 5 Whys
- Build the timeline first. You can't analyze causes accurately without understanding what happened and when. Start with a detailed chronology.
- Involve the right people. Include those who were on-call, those who wrote the code, and those who understand the architecture. Different perspectives reveal different causes.
- Write it down as you go. Don't rely on memory. Document each "why" and its answer during the discussion, not after.
- Look for systemic issues. Individual mistakes are symptoms. Look for the process, tooling, or cultural factors that allowed them.
- End with actions. Every 5 Whys should produce specific, assigned corrective actions. Otherwise, you've understood the problem but done nothing to fix it.
Combining 5 Whys with Other Techniques
5 Whys works well alone, but it's even more powerful combined with other RCA methods:
- Timeline analysis helps ensure you're asking "why" about the right things. A detailed chronology reveals which events actually mattered and in what order, which directly improves the quality of your "why" questions. Without a timeline, you're guessing at causes based on incomplete information.
- Fishbone diagrams ensure you're considering causes across multiple categories. One particularly effective approach is to use 5 Whys as a drill-down tool within a fishbone diagram: first map out all the potential cause categories (people, process, technology, environment), then pick one "bone" and apply the 5 Whys to go deeper on that specific branch. This gives you both breadth and depth in your analysis.
- Post-mortem checklists ensure you don't skip critical steps in the investigation.
The best RCA processes use multiple tools depending on the incident complexity. Simple incidents might need only 5 Whys. Complex incidents benefit from a full toolkit. Tools like OutageReview let you switch between techniques within a single investigation, so you can start with a fishbone diagram to map the landscape and then drill into specific branches with 5 Whys, all without losing context or duplicating work. If your post-mortems tend to run long or lose focus, see our guide to running a focused 30-minute incident review.
Frequently Asked Questions
What is the 5 Whys technique?
The 5 Whys is a root cause analysis technique where you repeatedly ask "why" in response to a problem statement, typically five times, to drill past surface-level symptoms and reach the underlying cause. Each answer becomes the basis for the next question, creating a chain of causation from the observed problem to its root cause. The method was developed at Toyota and is now widely used in DevOps, SRE, and incident management.
How many whys should you ask?
Five is a guideline, not a rule. Some problems reach their root cause in three whys, while complex issues may require seven or eight. The right number is however many it takes to reach a cause that is actionable—something your team can actually fix to prevent recurrence. Stop when you've found a systemic or process issue that you can address, not when you've hit an arbitrary count.
When should you NOT use 5 Whys?
5 Whys works best for incidents with a relatively linear causal chain. Avoid using it as the sole technique for complex, multi-system failures where causes span multiple teams and services. In those cases, a Fishbone (Ishikawa) diagram is more appropriate because it explores causes across multiple categories simultaneously. Also avoid 5 Whys when blame dynamics are present—the technique requires honest answers, which people won't give if they fear punishment.
Who invented the 5 Whys method?
The 5 Whys method is most closely associated with Taiichi Ohno, the architect of the Toyota Production System (TPS); its roots at Toyota trace back to founder Sakichi Toyoda in the 1930s. Ohno used it as a practical problem-solving tool on Toyota's factory floor to identify the root causes of manufacturing defects, and described the technique in his book "Toyota Production System: Beyond Large-Scale Production" (published in English in 1988). The method later spread to software engineering through the Lean and Agile movements.
Key Takeaways
- Keep asking "why" until you reach causes you can control and fix
- Verify each answer with evidence before moving to the next "why"
- Focus on systems and processes, not individual blame
- Explore multiple paths when incidents have multiple contributing factors