Fishbone Diagram Template for Incident Analysis (Ishikawa Method)

Learn how to use Fishbone (Ishikawa) diagrams for software incident analysis. Includes a step-by-step walkthrough, worked example, and adapted categories for engineering teams.

By OutageReview Team | February 16, 2026

When an incident has multiple contributing factors across people, process, and technology, a single-threaded 5 Whys analysis often misses half the picture. The Fishbone diagram (also called an Ishikawa diagram) is designed for exactly this situation: it forces you to explore causes across multiple categories simultaneously, ensuring no dimension of the failure goes unexamined.

This guide covers how to use Fishbone diagrams for software incident analysis, with a step-by-step walkthrough, a worked example, and practical tips for adapting the traditional manufacturing categories to modern engineering teams.

What is a Fishbone Diagram?

A Fishbone diagram is a visual cause-and-effect analysis tool. The "head" of the fish represents the problem (the incident). The "bones" branching off the spine represent categories of potential causes. Within each category, you brainstorm specific factors that may have contributed to the failure.

The technique was developed and popularized in the 1960s by Kaoru Ishikawa, a professor at the University of Tokyo. Originally used in manufacturing quality control, it was one of the first structured approaches to multi-factor root cause analysis. Ishikawa's insight was that most quality problems don't have a single cause: they emerge from the intersection of multiple factors across different dimensions of the system.

The American Society for Quality (ASQ) considers Fishbone diagrams one of the seven basic quality tools, and the technique has been widely adopted in software engineering for post-incident analysis.

Why "Fishbone"?

The completed diagram resembles a fish skeleton: the problem statement is the head, the main categories are the large bones branching off the spine, and specific causes are smaller bones branching off each category. The visual structure makes it easy to see which categories have the most contributing factors and where investigation should focus.

The 6 Categories for Software Incidents

The traditional manufacturing Fishbone uses the "6M" categories: Man, Machine, Material, Method, Measurement, and Mother Nature. For software incidents, we adapt these to categories that better reflect how modern engineering teams operate:

People

Human factors: knowledge gaps, training, staffing, on-call experience, cross-team coordination, cognitive load during the incident.

Process

Workflow gaps: deployment procedures, change management, review processes, escalation paths, runbook coverage, testing requirements.

Technology

System issues: architecture limitations, capacity planning, single points of failure, missing circuit breakers, database performance, dependency management.

Tooling

Observability and operational tools: monitoring gaps, alerting thresholds, logging coverage, deployment tooling, incident management tools.

Environment

External factors: traffic spikes, third-party service failures, infrastructure provider issues, seasonal patterns, regulatory changes.

Communication

Information flow: documentation quality, handoff procedures, cross-team visibility, incident communication channels, knowledge sharing.

Not every category will have contributing factors for every incident. The value of the framework is that it forces you to consider each dimension, preventing tunnel vision on the most obvious technical cause.
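
If you want to capture a Fishbone as data rather than only on a whiteboard, the structure maps naturally onto a small data model. The following is a minimal Python sketch, not an OutageReview API: the names `Fishbone`, `Cause`, `CATEGORIES`, and `empty_categories` are illustrative assumptions. It shows the problem statement as the head, one list of causes per category, and a helper that reports which categories have not been considered yet.

```python
from dataclasses import dataclass, field

# The six adapted categories from this guide.
CATEGORIES = ("People", "Process", "Technology", "Tooling", "Environment", "Communication")


@dataclass
class Cause:
    description: str
    key_contributor: bool = False  # set later, when the team reviews the diagram


@dataclass
class Fishbone:
    problem: str  # the "head" of the fish
    bones: dict = field(default_factory=lambda: {c: [] for c in CATEGORIES})

    def add(self, category: str, description: str) -> Cause:
        if category not in self.bones:
            raise ValueError(f"unknown category: {category}")
        cause = Cause(description)
        self.bones[category].append(cause)
        return cause

    def empty_categories(self) -> list:
        # An empty bone is acceptable, but it should be a deliberate
        # outcome of the discussion rather than an oversight.
        return [c for c, causes in self.bones.items() if not causes]


diagram = Fishbone("Payment processing was unavailable for 47 minutes during peak traffic")
diagram.add("Technology", "DB connection pool too small for peak load")
diagram.add("Process", "Config change skipped staging")
print(diagram.empty_categories())
# → ['People', 'Tooling', 'Environment', 'Communication']
```

The list of empty categories works as a facilitation prompt: each one is a dimension the group should at least discuss before moving on.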

Step-by-Step: Running a Fishbone Analysis

Here's how to facilitate a Fishbone diagram session for a software incident. This works well as the analysis portion of a 30-minute post-mortem, after the timeline has been reviewed.

  1. Define the problem statement. Write the incident in one clear sentence at the "head" of the fish. Be specific: "Payment processing was unavailable for 47 minutes during peak traffic" is better than "payments were down."
  2. Draw the skeleton. Create the spine with six bones, one for each category (People, Process, Technology, Tooling, Environment, Communication). You can use a whiteboard, a shared doc, or a purpose-built tool.
  3. Brainstorm causes per category. Go through each category and ask: "What factors in this category contributed to the incident?" Spend 2-3 minutes per category. Write every suggestion—don't evaluate yet.
  4. Identify the key contributors. Review all the causes. Mark the ones that were significant contributors (not everything on the diagram is equally important). Typically 3-5 factors emerge as the primary drivers.
  5. Drill deeper with 5 Whys. For each key contributor, ask "why?" to dig deeper. The Fishbone tells you what contributed; 5 Whys tells you why those conditions existed.
  6. Define action items. Each key contributor should map to at least one corrective action. Assign owners and deadlines. Track these to completion.
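
Step 6 is the one most easily left half-done, so it can help to check it mechanically. The sketch below is one possible way to do that in Python, under stated assumptions: the `ActionItem` fields, the `unaddressed` helper, the owner name, and the due date are all hypothetical and chosen only for illustration. It returns every key contributor that does not yet have an action with both an owner and a deadline.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    contributor: str              # the key contributor this action addresses
    description: str
    owner: Optional[str] = None   # hypothetical field names, for illustration only
    due: Optional[date] = None


def unaddressed(key_contributors, actions):
    """Return key contributors with no action that has both an owner and a due date."""
    covered = {a.contributor for a in actions if a.owner and a.due}
    return [k for k in key_contributors if k not in covered]


key_contributors = [
    "Connection pool too small for peak load",
    "Config change skipped staging",
    "No alerting on connection pool exhaustion",
]
actions = [
    # Owner and date are made up for the example.
    ActionItem("Connection pool too small for peak load", "Restore pool size",
               owner="alice", due=date(2026, 3, 1)),
    # An action with no owner or deadline does not count as covered.
    ActionItem("No alerting on connection pool exhaustion", "Add pool utilization alert"),
]
print(unaddressed(key_contributors, actions))
# → ['Config change skipped staging', 'No alerting on connection pool exhaustion']
```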

Worked Example: Production Payment Outage

Let's walk through a complete Fishbone analysis for a realistic incident: a 47-minute payment processing outage during Black Friday peak traffic.

Incident: Payment API returned 503 errors for 47 minutes during Black Friday peak traffic, affecting ~12,000 transactions.

Fishbone Diagram: Payment Processing Outage

    People                Process               Technology
      |                     |                       |
      |-- On-call eng       |-- No pre-deploy       |-- DB connection
      |   unfamiliar        |   checklist for       |   pool too small
      |   with payment      |   payment changes     |   for peak load
      |   service           |                       |
      |                     |-- No rollback         |-- No circuit
      |-- No cross-         |   procedure           |   breaker on
      |   training on       |   documented          |   payment→DB
      |   payment svc       |                       |   calls
      |                     |-- Config change       |
      |                     |   skipped staging     |
      |                     |                       |
  ====|=====================|=======================|=====> OUTAGE
      |                     |                       |       47 min
      |                     |                       |       payment
      |-- Slack channel     |-- Monitoring          |       downtime
      |   for payments      |   didn't alert on     |
      |   was archived      |   connection pool     |-- Black Friday
      |                     |   exhaustion          |   traffic spike
      |-- Escalation        |                       |   (3x normal)
      |   path unclear      |-- Logs rotated        |
      |   for payment       |   before              |-- Payment provider
      |   issues            |   investigation       |   latency increase
      |                     |                       |
  Communication          Tooling               Environment

Key Contributors Identified

After reviewing all factors, the team identified these as the primary drivers:

  1. Technology: Connection pool too small for peak load. The database connection pool was sized for normal traffic. Black Friday traffic was 3x normal, exhausting the pool within minutes. This was the immediate trigger.
  2. Process: Config change skipped staging. The connection pool setting had been reduced two weeks earlier as a "minor config change" that bypassed the standard deployment pipeline. It was never load-tested.
  3. Tooling: No alerting on connection pool exhaustion. Monitoring existed for CPU and memory but not for connection pool utilization. The team didn't know the pool was exhausted until customers reported failures.
  4. People: No cross-training on payment service. The on-call engineer had never worked on the payment service. Troubleshooting took longer because they had to learn the system during the incident.

Drilling Deeper with 5 Whys

The team took the top contributor (connection pool sizing) and applied 5 Whys. The chain reached a systemic cause after four questions:

Why was the connection pool too small?

It was reduced from 100 to 25 connections two weeks ago.

Why was it reduced?

An engineer was debugging a connection leak and reduced the pool to isolate the issue.

Why wasn't it restored afterward?

The debug change was committed directly to the config repo without a ticket or follow-up task.

Why was a direct config change possible without review?

Config changes bypass the normal PR review process; they're treated as "operational" rather than "code."

Root cause: Config changes lack the same review and testing requirements as code changes.

This is the power of combining Fishbone with 5 Whys: the Fishbone identified that the connection pool was a key factor, and the 5 Whys revealed why that condition existed. The resulting action items address both the immediate issue (restore pool size, add alerting) and the systemic issue (treat config changes like code).
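
To keep the drill-down attached to the bone it came from, the chain can be recorded as structured data. This is a minimal Python sketch under assumed names (`WhyStep`, `FiveWhys`, and `render` are illustrative, not part of any tool); the questions, answers, and root cause are copied from the example above.

```python
from dataclasses import dataclass


@dataclass
class WhyStep:
    question: str
    answer: str


@dataclass
class FiveWhys:
    contributor: str   # the Fishbone factor being drilled into
    steps: list        # ordered WhyStep entries
    root_cause: str

    def render(self) -> str:
        lines = [f"Contributor: {self.contributor}"]
        for depth, step in enumerate(self.steps, start=1):
            indent = "  " * depth
            lines.append(f"{indent}Why {depth}: {step.question} -> {step.answer}")
        lines.append(f"Root cause: {self.root_cause}")
        return "\n".join(lines)


chain = FiveWhys(
    contributor="Connection pool too small for peak load",
    steps=[
        WhyStep("Why was the connection pool too small?",
                "It was reduced from 100 to 25 connections two weeks ago."),
        WhyStep("Why was it reduced?",
                "An engineer was debugging a connection leak and reduced the pool."),
        WhyStep("Why wasn't it restored afterward?",
                "The debug change was committed without a ticket or follow-up task."),
        WhyStep("Why was a direct config change possible without review?",
                "Config changes bypass the normal PR review process."),
    ],
    root_cause="Config changes lack the same review and testing requirements as code changes.",
)
print(chain.render())
```

Storing the contributor alongside the chain makes it clear which part of the diagram each root cause explains.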

When to Use Fishbone vs. 5 Whys vs. Both

Different incidents call for different analysis techniques. Here's how to choose:

Scenario | Recommended Technique
Simple incident, single clear causal chain | 5 Whys alone
Multiple teams/services involved | Fishbone first, then 5 Whys on key factors
Repeat incident (happened before) | Fishbone (the previous 5 Whys clearly missed something)
Unclear where the problem originated | Fishbone to map the landscape, then 5 Whys to drill down
P1 with significant customer impact | Both: comprehensive analysis warranted
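
The table above can also be expressed as a small decision helper, for example inside an incident template. The Python sketch below is only one reading of it: the function name, the flag names, and especially the precedence when several scenarios apply at once are assumptions, since the table does not rank the rows.

```python
def recommend_technique(*, multiple_teams: bool = False, repeat_incident: bool = False,
                        origin_unclear: bool = False, p1_customer_impact: bool = False) -> str:
    """Map the scenarios in the table to a recommended technique.

    Precedence (an assumption of this sketch): P1 impact first, then repeat
    incidents, then multi-team or unclear-origin incidents, else 5 Whys alone.
    """
    if p1_customer_impact:
        return "Fishbone and 5 Whys"
    if repeat_incident:
        return "Fishbone"
    if multiple_teams or origin_unclear:
        return "Fishbone, then 5 Whys on key factors"
    return "5 Whys alone"


print(recommend_technique())                          # → 5 Whys alone
print(recommend_technique(multiple_teams=True))       # → Fishbone, then 5 Whys on key factors
print(recommend_technique(p1_customer_impact=True))   # → Fishbone and 5 Whys
```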

For a comprehensive overview of all RCA techniques and when to apply them, see our ultimate guide to engineering root cause analysis.

Common Fishbone Mistakes

1. Treating It as a Checklist

The Fishbone is a brainstorming tool, not a form to fill out. If you're mechanically writing one cause per category and moving on, you're missing the point. Some categories will have five contributors; others will have zero. That's fine. The goal is exploration, not completion.

2. Stopping at the Fishbone

A Fishbone diagram identifies contributing factors, but it doesn't explain why those factors existed. Always follow up the Fishbone with deeper analysis (5 Whys) on the key contributors. The Fishbone is the map; the 5 Whys is the excavation.

3. One Person Fills It Out Alone

Fishbone diagrams work best as a group exercise. Different team members have visibility into different parts of the system. The on-call engineer sees the response. The service owner sees the architecture. The manager sees the process gaps. You need all perspectives.

4. Ignoring "Soft" Categories

Engineers naturally gravitate toward Technology and Tooling. But the People, Process, and Communication categories often contain the most impactful root causes. A missing runbook (Process) or unclear escalation path (Communication) can turn a 5-minute fix into a 45-minute outage.

Fishbone Diagrams in OutageReview

OutageReview includes a built-in Fishbone diagram builder adapted for software incidents. You can brainstorm causes in each category, mark key contributors, and seamlessly transition to 5 Whys analysis on any bone. The diagram is saved as part of the investigation record, so you can spot patterns across incidents (e.g., "Communication is a contributing category in 60% of our P1s").

Tracking Metrics from Fishbone Analysis

One underused benefit of Fishbone analysis is the structured data it produces. Over time, you can track which categories appear most frequently across incidents:

  • If Process is the top category, invest in deployment procedures and checklists
  • If Tooling dominates, your observability has gaps
  • If People keeps appearing, you have training or staffing issues
  • If Communication is frequent, your handoff and documentation practices need work
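
As a rough illustration of that tracking, the Python sketch below tallies how often each category appears among the key contributors. The incident records are hypothetical: `INC-1` mirrors the key contributors from the worked example, while the other two records, the IDs, and the record layout are invented for the example.

```python
from collections import Counter

# Hypothetical records: each incident lists the categories of its key contributors.
incidents = [
    {"id": "INC-1", "categories": ["Technology", "Process", "Tooling", "People"]},
    {"id": "INC-2", "categories": ["Process", "Communication"]},
    {"id": "INC-3", "categories": ["Process", "Tooling"]},
]


def category_frequency(incidents):
    """Share of incidents in which each category had at least one key contributor."""
    counts = Counter()
    for incident in incidents:
        # Use a set so a category counts once per incident, however many
        # individual causes it contained.
        counts.update(set(incident["categories"]))
    total = len(incidents)
    return {category: count / total for category, count in counts.most_common()}


for category, share in category_frequency(incidents).items():
    print(f"{category}: {share:.0%}")
```

Counting a category once per incident, rather than once per cause, keeps a single incident with many small Process findings from dominating the trend.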

This kind of cross-incident analysis turns individual post-mortems into an organizational improvement roadmap. For more on building effective post-mortem processes, see our step-by-step checklist.

Frequently Asked Questions

What is a Fishbone diagram?

A Fishbone diagram (also called an Ishikawa diagram or cause-and-effect diagram) is a visual root cause analysis tool that maps potential causes of a problem across multiple categories. The diagram resembles a fish skeleton, with the problem statement at the head and causes branching off the spine under categories like People, Process, Technology, Tooling, Environment, and Communication. It was developed by Kaoru Ishikawa in the 1960s and is widely used in software incident analysis.

When should I use a Fishbone diagram instead of 5 Whys?

Use a Fishbone diagram when the incident involves multiple contributing factors across different parts of the system, when multiple teams were involved, or when a previous 5 Whys analysis didn't prevent recurrence. Fishbone diagrams excel at ensuring you consider all dimensions of a failure, not just the most obvious technical cause. For simple incidents with a clear linear causal chain, 5 Whys alone is usually sufficient.

What are the 6M categories in a Fishbone diagram?

The traditional manufacturing 6M categories are Man, Machine, Material, Method, Measurement, and Mother Nature (Environment). For software incidents, these are typically adapted to: People (human factors and knowledge), Process (workflows and procedures), Technology (architecture and systems), Tooling (monitoring and operational tools), Environment (external factors and infrastructure), and Communication (information flow and documentation). The categories can be customized to fit your organization.

How long does a Fishbone analysis take?

A Fishbone brainstorming session typically takes about 20 minutes when facilitated well. Spend 2-3 minutes per category brainstorming potential causes (12-18 minutes across six categories), then 5 minutes identifying the key contributors. The subsequent 5 Whys drill-down on the key contributors adds another 10-15 minutes. In total, a Fishbone-plus-5-Whys analysis runs roughly 30 to 40 minutes, so fitting it into a 30-minute post-mortem format means keeping each step at the short end of its range.

Key Takeaways

  • Fishbone diagrams force you to explore causes across People, Process, Technology, Tooling, Environment, and Communication
  • Use Fishbone for multi-factor incidents, then drill deeper with 5 Whys on key contributors
  • Don't skip the "soft" categories—People, Process, and Communication often contain the most impactful root causes
  • Track which categories appear most across incidents to identify organizational improvement priorities
