April 30, 2026 · 7 min read

Incident vs Bug: How to Decide in Real Production Systems

Not every bug is an incident. Not every incident starts with a bug. Here's the decision framework I use in real production systems — two questions that cut through the noise every time.

Testing, Engineering, Production

It's 11pm. A Slack message comes in: "something is wrong with the app."

Is this an incident? A bug? A user error? Do you wake someone up? Do you roll back? Do you wait until morning?

The wrong call in either direction is expensive. Declare everything an incident and your team burns out on false alarms. Dismiss real incidents as bugs and customers churn while you're asleep.

Here's the two-question framework I've used across 13+ years in QA to make this call quickly and consistently.

The Two Questions

When something goes wrong in production, ask these in order:

Question 1: Is the user blocked?

Not annoyed. Not confused. Blocked.

Can they complete the core action they came to do? If no, they are blocked and it's an incident. If yes, it's not an incident yet, so move on to Question 2.

Examples of BLOCKED:
- Cannot log in
- Cannot complete checkout
- App crashes on launch
- Core feature returns error 500
- Data is corrupted or missing

Examples of NOT BLOCKED (just degraded):
- Slow load time (but loads)
- UI element misaligned
- Non-critical feature broken
- Error message is unhelpful but recovery is possible

Question 2: Is business impacted?

Even if users aren't blocked, some failures have direct business consequences.

Business impact indicators:
- Revenue processing is failing
- Data is being lost (not just unavailable — gone)
- SLA breach is imminent or occurring
- Regulatory / compliance violation
- Reputational damage happening in real-time (viral bad reviews)
- Partner / enterprise customer is affected

[!IMPORTANT] If either answer is YES → it's an incident. Treat it as one immediately. Don't wait for more data. Don't try to diagnose first. Declare the incident, notify the right people, then investigate.

The Decision Tree

Something is wrong in production
         │
         ▼
Is the user BLOCKED from core functionality?
         │
    YES ─┼─────────────────────────→ INCIDENT
         │
        NO
         │
         ▼
Is there direct business impact?
(revenue, data loss, SLA, compliance)
         │
    YES ─┼─────────────────────────→ INCIDENT
         │
        NO
         │
         ▼
        BUG
(log it, prioritize, fix in normal cycle)

This isn't about severity labels. It's about response speed. Incidents get immediate human attention. Bugs get queued.
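
The tree is small enough to write down as code. Here's a minimal Python sketch of it. The names `Report`, `Classification`, and `classify` are mine, made up for illustration, and the two booleans still have to come from a human answering the two questions.

```python
from dataclasses import dataclass
from enum import Enum


class Classification(Enum):
    INCIDENT = "incident"
    BUG = "bug"


@dataclass(frozen=True)
class Report:
    """Answers to the two questions for one production report (hypothetical type)."""
    user_blocked: bool        # Question 1: blocked from core functionality?
    business_impacted: bool   # Question 2: revenue, data loss, SLA, compliance?


def classify(report: Report) -> Classification:
    # Ask the questions in order; the first YES ends the decision.
    if report.user_blocked:
        return Classification.INCIDENT
    if report.business_impacted:
        return Classification.INCIDENT
    # Neither question was answered YES: log it and fix it in the normal cycle.
    return Classification.BUG


if __name__ == "__main__":
    # Users blocked and business impacted.
    print(classify(Report(user_blocked=True, business_impacted=True)).value)
    # Nobody blocked, no business impact.
    print(classify(Report(user_blocked=False, business_impacted=False)).value)
```

The function is deliberately boring. All the judgment sits in how you fill in the two fields, which is why the definitions need to be agreed before anything breaks.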

Why This Distinction Matters

Incidents require a different response mode

When you declare an incident:

- A dedicated person owns communication (not the engineer debugging)
- Status updates go out on a fixed cadence (at least every 30 minutes)
- The goal is to restore service first and understand why later
- Post-mortem is scheduled automatically

When you treat an incident like a bug:

- Engineers disappear into debugging while customers are blocked
- No communication goes out, so customers assume you don't know
- The fix takes longer because the pressure and focus aren't there
- You lose trust, even if you fix it quickly

Bugs require a different prioritization mode

Not every bug is urgent. A bug that affects 0.1% of users on a non-critical screen is not the same as a bug that affects every user who tries to reset their password.

Treating everything as an incident creates alarm fatigue. Your team stops responding with urgency because urgency has been diluted.

[!TIP] Keep a P1/P2/P3 bug severity system separate from your incident process. Incidents are time-sensitive production issues that need an immediate response. P1 bugs are severe and should be fixed the same day, but they don't trigger the incident process.

Real Examples

This was an incident

"Users are reporting they can't log in. Support tickets spiking. Tried to reproduce — confirmed. Login returns 401 for valid credentials."

Users blocked? ✅ Yes — completely blocked
Business impact? ✅ Yes — no one can access the product

Decision: Incident. Wake up the on-call. Post a status update. Roll back the auth deploy from 2 hours ago.

This was a bug, not an incident

"Avatar images aren't loading on the profile page. Shows broken image icon."

Users blocked? ❌ No — they can still use the app, just no avatar
Business impact? ❌ No — no revenue or data impact

Decision: Bug. Log it. P2 severity. Fix in the next sprint. No one gets woken up.

This looked like a bug but was an incident

"Payment confirmation emails are going to spam for some users."

At first glance: not blocking users (they can still pay), looks like an email deliverability issue.

Dig deeper:

- "Some users" = 40% of users
- Users not seeing confirmation emails are calling support to verify if payment went through
- Support volume is 5x normal
- Some users are placing duplicate orders out of uncertainty

Revised decision: Incident. Business impact is real. Revenue is at risk because duplicate orders can turn into refunds and chargebacks. Support costs are spiking.

[!WARNING] The most dangerous incidents are the ones that don't look like incidents at first. Always check the scope before dismissing something as a bug.
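
"Some users" is the phrase to distrust, so it helps to turn raw counts into ratios before deciding. A rough sketch, assuming you can pull an affected-user count and support ticket counts from somewhere. The function name `scope_summary` and its inputs are hypothetical, and the counts in the usage lines are invented only to reproduce the 40% and 5x figures from the example.

```python
def scope_summary(affected_users: int, total_users: int,
                  tickets_now: int, tickets_baseline: int) -> dict[str, float]:
    """Turn raw counts into the two ratios used in the example above."""
    if total_users <= 0 or tickets_baseline <= 0:
        raise ValueError("total_users and tickets_baseline must be positive")
    return {
        # Share of the user base that is affected (0.0 to 1.0).
        "affected_fraction": affected_users / total_users,
        # Current support volume relative to a normal period.
        "support_multiplier": tickets_now / tickets_baseline,
    }


if __name__ == "__main__":
    # Illustrative counts only: 4,000 of 10,000 users, 250 tickets vs. 50 normally.
    summary = scope_summary(4_000, 10_000, 250, 50)
    print(summary)
```

Whether a given ratio counts as business impact is still your team's call. The point is to look at numbers instead of the word "some".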

This felt like an incident but wasn't

"App is slow. Everything is taking 3x longer than normal."

Emotional response: everything is broken, wake everyone up.

Apply the framework:

Users blocked? ❌ No — slow but functional
Business impact? ❌ No. The metrics show a normal conversion rate and no error spikes; only latency is up

Decision: Bug. P1 bug — fix today, not next sprint — but not an incident. No all-hands. No status page update. Investigate during business hours.

Declaring this an incident would have woken up 5 people for something that resolved itself when a CDN cache expired.

The Severity Tiers I Use

Once you've decided it's a bug (not an incident), you still need to prioritize it:

P1 — Fix today, before next release
├── Data loss risk
├── Security vulnerability
├── Affects >20% of users
└── Core flow broken for a segment

P2 — Fix in current sprint
├── Affects 1% to 20% of users
├── Workaround exists
└── Non-core feature broken

P3 — Backlog
├── Visual/cosmetic issue
├── Affects <1% of users
├── Edge case with easy workaround
└── Enhancement misclassified as bug
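
If you want the tiers to be applied the same way by everyone, they can be sketched as a small function. This is one possible reading, not a definitive rule set: the `Bug` type and its field names are hypothetical, and the order of the checks (P1 signals first, then P3, otherwise P2) is my assumption about how to resolve overlaps between tiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bug:
    """Facts about a bug that was already classified as not an incident."""
    affected_fraction: float              # share of users affected, 0.0 to 1.0
    data_loss_risk: bool = False
    security_vulnerability: bool = False
    core_flow_broken_for_segment: bool = False
    cosmetic_only: bool = False


def bug_priority(bug: Bug) -> str:
    # P1: any high-risk signal, or more than 20% of users affected.
    if (bug.data_loss_risk
            or bug.security_vulnerability
            or bug.core_flow_broken_for_segment
            or bug.affected_fraction > 0.20):
        return "P1"
    # P3: cosmetic issues, or fewer than 1% of users affected.
    if bug.cosmetic_only or bug.affected_fraction < 0.01:
        return "P3"
    # Everything in between lands in the current sprint.
    return "P2"


if __name__ == "__main__":
    print(bug_priority(Bug(affected_fraction=0.05)))
```

Note that under this ordering a cosmetic issue hitting more than 20% of users comes out as P1. If that feels wrong for your product, move the cosmetic check first.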

Setting Up the System

You can't make good incident vs. bug decisions in the moment if you haven't defined them in advance. The worst time to debate definitions is when everything is on fire.

Document these before you need them:

1. What counts as "user blocked" for your product?

Define your core user flows. For a food delivery app: search → select → add to cart → checkout → payment. If any of these steps fails, it's an incident. Everything else is a bug unless it has business impact.
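
Writing the flow down can be as simple as a constant that lives next to your runbook. A tiny sketch using the food delivery steps above. The step identifiers and the `is_user_blocked` helper are hypothetical names, and the check only answers Question 1.

```python
# Core flow for the food delivery example, in order.
CORE_FLOW = ("search", "select", "add_to_cart", "checkout", "payment")


def is_user_blocked(failing_step: str) -> bool:
    """Question 1: does the failing step belong to the core flow?"""
    return failing_step in CORE_FLOW


if __name__ == "__main__":
    print(is_user_blocked("checkout"))       # a core step
    print(is_user_blocked("avatar_upload"))  # not a core step
```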

2. What counts as business impact?

Agree on thresholds. Revenue drop >X%? SLA breach imminent? Data loss of any kind? Write it down.
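
"Write it down" can literally mean a config object. Here's a sketch under stated assumptions: `BusinessImpactPolicy`, `Signals`, and `is_business_impacted` are illustrative names, and the revenue threshold is left as a required parameter because X is something only your team can pick. The 10% in the usage lines is a placeholder, not a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessImpactPolicy:
    """Thresholds agreed in advance. No default: the team has to choose."""
    max_revenue_drop: float   # fraction, e.g. 0.10 means a 10% drop


@dataclass(frozen=True)
class Signals:
    """What you observe during the event."""
    revenue_drop: float          # fraction relative to a normal period
    sla_breach_imminent: bool
    data_lost: bool


def is_business_impacted(policy: BusinessImpactPolicy, signals: Signals) -> bool:
    # Data loss of any kind and an imminent SLA breach count on their own;
    # revenue only counts once the drop exceeds the agreed threshold.
    return (signals.data_lost
            or signals.sla_breach_imminent
            or signals.revenue_drop > policy.max_revenue_drop)


if __name__ == "__main__":
    policy = BusinessImpactPolicy(max_revenue_drop=0.10)  # placeholder value
    print(is_business_impacted(policy, Signals(0.02, False, False)))
```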

3. Who declares an incident?

Any engineer on call should have the authority. Don't require manager approval — that delay costs you during a real incident. Trust your team to use the framework.

4. What's the first action when an incident is declared?

Have a runbook. Ours is:

1. Post in #incidents: "Incident declared: [description] | Owner: @name"
2. Check status page — update if user-facing
3. Start a video call if more than one person is needed
4. Owner sends updates every 30 minutes until resolved
5. After resolution: schedule post-mortem within 48 hours
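
The mechanical parts of the runbook (steps 1, 4, and 5) are easy to script. A minimal sketch, assuming nothing about your chat tool: it only builds the message text and the timestamps, and actually posting to #incidents is left out because that depends on your setup. All function names here are hypothetical.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=30)   # step 4
POSTMORTEM_WINDOW = timedelta(hours=48)   # step 5


def declaration_message(description: str, owner: str) -> str:
    """Step 1: the text to post in #incidents."""
    return f"Incident declared: {description} | Owner: @{owner}"


def update_times(declared_at: datetime, count: int) -> list[datetime]:
    """Step 4: when the next `count` status updates are due."""
    return [declared_at + UPDATE_INTERVAL * i for i in range(1, count + 1)]


def postmortem_deadline(resolved_at: datetime) -> datetime:
    """Step 5: latest time by which the post-mortem should be scheduled."""
    return resolved_at + POSTMORTEM_WINDOW


if __name__ == "__main__":
    declared = datetime(2026, 4, 30, 23, 0)
    print(declaration_message("Login returns 401 for valid credentials", "name"))
    for due in update_times(declared, 3):
        print("Update due:", due.isoformat())
```

Even if you never automate this, computing the update times once at declaration takes one more decision off the owner's plate.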

[!NOTE] The post-mortem is not optional. That's where you find the systemic issue behind the incident. Without it, you'll have the same incident again in 3 months.

The Rule in One Sentence

If users are blocked or business is impacted, it's an incident. Everything else is a bug.

Apply it fast, apply it consistently, and trust it. The framework exists so you don't have to think clearly at 11pm when you're half asleep and something is on fire.

Speed of correct classification is a skill. It comes from using the same two questions every time until they're automatic.

Sudarshan Chaudhari

AI Systems Builder / Product Engineer

Bangkok, Thailand

Solo Android developer with 13+ years in QA, building Android apps, AI automation systems, and developer tools at SudarshanTechLabs.
