Incident vs Bug: How to Decide in Real Production Systems
Not every bug is an incident. Not every incident starts with a bug. Here's the decision framework I use in real production systems — two questions that cut through the noise every time.
It's 11pm. A Slack message comes in: "something is wrong with the app."
Is this an incident? A bug? A user error? Do you wake someone up? Do you roll back? Do you wait until morning?
The wrong call in either direction is expensive. Declare everything an incident and your team burns out on false alarms. Dismiss real incidents as bugs and customers churn while you're asleep.
Here's the two-question framework I've used across 13+ years in QA to make this call fast and consistently.
The Two Questions
When something goes wrong in production, ask these in order:
Question 1: Is the user blocked?
Not annoyed. Not confused. Blocked.
Can they complete the core action they came to do? If yes — it's not an incident yet. If no — keep going.
Examples of BLOCKED:
- Cannot log in
- Cannot complete checkout
- App crashes on launch
- Core feature returns error 500
- Data is corrupted or missing
Examples of NOT BLOCKED (just degraded):
- Slow load time (but loads)
- UI element misaligned
- Non-critical feature broken
- Error message is unhelpful but recovery is possible
Question 2: Is business impacted?
Even if users aren't blocked, some failures have direct business consequences.
Business impact indicators:
- Revenue processing is failing
- Data is being lost (not just unavailable — gone)
- SLA breach is imminent or occurring
- Regulatory / compliance violation
- Reputational damage happening in real-time (viral bad reviews)
- Partner / enterprise customer is affected
[!IMPORTANT] If either answer is YES → it's an incident. Treat it as one immediately. Don't wait for more data. Don't try to diagnose first. Declare the incident, notify the right people, then investigate.
The Decision Tree
Something is wrong in production
        │
        ▼
Is the user BLOCKED from core functionality?
        │
   YES ─┼─────────────────────────→ INCIDENT
        │
        NO
        │
        ▼
Is there direct business impact?
(revenue, data loss, SLA, compliance)
        │
   YES ─┼─────────────────────────→ INCIDENT
        │
        NO
        │
        ▼
       BUG
(log it, prioritize, fix in normal cycle)
This isn't about severity labels. It's about response speed. Incidents get immediate human attention. Bugs get queued.
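The decision tree can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `ProductionIssue` and `Classification` names are hypothetical, and the two boolean fields stand in for the human judgment behind each question.

```python
from dataclasses import dataclass
from enum import Enum


class Classification(Enum):
    INCIDENT = "incident"
    BUG = "bug"


@dataclass(frozen=True)
class ProductionIssue:
    """Answers to the two questions for one production issue (hypothetical type)."""
    description: str
    user_blocked: bool       # Question 1: blocked from core functionality?
    business_impacted: bool  # Question 2: revenue, data loss, SLA, compliance?


def classify(issue: ProductionIssue) -> Classification:
    """Walk the tree in order: a YES to either question makes it an incident."""
    if issue.user_blocked:
        return Classification.INCIDENT
    if issue.business_impacted:
        return Classification.INCIDENT
    return Classification.BUG


login_down = ProductionIssue("Login returns 401 for valid credentials", True, True)
broken_avatar = ProductionIssue("Avatar images not loading", False, False)

print(classify(login_down).value)     # incident
print(classify(broken_avatar).value)  # bug
```

The function is deliberately trivial. The hard part is answering the two questions honestly, and the code only makes the order of the checks explicit.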
Why This Distinction Matters
Incidents require a different response mode
When you declare an incident:
- A dedicated person owns communication (not the engineer debugging)
- Status updates go out on a fixed cadence (at least every 30 minutes)
- The goal is restore service first, understand why later
- Post-mortem is scheduled automatically
When you treat an incident like a bug:
- Engineers disappear into debugging while customers are blocked
- No communication goes out — customers assume you don't know
- The fix takes longer because the pressure and focus aren't there
- You lose trust, even if you fix it quickly
Bugs require a different prioritization mode
Not every bug is urgent. A bug that affects 0.1% of users on a non-critical screen is not the same as a bug that affects every user who tries to reset their password.
Treating everything as an incident creates alarm fatigue. Your team stops responding with urgency because urgency has been diluted.
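[!TIP] Keep a P1/P2/P3 bug severity system separate from your incident process. Incidents are time-sensitive production issues that need an immediate response. P1 bugs are severe and should be fixed the same day, but they go through the normal engineering workflow rather than the incident process.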
Real Examples
This was an incident
"Users are reporting they can't log in. Support tickets spiking. Tried to reproduce — confirmed. Login returns 401 for valid credentials."
- Users blocked? ✅ Yes — completely blocked
- Business impact? ✅ Yes — no one can access the product
Decision: Incident. Wake up the on-call. Post a status update. Roll back the auth deploy from 2 hours ago.
This was a bug, not an incident
"Avatar images aren't loading on the profile page. Shows broken image icon."
- Users blocked? ❌ No — they can still use the app, just no avatar
- Business impact? ❌ No — no revenue or data impact
Decision: Bug. Log it. P2 severity. Fix in the current sprint. No one gets woken up.
This looked like a bug but was an incident
"Payment confirmation emails are going to spam for some users."
At first glance: not blocking users (they can still pay), looks like an email deliverability issue.
Dig deeper:
- "Some users" = 40% of users
- Users not seeing confirmation emails are calling support to verify if payment went through
- Support volume is 5x normal
- Some users are placing duplicate orders out of uncertainty
Revised decision: Incident. Business impact is real. Revenue is at risk from chargebacks. Support costs are spiking.
[!WARNING] The most dangerous incidents are the ones that don't look like incidents at first. Always check the scope before dismissing something as a bug.
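A scope check like the one in this example can be reduced to two ratios. The sketch below is only an illustration: `ScopeSnapshot`, its fields, and the threshold parameters are hypothetical names, and the threshold values in the usage lines are placeholders, not recommendations. Your team has to agree on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopeSnapshot:
    """Raw counts from analytics and the support queue (hypothetical fields)."""
    affected_users: int
    total_users: int
    support_tickets_now: int
    support_tickets_baseline: int


def affected_fraction(s: ScopeSnapshot) -> float:
    """Share of users hit by the issue, between 0.0 and 1.0."""
    if s.total_users <= 0:
        raise ValueError("total_users must be positive")
    return s.affected_users / s.total_users


def support_multiplier(s: ScopeSnapshot) -> float:
    """Current support volume relative to the normal baseline."""
    if s.support_tickets_baseline <= 0:
        raise ValueError("support_tickets_baseline must be positive")
    return s.support_tickets_now / s.support_tickets_baseline


def scope_exceeds(s: ScopeSnapshot, max_fraction: float, max_multiplier: float) -> bool:
    """True when either ratio is above the threshold the caller supplies."""
    return (affected_fraction(s) > max_fraction
            or support_multiplier(s) > max_multiplier)


# Counts shaped like the email example: 40% of users affected, 5x support volume.
emails_to_spam = ScopeSnapshot(4_000, 10_000, 500, 100)
print(affected_fraction(emails_to_spam))   # 0.4
print(support_multiplier(emails_to_spam))  # 5.0
# 0.05 and 2.0 are placeholder thresholds chosen only for this illustration.
print(scope_exceeds(emails_to_spam, max_fraction=0.05, max_multiplier=2.0))  # True
```

Thresholds are passed in rather than hard-coded so the function carries no opinion about what "too many users" means for a given product.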
This felt like an incident but wasn't
"App is slow. Everything is taking 3x longer than normal."
Emotional response: everything is broken, wake everyone up.
Apply the framework:
- Users blocked? ❌ No — slow but functional
- Business impact? ❌ No. The metrics show a normal conversion rate and no error spikes, just higher latency
Decision: Bug. P1 bug — fix today, not next sprint — but not an incident. No all-hands. No status page update. Investigate during business hours.
Declaring this an incident would have woken up 5 people for something that resolved itself when a CDN cache expired.
The Severity Tiers I Use
Once you've decided it's a bug (not an incident), you still need to prioritize it:
P1 — Fix today, before next release
├── Data loss risk
├── Security vulnerability
├── Affects >20% of users
└── Core flow broken for a segment
P2 — Fix in current sprint
├── Affects 1% to 20% of users
├── Workaround exists
└── Non-core feature broken
P3 — Backlog
├── Visual/cosmetic issue
├── Affects <1% of users
├── Edge case with easy workaround
└── Enhancement misclassified as bug
Setting Up the System
You can't make good incident vs. bug decisions in the moment if you haven't defined them in advance. When everything is on fire is the worst time to debate definitions.
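Defining things in advance can be as simple as writing the rules down as code. As one possible reading of the P1/P2/P3 tiers from the previous section, here is a small Python sketch. The `BugReport` fields are hypothetical, and the order of the checks (P1 first, then P3, otherwise P2) is an assumption, since the tiers themselves do not say how to resolve overlaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BugReport:
    """Facts about a bug that has already been ruled out as an incident."""
    affected_fraction: float              # share of users affected, 0.0 to 1.0
    data_loss_risk: bool = False
    security_vulnerability: bool = False
    core_flow_broken: bool = False        # core flow broken for a segment
    cosmetic_only: bool = False
    is_enhancement: bool = False          # enhancement misfiled as a bug


def bug_priority(bug: BugReport) -> str:
    """Map a bug to P1, P2, or P3 using the tier criteria."""
    # P1 criteria are checked first so a severe bug is never downgraded.
    if (bug.data_loss_risk or bug.security_vulnerability
            or bug.core_flow_broken or bug.affected_fraction > 0.20):
        return "P1"
    # Assumption: cosmetic issues, misfiled enhancements, and bugs that
    # affect under 1% of users all go to the backlog.
    if bug.cosmetic_only or bug.is_enhancement or bug.affected_fraction < 0.01:
        return "P3"
    return "P2"


print(bug_priority(BugReport(affected_fraction=0.30)))                        # P1
print(bug_priority(BugReport(affected_fraction=0.05)))                        # P2
print(bug_priority(BugReport(affected_fraction=0.002, cosmetic_only=True)))   # P3
```

Criteria such as "workaround exists" are left out because they only distinguish P2 from P3 in combination with judgment. A real version would likely need them.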
Document these before you need them:
1. What counts as "user blocked" for your product?
Define your core user flows. For a food delivery app: search → select → add to cart → checkout → payment. Any of these failing = incident. Everything else = bug, unless it has business impact (question 2).
2. What counts as business impact?
Agree on thresholds. Revenue drop >X%? SLA breach imminent? Data loss of any kind? Write it down.
3. Who declares an incident?
Any engineer on call should have the authority. Don't require manager approval — that delay costs you during a real incident. Trust your team to use the framework.
4. What's the first action when an incident is declared?
Have a runbook. Ours is:
1. Post in #incidents: "Incident declared: [description] | Owner: @name"
2. Check status page — update if user-facing
3. Start a video call if more than one person is needed
4. Owner sends updates every 30 minutes until resolved
5. After resolution: schedule post-mortem within 48 hours
[!NOTE] The post-mortem is not optional. That's where you find the systemic issue behind the incident. Without it, you'll have the same incident again in 3 months.
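The time-based parts of a runbook like this are easy to encode so nobody has to do arithmetic at 11pm. The sketch below is a rough illustration in Python: the function names are hypothetical, the timestamps are made-up sample values, and it only builds the message text and deadlines. Actually posting to a chat channel or status page is left out.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=30)   # step 4: updates every 30 minutes
POSTMORTEM_WINDOW = timedelta(hours=48)   # step 5: post-mortem within 48 hours


def declaration_message(description: str, owner: str) -> str:
    """Text for the #incidents post in step 1."""
    return f"Incident declared: {description} | Owner: @{owner}"


def next_update_due(last_update: datetime) -> datetime:
    """When the owner owes the next status update."""
    return last_update + UPDATE_INTERVAL


def postmortem_deadline(resolved_at: datetime) -> datetime:
    """Latest time by which the post-mortem should be scheduled."""
    return resolved_at + POSTMORTEM_WINDOW


# Sample timestamps, invented for the illustration.
declared_at = datetime(2024, 1, 15, 23, 0)
resolved_at = datetime(2024, 1, 16, 0, 15)

print(declaration_message("Login returns 401 for valid credentials", "oncall"))
# Incident declared: Login returns 401 for valid credentials | Owner: @oncall
print(next_update_due(declared_at))      # 2024-01-15 23:30:00
print(postmortem_deadline(resolved_at))  # 2024-01-18 00:15:00
```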
The Rule in One Sentence
If users are blocked or business is impacted, it's an incident. Everything else is a bug.
Apply it fast, apply it consistently, and trust it. The framework exists so you don't have to think clearly at 11pm when you're half asleep and something is on fire.
Speed of correct classification is a skill. It comes from using the same two questions every time until they're automatic.
Sudarshan Chaudhari
AI Systems Builder / Product Engineer
Bangkok, Thailand
Solo Android developer with 13+ years in QA, building Android apps, AI automation systems, and developer tools at SudarshanTechLabs.
