February 13, 2026 · 4 min read

Intermittent Bugs: How to Handle the Hardest Bugs in QA

Intermittent bugs appear inconsistently, defy reproduction, and are the hardest class of issues to resolve. Here's a systematic approach to investigating, documenting, and ultimately resolving them.

Tags: Testing, Debugging, Bug Reports, Best Practices

Intermittent bugs are the most frustrating category in QA. They're real — you've seen them. But they don't happen consistently enough to pin down. Developers can't reproduce them. "Cannot reproduce" tickets pile up. And then the same bug surfaces in production.

Here's how to handle them systematically.


Why Intermittent Bugs Are Hard

Non-deterministic reproduction. By definition, you can't make them happen on demand. Traditional bug investigation assumes you can reliably trigger the failure. Intermittent bugs break this assumption.

Time-sensitive state. Many intermittent bugs depend on state that changes — session tokens that expire, network requests that race, background jobs that run at specific intervals. By the time you investigate, the state has changed.

Multiple contributing factors. A bug that appears only when the user has been logged in for 4+ hours, is on a cellular connection, and has low storage will not be found by testing any one of those factors individually.
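To make the combination problem concrete, here is a minimal sketch that enumerates every on/off combination of a set of suspected factors, so each can be tried as its own test condition. The `combinations` function and the factor names are illustrative assumptions, not part of any testing library.

```kotlin
// All on/off combinations of the given factors (2^n entries).
// Each result is the set of factors that are "on" for that test condition.
fun combinations(factors: List<String>): List<Set<String>> =
    (0 until (1 shl factors.size)).map { mask ->
        factors.filterIndexed { index, _ -> ((mask shr index) and 1) == 1 }.toSet()
    }

fun main() {
    val factors = listOf("session over 4h", "cellular", "low storage")
    val conditions = combinations(factors)
    println(conditions.size) // 8
    conditions.forEach { println(it.ifEmpty { setOf("baseline") }) }
}
```

Three factors already produce eight conditions, and only one of them has all three switched on, which is why single-factor testing misses this kind of bug.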


Classification: Is It Actually Intermittent?

First, verify you're dealing with a genuinely intermittent bug. Some "intermittent" bugs are:

  • Consistently reproducible with the right steps (you're missing a precondition)
  • Device-specific (consistent on that device, not on yours)
  • Account-state-specific (consistent for users with that data state, not your test account)
  • Timing-related (consistent if you reproduce the exact timing)

Test these alternatives before accepting "intermittent" as the classification. A bug that occurs 1 time in 10 under identical conditions is intermittent. A bug that reproduces 10/10 on a Xiaomi with MIUI but 0/10 on your Pixel is device-specific, which is actually more tractable.
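One way to check the device-specific alternative is to tally reproduction attempts per device before settling on a label. The sketch below is only illustrative: the `Attempt` type, the `classify` function, and its all-or-nothing rule are assumptions, and real data will rarely be this clean.

```kotlin
// Hypothetical record of one reproduction attempt.
data class Attempt(val device: String, val reproduced: Boolean)

// Reproduction rate per device, as a fraction between 0.0 and 1.0.
fun ratesByDevice(attempts: List<Attempt>): Map<String, Double> =
    attempts.groupBy { it.device }
        .mapValues { (_, tries) -> tries.count { it.reproduced }.toDouble() / tries.size }

// Rough label: if every device is all-or-nothing, the bug is more likely tied
// to the device than truly intermittent.
fun classify(attempts: List<Attempt>): String {
    val rates = ratesByDevice(attempts).values
    return when {
        rates.isEmpty() -> "no data"
        rates.all { it == 0.0 } -> "not reproduced"
        rates.all { it == 1.0 } -> "consistently reproducible"
        rates.all { it == 0.0 || it == 1.0 } -> "device-specific"
        else -> "intermittent"
    }
}

fun main() {
    val attempts = List(10) { Attempt("Xiaomi (MIUI)", true) } +
        List(10) { Attempt("Pixel", false) }
    println(ratesByDevice(attempts))
    println(classify(attempts)) // device-specific
}
```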


Documentation Strategy for Intermittent Bugs

Document every occurrence with maximum detail:

```
Occurrence 1: 2026-02-13 14:23
Device: Samsung A52, Android 12, One UI 4.1
Session duration: ~2 hours
Action: Submitted form while receiving push notification
Network: LTE, 2 bars
Behavior: Loading spinner appeared then disappeared, form state reset
Logs attached: session_2026-02-13-1423.txt

Occurrence 2: 2026-02-15 09:47
Device: Xiaomi Redmi Note 11, MIUI 13
Session duration: ~45 min
Action: Standard form submission, no concurrent events
Network: Wi-Fi
Behavior: Same (spinner then reset, no error)
Logs attached: session_2026-02-15-0947.txt
```

Multiple occurrences with detailed logs start to reveal patterns that a single occurrence can't show. Shared attributes across occurrences narrow the cause.
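Finding the shared attributes can be done by hand, but once the occurrence list grows, a small script helps. This is a minimal sketch that assumes each occurrence is recorded as a map of attribute name to value; the `Occurrence` alias and `sharedAttributes` function are hypothetical names.

```kotlin
// Hypothetical occurrence record: attribute name -> observed value.
typealias Occurrence = Map<String, String>

// Attributes whose value is identical in every occurrence.
fun sharedAttributes(occurrences: List<Occurrence>): Map<String, String> {
    if (occurrences.isEmpty()) return emptyMap()
    return occurrences.first().filter { (key, value) ->
        occurrences.all { it[key] == value }
    }
}

fun main() {
    val first = mapOf(
        "device" to "Samsung A52",
        "network" to "LTE",
        "action" to "form submit",
        "behavior" to "spinner then reset"
    )
    val second = mapOf(
        "device" to "Xiaomi Redmi Note 11",
        "network" to "Wi-Fi",
        "action" to "form submit",
        "behavior" to "spinner then reset"
    )
    println(sharedAttributes(listOf(first, second)))
    // {action=form submit, behavior=spinner then reset}
}
```

Attributes that differ between occurrences, such as the device and network type here, are less likely to be required for the bug to appear.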


The Instrumentation Approach

For intermittent bugs that don't have enough occurrence documentation:

  1. Add targeted logging around the suspected code path:

```kotlin
// Log the inputs that might matter before the suspected code runs
Log.d(TAG, "Form submit started: userId=${user.id}, sessionAge=${sessionAge}ms, networkType=$networkType")
// ... existing code ...
// Assuming an OkHttp Response: peekBody() reads a copy, whereas body?.string()
// would consume the body before the app's own code can read it
Log.d(TAG, "Form submit response: status=${response.code}, body=${response.peekBody(64 * 1024).string()}")
```

  2. Enable crash breadcrumbs for the flow:

```kotlin
FirebaseCrashlytics.getInstance().log("Form submit: ${formState.serialize()}")
```

  3. Ship the instrumented version and wait for the next occurrence

  4. Collect the logs from the next occurrence and use them to understand the state

The "instrument and wait" approach is slow but reliable for intermittent bugs that can't be forced.
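If logs need to be captured at the moment the symptom is detected rather than pulled from the device afterwards, one option is a small in-memory trail of recent events. The sketch below assumes a hypothetical `BreadcrumbBuffer` class; it is not part of Android or Crashlytics.

```kotlin
// Hypothetical in-memory breadcrumb trail: keeps only the most recent entries.
class BreadcrumbBuffer(private val capacity: Int) {
    private val entries = ArrayDeque<String>()

    init {
        require(capacity > 0) { "capacity must be positive" }
    }

    // Adds a message, dropping the oldest one once the buffer is full.
    @Synchronized
    fun record(message: String) {
        if (entries.size == capacity) entries.removeFirst()
        entries.addLast(message)
    }

    // Returns a copy of the current trail, oldest entry first.
    @Synchronized
    fun snapshot(): List<String> = entries.toList()
}

fun main() {
    val breadcrumbs = BreadcrumbBuffer(capacity = 3)
    listOf("screen opened", "field edited", "submit tapped", "spinner shown", "form reset")
        .forEach(breadcrumbs::record)
    // When the symptom is detected, attach the snapshot to the bug report.
    println(breadcrumbs.snapshot()) // [submit tapped, spinner shown, form reset]
}
```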


Common Intermittent Bug Patterns

Race conditions: Two concurrent operations modify shared state. The result depends on which one finishes first, so it is non-deterministic.

```kotlin
// Symptom: sometimes works, sometimes fails
// Cause: two coroutines both writing to the same state
launch { updateUserState() }  // Race!
launch { fetchUserData() }    // Race!
```
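The same failure mode can be shown in a runnable form. This sketch swaps the coroutines for plain JVM threads so that it needs only the standard library; `UnsafeCounter`, `SafeCounter`, and `hammer` are illustrative names, and the counter stands in for whatever shared state the two operations touch.

```kotlin
import java.util.concurrent.atomic.AtomicInteger
import kotlin.concurrent.thread

// Unsafe: `count++` is a read, an add, and a write, so concurrent updates can be lost.
class UnsafeCounter {
    var count = 0
    fun increment() { count++ }
}

// Safe: the increment is a single atomic operation.
class SafeCounter {
    private val value = AtomicInteger(0)
    fun increment() { value.incrementAndGet() }
    val count: Int get() = value.get()
}

// Runs `action` from several threads at once and waits for all of them.
fun hammer(threads: Int, perThread: Int, action: () -> Unit) {
    List(threads) { thread { repeat(perThread) { action() } } }.forEach { it.join() }
}

fun main() {
    val unsafe = UnsafeCounter()
    val safe = SafeCounter()
    hammer(4, 100_000) { unsafe.increment() }
    hammer(4, 100_000) { safe.increment() }
    println("unsafe: ${unsafe.count}") // varies per run, often below 400000
    println("safe: ${safe.count}")     // 400000
}
```

The unsafe total changes from run to run, which is exactly the "sometimes works, sometimes fails" symptom described above.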

Memory-related: The bug only appears after extended use when memory is fragmented or approaching limits.

Token expiry at operation time: Session token expires between the start of a multi-step operation and its completion. Appears random but correlates with session age.
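A common guard is to check, before starting, that the token will outlive the whole operation. The following is a sketch under assumptions: `Token`, `outlives`, and `tokenFor` are hypothetical names, and `refresh` stands in for whatever call the app actually uses to obtain a new token.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical session token with a known expiry time.
data class Token(val value: String, val expiresAt: Instant)

// True if the token will still be valid after `needed` has elapsed from `now`.
fun outlives(token: Token, needed: Duration, now: Instant): Boolean =
    token.expiresAt.isAfter(now.plus(needed))

// Refreshes before starting when the token would expire mid-operation.
fun tokenFor(current: Token, needed: Duration, now: Instant, refresh: () -> Token): Token =
    if (outlives(current, needed, now)) current else refresh()

fun main() {
    val now = Instant.parse("2026-02-13T14:23:00Z")
    val aging = Token("old", expiresAt = now.plusSeconds(20))
    // The operation is expected to take up to 60 seconds, but the token has 20 left.
    val token = tokenFor(aging, needed = Duration.ofSeconds(60), now = now) {
        Token("fresh", expiresAt = now.plusSeconds(3600))
    }
    println(token.value) // fresh
}
```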

Network request interleaving: Two network requests complete in an unexpected order. The order depends on network timing, so it is non-deterministic.
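One defensive pattern for interleaving is to tag each request with an increasing id and drop any response that a newer request has superseded. This is a minimal sketch with a hypothetical `LatestOnly` class, not an API from any networking library.

```kotlin
// Hypothetical guard that ignores responses from superseded requests.
class LatestOnly<T> {
    private var latestId = 0

    @Volatile
    var value: T? = null
        private set

    // Call when a request is sent; returns the id to pass back with its response.
    @Synchronized
    fun begin(): Int = ++latestId

    // Applies the result only if no newer request has started since.
    @Synchronized
    fun complete(id: Int, result: T): Boolean {
        if (id != latestId) return false
        value = result
        return true
    }
}

fun main() {
    val profile = LatestOnly<String>()
    val first = profile.begin()
    val second = profile.begin()
    // Responses arrive in the opposite order to the requests.
    println(profile.complete(second, "new data")) // true
    println(profile.complete(first, "old data"))  // false
    println(profile.value)                        // new data
}
```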

[!TIP] If you have an intermittent bug with no clear pattern after 3+ occurrences, look at timing. Calculate the time between the start of the session/flow and the bug occurrence. If there's a cluster around a specific duration (e.g., 30-45 minutes), it's likely a token or cache expiry issue.
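The clustering check in the tip can be sketched as a simple histogram. The `bucketByDuration` function and the 15-minute bucket width are assumptions for illustration, and the durations in `main` are made-up values, not real measurements.

```kotlin
// Groups time-to-failure values (in minutes) into fixed-width buckets.
// Keys are bucket start minutes, values are occurrence counts.
fun bucketByDuration(minutesToFailure: List<Int>, bucketWidth: Int = 15): Map<Int, Int> =
    minutesToFailure
        .groupingBy { (it / bucketWidth) * bucketWidth }
        .eachCount()
        .toSortedMap()

fun main() {
    val width = 15
    // Made-up durations for illustration only.
    val minutes = listOf(31, 44, 38, 120, 35, 42)
    for ((start, count) in bucketByDuration(minutes, width)) {
        println("$start-${start + width} min: ${"#".repeat(count)}")
    }
}
```

A single tall bucket, like the 30-45 minute one in this example data, is the kind of cluster worth comparing against token lifetimes and cache TTLs.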


Resolution Without Consistent Reproduction

Sometimes you can fix an intermittent bug without ever reliably reproducing it:

  1. Code review the suspected area for race conditions, missing error handling, or state management issues
  2. Add defensive programming around the suspected code path (null checks, proper error handling, state re-validation)
  3. Add specific monitoring to detect the bug when it occurs post-fix
  4. Ship and monitor: if the occurrence rate stays at zero for long enough that the old rate would have produced several occurrences, the fix probably worked

This "fix by inspection and monitor" approach isn't ideal but is often the pragmatic path forward for intermittent bugs that resist reproduction.
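How long to monitor before trusting a zero occurrence rate depends on how often the bug used to appear. As a rough guide, if the bug hit a fraction `rate` of sessions before the fix, the chance of seeing `n` clean sessions in a row with the bug still present is `(1 - rate)^n`. The sketch below solves that for `n`; it assumes sessions are independent and that the old rate is known, both of which are simplifications, and `sessionsToMonitor` is a hypothetical helper.

```kotlin
import kotlin.math.ceil
import kotlin.math.ln

// Sessions needed so that zero occurrences would be unlikely if the bug were
// still present at its old rate: solves (1 - rate)^n <= maxChance for n.
fun sessionsToMonitor(ratePerSession: Double, maxChance: Double = 0.05): Int {
    require(ratePerSession > 0.0 && ratePerSession < 1.0) { "rate must be between 0 and 1" }
    require(maxChance > 0.0 && maxChance < 1.0) { "maxChance must be between 0 and 1" }
    return ceil(ln(maxChance) / ln(1.0 - ratePerSession)).toInt()
}

fun main() {
    // A 1-in-10 bug: how many clean sessions before trusting the fix?
    println(sessionsToMonitor(0.10)) // 29
    // A 1-in-100 bug needs roughly ten times as many.
    println(sessionsToMonitor(0.01)) // 299
}
```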

The goal with intermittent bugs is never perfect certainty before fixing. It's reducing uncertainty enough to act with confidence.

Sudarshan Chaudhari

AI Systems Builder / Product Engineer

Bangkok, Thailand

Solo Android developer with 13+ years in QA, building Android apps, AI automation systems, and developer tools at SudarshanTechLabs.
