Intermittent Bugs: How to Handle the Hardest Bugs in QA
Intermittent bugs appear inconsistently, defy reproduction, and are the hardest class of issues to resolve. Here's a systematic approach to investigating, documenting, and ultimately resolving them.
Intermittent bugs are the most frustrating category in QA. They're real — you've seen them. But they don't happen consistently enough to pin down. Developers can't reproduce them. "Cannot reproduce" tickets pile up. And then the same bug surfaces in production.
Here's how to handle them systematically.
Why Intermittent Bugs Are Hard
Non-deterministic reproduction. By definition, you can't make them happen on demand. Traditional bug investigation assumes you can reliably trigger the failure. Intermittent bugs break this assumption.
Time-sensitive state. Many intermittent bugs depend on state that changes — session tokens that expire, network requests that race, background jobs that run at specific intervals. By the time you investigate, the state has changed.
Multiple contributing factors. A bug that appears only when the user has been logged in for 4+ hours, is on a cellular connection, and has low storage will not be found by testing any one of those factors in isolation.
Classification: Is It Actually Intermittent?
First, verify you're dealing with a genuinely intermittent bug. Some "intermittent" bugs are:
- Consistently reproducible with the right steps (you're missing a precondition)
- Device-specific (consistent on that device, not on yours)
- Account-state-specific (consistent for users with that data state, not your test account)
- Timing-related (consistent if you reproduce the exact timing)
Test these alternatives before accepting "intermittent" as the classification. A bug that's 1-in-10 is intermittent. A bug that reproduces 10/10 on a Xiaomi with MIUI but 0/10 on your Pixel is device-specific, which is actually more tractable.
Documentation Strategy for Intermittent Bugs
Document every occurrence with maximum detail:
Occurrence 1: 2026-02-13 14:23
Device: Samsung A52, Android 12, One UI 4.1
Session duration: ~2 hours
Action: Submitted form while receiving push notification
Network: LTE, 2 bars
Behavior: Loading spinner appeared then disappeared, form state reset
Logs attached: session_2026-02-13-1423.txt
Occurrence 2: 2026-02-15 09:47
Device: Xiaomi Redmi Note 11, MIUI 13
Session duration: ~45 min
Action: Standard form submission, no concurrent events
Network: Wi-Fi
Behavior: Same — spinner then reset, no error
Logs attached: session_2026-02-15-0947.txt
Multiple occurrences with detailed logs start to reveal patterns that a single occurrence can't show. Shared attributes across occurrences narrow the cause.
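One way to look for shared attributes is to store each occurrence as structured data and intersect the fields. The sketch below is only an illustration: the `Occurrence` type, the attribute keys, and the normalized values (for example, treating both actions as "form submission") are assumptions, not part of any existing tool.

```kotlin
// Hypothetical record for one logged occurrence; keys and values are illustrative.
data class Occurrence(
    val timestamp: String,
    val attributes: Map<String, String>,
)

// Returns the attribute key/value pairs that are identical in every occurrence.
fun sharedAttributes(occurrences: List<Occurrence>): Map<String, String> {
    if (occurrences.isEmpty()) return emptyMap()
    return occurrences.first().attributes.filter { (key, value) ->
        occurrences.all { it.attributes[key] == value }
    }
}

// The two occurrences documented above, reduced to comparable fields (an assumption).
fun sampleOccurrences(): List<Occurrence> = listOf(
    Occurrence(
        "2026-02-13 14:23",
        mapOf(
            "device" to "Samsung A52",
            "network" to "LTE",
            "action" to "form submission",
            "behavior" to "spinner then reset",
        ),
    ),
    Occurrence(
        "2026-02-15 09:47",
        mapOf(
            "device" to "Xiaomi Redmi Note 11",
            "network" to "Wi-Fi",
            "action" to "form submission",
            "behavior" to "spinner then reset",
        ),
    ),
)
```

Calling `sharedAttributes(sampleOccurrences())` leaves only the action and behavior, which suggests device and network are probably not the deciding factors. The result is only as good as the consistency of the recorded fields.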
The Instrumentation Approach
When you don't yet have enough documented occurrences to see a pattern:
- Add targeted logging around the suspected code path:
// Add detailed logging around the suspected area
Log.d(TAG, "Form submit started: userId=${user.id}, sessionAge=${sessionAge}ms, networkType=$networkType")
// ... existing code ...
// Assuming an OkHttp Response: peekBody() reads a copy, whereas body?.string() would consume the body before the real caller can read it
Log.d(TAG, "Form submit response: status=${response.code}, body=${response.peekBody(2048).string()}")
- Enable crash breadcrumbs for the flow:
FirebaseCrashlytics.getInstance().log("Form submit: ${formState.serialize()}")
- Ship the instrumented version and wait for the next occurrence
- Collect the logs from the next occurrence and use them to understand the state
The "instrument and wait" approach is slow but reliable for intermittent bugs that can't be forced.
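If continuous verbose logging is too noisy, one option is to keep only the most recent breadcrumbs in memory and attach them when the symptom is detected. The following is a minimal sketch under that assumption. `BreadcrumbBuffer` is a hypothetical helper, not part of Android or Crashlytics.

```kotlin
// Hypothetical helper: keeps the most recent breadcrumbs in memory.
class BreadcrumbBuffer(private val capacity: Int) {
    private val entries = ArrayDeque<String>()

    init {
        require(capacity > 0) { "capacity must be positive" }
    }

    // Adds a breadcrumb, dropping the oldest one once the buffer is full.
    @Synchronized
    fun record(message: String) {
        if (entries.size == capacity) entries.removeFirst()
        entries.addLast(message)
    }

    // Returns a copy of the current breadcrumbs, oldest first.
    @Synchronized
    fun snapshot(): List<String> = entries.toList()
}
```

When the app detects the symptom (for example, the form state resetting without an error), it could write `snapshot()` to the log or to the crash reporter so the lead-up to the failure is captured.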
Common Intermittent Bug Patterns
Race conditions: Two concurrent operations modify shared state. Depends on which finishes first — non-deterministic.
// Symptom: sometimes works, sometimes fails
// Cause: two coroutines both writing to the same state
launch { updateUserState() } // Race!
launch { fetchUserData() } // Race!
Memory-related: The bug only appears after extended use when memory is fragmented or approaching limits.
Token expiry at operation time: Session token expires between the start of a multi-step operation and its completion. Appears random but correlates with session age.
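A defensive pattern for the token-expiry case is to re-check the token before every step of a multi-step operation, not only once at the start. The sketch below assumes a hypothetical `SessionToken` type and a caller-supplied `refresh` function. The 30-second margin is an arbitrary illustrative value.

```kotlin
// Hypothetical token model; expiry is in epoch milliseconds.
data class SessionToken(val value: String, val expiresAtMillis: Long)

// True if the token stays valid for longer than marginMillis from now.
fun isUsable(token: SessionToken, nowMillis: Long, marginMillis: Long): Boolean =
    token.expiresAtMillis - nowMillis > marginMillis

// Runs each step after re-checking the token, refreshing it when it is about to expire.
fun runSteps(
    initial: SessionToken,
    steps: List<(SessionToken) -> Unit>,
    now: () -> Long,
    refresh: () -> SessionToken,
    marginMillis: Long = 30_000,
): SessionToken {
    var token = initial
    for (step in steps) {
        if (!isUsable(token, now(), marginMillis)) token = refresh()
        step(token)
    }
    return token
}
```

Passing the clock in as `now` keeps the check testable, which matters for a bug that otherwise only shows up at a particular session age.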
Network request interleaving: Two network requests complete in an unexpected order. Depends on network timing — non-deterministic.
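One common guard against interleaved responses is to number each request and ignore any response that is not for the most recent one. This is a sketch of that idea, not a drop-in fix. `LatestOnly` is a hypothetical class, and it assumes only the newest request's result should ever be applied.

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Hypothetical guard: applies a response only if no newer request has been issued.
class LatestOnly<T> {
    private val issued = AtomicLong(0)

    @Volatile
    var current: T? = null
        private set

    // Call when a request is sent; returns the ticket for that request.
    fun begin(): Long = issued.incrementAndGet()

    // Call when a response arrives; returns false if the response was stale and dropped.
    @Synchronized
    fun complete(ticket: Long, value: T): Boolean {
        if (ticket != issued.get()) return false
        current = value
        return true
    }
}
```

With this in place, a slow response to an earlier request can no longer overwrite the result of a later one, whichever order the network delivers them in.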
[!TIP] If you have an intermittent bug with no clear pattern after 3+ occurrences, look at timing. Calculate the time between the start of the session/flow and the bug occurrence. If there's a cluster around a specific duration (e.g., 30-45 minutes), it's likely a token or cache expiry issue.
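The timing check in the tip can be done by hand, but a small helper makes clusters easier to spot. The sketch below groups elapsed minutes (session start to bug occurrence) into fixed-width buckets. The function names and the 15-minute default width are assumptions chosen for illustration.

```kotlin
// Counts occurrences per bucket; keys are the bucket start in minutes.
fun bucketCounts(elapsedMinutes: List<Int>, bucketWidth: Int = 15): Map<Int, Int> {
    require(bucketWidth > 0) { "bucketWidth must be positive" }
    return elapsedMinutes
        .groupingBy { (it / bucketWidth) * bucketWidth }
        .eachCount()
        .toSortedMap()
}

// Returns the start of the bucket with the most occurrences, or null if there is no data.
fun densestBucket(elapsedMinutes: List<Int>, bucketWidth: Int = 15): Int? =
    bucketCounts(elapsedMinutes, bucketWidth).maxByOrNull { it.value }?.key
```

For example, with made-up durations of 32, 41, 38, 120, and 35 minutes, four of the five fall in the 30 to 44 minute bucket, which would point toward an expiry of roughly that length.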
Resolution Without Consistent Reproduction
Sometimes you can fix an intermittent bug without ever reliably reproducing it:
- Code review the suspected area for race conditions, missing error handling, or state management issues
- Add defensive programming around the suspected code path (null checks, proper error handling, state re-validation)
- Add specific monitoring to detect the bug when it occurs post-fix
- Ship and monitor — if the occurrence rate drops to zero, the fix worked
This "fix by inspection and monitor" approach isn't ideal but is often the pragmatic path forward for intermittent bugs that resist reproduction.
The goal with intermittent bugs is never perfect certainty before fixing. It's reducing uncertainty enough to act with confidence.
Sudarshan Chaudhari
AI Systems Builder / Product Engineer
Bangkok, Thailand
Solo Android developer with 13+ years in QA, building Android apps, AI automation systems, and developer tools at SudarshanTechLabs.
