May 10, 2026 · 6 min read

Debugging Android Production Crashes Across 22 Apps: The System I Use

When you maintain 22 Android apps simultaneously, production debugging has to be systematic. This is the exact triage process, tooling setup, and root-cause workflow I use to go from crash alert to fix in under 2 hours — even for intermittent, hard-to-reproduce issues.

22 apps. 80+ repositories. One engineer.

When a production crash comes in, I have to move fast with limited context. A crash on app 14 at 11pm after I've been working on app 7 all day — I need a system that gets me to root cause quickly, without spending 45 minutes reconstructing what that app even does.

Here's the exact system I use.


Layer 1: Crash Tagging at Initialization

The foundation is aggressive crash context tagging. Every app initializes the same way:

kotlin
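import android.app.Application
import android.os.Build
import com.google.firebase.crashlytics.FirebaseCrashlytics

class MyApplication : Application() {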

    override fun onCreate() {
        super.onCreate()
        initCrashlytics()
    }

    private fun initCrashlytics() {
        FirebaseCrashlytics.getInstance().apply {
            // Device context
            setCustomKey("device_model", Build.MODEL)
            setCustomKey("manufacturer", Build.MANUFACTURER)
            setCustomKey("os_version", Build.VERSION.RELEASE)
            setCustomKey("firmware", Build.DISPLAY)          // captures OEM build
            setCustomKey("sdk_int", Build.VERSION.SDK_INT.toString())

            // App context
            setCustomKey("app_version", BuildConfig.VERSION_NAME)
            setCustomKey("version_code", BuildConfig.VERSION_CODE.toString())
            setCustomKey("build_type", BuildConfig.BUILD_TYPE)

            // Session context
            setCustomKey("session_start", System.currentTimeMillis().toString())
            setCustomKey("install_source", getInstallSource())
        }
    }

    private fun getInstallSource(): String {
        return try {
            if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.R) {
                packageManager.getInstallSourceInfo(packageName).installingPackageName
                    ?: "unknown"
            } else {
                @Suppress("DEPRECATION")
                packageManager.getInstallerPackageName(packageName) ?: "unknown"
            }
        } catch (e: Exception) {
            "error"
        }
    }
}

The firmware (Build.DISPLAY) tag is the one I use most. It caught three different OS-regression crashes in 2025: in each case, the crash rate spiked on one specific firmware build while other builds were clean.
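
A minimal sketch of that kind of firmware comparison, written as plain Kotlin over exported session data. The `SessionSample` type, the `findFirmwareOutliers` function, and the default thresholds are assumptions for illustration; none of them are part of the Crashlytics SDK.

```kotlin
// Hypothetical export row: one app session, tagged with the "firmware" custom key.
data class SessionSample(val firmware: String, val crashed: Boolean)

// Returns the firmware builds whose crash rate is at least `factor` times the
// overall crash rate, mapped to that build's crash rate.
// Builds with fewer than `minSessions` sessions are skipped as too noisy to judge.
fun findFirmwareOutliers(
    samples: List<SessionSample>,
    factor: Double = 2.0,
    minSessions: Int = 50
): Map<String, Double> {
    if (samples.isEmpty()) return emptyMap()
    val overallRate = samples.count { it.crashed }.toDouble() / samples.size
    if (overallRate == 0.0) return emptyMap()
    return samples
        .groupBy { it.firmware }
        .filterValues { it.size >= minSessions }
        .mapValues { (_, group) -> group.count { it.crashed }.toDouble() / group.size }
        .filterValues { rate -> rate >= factor * overallRate }
}
```

A build that shows up in the result while its neighbours do not is the pattern described above: one firmware spiking while the others stay clean.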


Layer 2: Screen and Action Breadcrumbs

Crashlytics breadcrumbs tell you what the user was doing before the crash. Without them, you have a stack trace but no user journey.

kotlin
object CrashBreadcrumb {
    fun screen(name: String) {
        FirebaseCrashlytics.getInstance().log("SCREEN: $name")
    }

    fun action(name: String, params: Map<String, String> = emptyMap()) {
        val paramStr = if (params.isEmpty()) "" else " | ${params.entries.joinToString { "${it.key}=${it.value}" }}"
        FirebaseCrashlytics.getInstance().log("ACTION: $name$paramStr")
    }

    fun networkCall(endpoint: String, statusCode: Int) {
        FirebaseCrashlytics.getInstance().log("NETWORK: $endpoint → $statusCode")
    }

    fun state(key: String, value: String) {
        FirebaseCrashlytics.getInstance().setCustomKey("state_$key", value)
    }
}

Usage in practice:

kotlin
@Composable
fun HomeScreen(viewModel: HomeViewModel = hiltViewModel()) {
    LaunchedEffect(Unit) {
        CrashBreadcrumb.screen("HomeScreen")
    }

    val uiState by viewModel.uiState.collectAsStateWithLifecycle()

    HomeContent(
        uiState = uiState,
        onRefresh = {
            CrashBreadcrumb.action("Refresh", mapOf("trigger" to "user"))
            viewModel.refresh()
        }
    )
}

In Crashlytics, the breadcrumb log reads:

code
SCREEN: HomeScreen
ACTION: Refresh | trigger=user
NETWORK: /api/items → 200
SCREEN: DetailScreen

That tells me the user was on HomeScreen, refreshed, got a 200 from the API, navigated to DetailScreen — then crashed. The stack trace tells me where. The breadcrumbs tell me why.
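
Because the value of this log depends on a consistent line format, it can help to keep the formatting testable without Firebase on the classpath. The sketch below is a hypothetical variant of the helper (the `BreadcrumbLogger` name and the injected `sink` are assumptions) that produces the same line shapes as the log above.

```kotlin
// Hypothetical pure-Kotlin variant of the breadcrumb helper. The sink is injected,
// so the output format can be unit-tested without the Firebase SDK.
class BreadcrumbLogger(private val sink: (String) -> Unit) {

    fun screen(name: String) {
        sink("SCREEN: $name")
    }

    fun action(name: String, params: Map<String, String> = emptyMap()) {
        // joinToString separates entries with ", " by default.
        val suffix = if (params.isEmpty()) {
            ""
        } else {
            " | " + params.entries.joinToString { "${it.key}=${it.value}" }
        }
        sink("ACTION: $name$suffix")
    }

    fun networkCall(endpoint: String, statusCode: Int) {
        sink("NETWORK: $endpoint → $statusCode")
    }
}
```

In an app, the sink could simply forward each line to FirebaseCrashlytics.getInstance().log(...); in a unit test it can append to a list.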


Layer 3: The Triage Workflow

When a new crash alert arrives:

Step 1: Classify by volume (2 minutes)

Open Crashlytics → crash issue → check:

  • Crash-free rate: Is it below 99.5%? Below 99%? This determines urgency.
  • Affected versions: Does it affect the latest release only, or previous versions too?
  • Affected OS versions / firmware: Is this a regression from an OS update?
  • First occurrence: Did this start with my latest release, or did it exist before?

If the crash-free rate is below 98% on a single firmware version and the crash did not start with my latest release, I'm most likely looking at an OS regression, not a code bug. That calls for a different playbook.
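
The Step 1 checks can be written down as a small decision function. This is a sketch only: the 99.5%, 99%, and 98% thresholds come from the text above, while the `Urgency` labels, the `IssueSnapshot` type, and reading "my latest release isn't new" as "the crash did not start with the latest release" are assumptions.

```kotlin
// Hypothetical urgency labels; only the numeric thresholds come from the checklist.
enum class Urgency { MONITOR, INVESTIGATE, HOTFIX }

// Hypothetical snapshot of what Step 1 reads off the Crashlytics issue page.
data class IssueSnapshot(
    val crashFreeRate: Double,              // 0.0..1.0, whole app
    val worstFirmwareCrashFreeRate: Double, // lowest rate among firmware builds
    val startedWithLatestRelease: Boolean
)

fun urgency(snapshot: IssueSnapshot): Urgency = when {
    snapshot.crashFreeRate < 0.99 -> Urgency.HOTFIX
    snapshot.crashFreeRate < 0.995 -> Urgency.INVESTIGATE
    else -> Urgency.MONITOR
}

// One firmware build below 98% crash-free, on a release that was already
// in production before the crash appeared, points at the OS rather than the app.
fun looksLikeOsRegression(snapshot: IssueSnapshot): Boolean =
    snapshot.worstFirmwareCrashFreeRate < 0.98 && !snapshot.startedWithLatestRelease
```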

Step 2: Read the stack trace structurally (5 minutes)

I look for the first line that's in my code:

code
Fatal Exception: java.lang.NullPointerException
  at com.sudarshantechlabs.myfamilytracker.data.repository.LocationRepository.processUpdate(LocationRepository.kt:87)
  at com.sudarshantechlabs.myfamilytracker.data.repository.LocationRepository$locationFlow$1.invokeSuspend(LocationRepository.kt:54)
  at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)

LocationRepository.processUpdate:87 is the first frame in my code. The frame below it is the coroutine in the same file that called it (line 54), and everything below that is Kotlin coroutine internals. I go directly to line 87 of LocationRepository.kt.
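
The "first line that's in my code" rule is mechanical enough to sketch in a few lines. The function names below are hypothetical, and the input is assumed to be the stack trace as a list of text lines in the format shown above.

```kotlin
// Returns the first stack frame whose class lives under `appPackage`, or null
// if the trace never enters app code.
fun firstAppFrame(trace: List<String>, appPackage: String): String? =
    trace.map { it.trim() }
        .firstOrNull { it.startsWith("at $appPackage.") }
        ?.removePrefix("at ")

// Matches the trailing "(File.kt:123)" part of a frame.
val frameLocation = Regex("""\(([^:()]+):(\d+)\)$""")

// Extracts the source file and line number from a frame, or null if absent.
fun sourceLocation(frame: String): Pair<String, Int>? =
    frameLocation.find(frame)?.let { it.groupValues[1] to it.groupValues[2].toInt() }
```

Applied to the trace above with the package prefix com.sudarshantechlabs, this yields the processUpdate frame and the pair (LocationRepository.kt, 87).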

Step 3: Reproduce locally (10 minutes)

I look at the breadcrumbs for the user actions before the crash, then try to reproduce that sequence. In my experience, about 80% of crashes can be reproduced in a local build if the breadcrumbs are detailed enough.

If local reproduction fails, I set a custom key to record the specific state that led to the crash:

kotlin
// In LocationRepository, before the problematic operation:
CrashBreadcrumb.state("last_location_update", location?.toString() ?: "null")
CrashBreadcrumb.state("network_available", isNetworkAvailable.toString())
CrashBreadcrumb.state("background_state", isInBackground.toString())

Then I redeploy to internal testing and collect more crashes with better state data.

Step 4: Verify the fix on the affected device

The fix must be validated on:

  1. The device model/firmware where crashes were reported
  2. The Android version range affected

Not just "it compiles" — actually run the reproduction scenario on the target hardware.


Layer 4: The Intermittent Crash Problem

The hardest class of crashes: low volume, no consistent pattern, no reliable reproduction.

These are almost always timing issues: race conditions, background thread state, or UI events firing in unexpected order.

My approach:

kotlin
// Add explicit state machine to track object lifecycle
class LocationRepository @Inject constructor(
    private val locationSource: LocationDataSource
) {
    private enum class RepositoryState { IDLE, COLLECTING, STOPPED }
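    // May be read on a different thread than the one that updates it,
    // so mark it volatile to make writes visible across threads.
    @Volatile
    private var state = RepositoryState.IDLE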

    fun processUpdate(location: Location?) {
        if (state != RepositoryState.COLLECTING) {
            // Log the unexpected call instead of crashing
            FirebaseCrashlytics.getInstance().log(
                "processUpdate called in state $state — ignoring"
            )
            return
        }
        // safe to proceed
        location ?: return // explicit null guard
        // ... process
    }
}

Converting intermittent crashes into logged anomalies is often more valuable than trying to reproduce them. The non-fatal log tells you what state the object was in when the unexpected call arrived — information you can act on.
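
One detail worth keeping in mind: Crashlytics log() messages are uploaded alongside the next crash or non-fatal report rather than on their own, so surfacing an anomaly by itself usually means also calling recordException. The sketch below shows the same guard with an atomic state and an injected reporting hook. `GuardedCollector`, `CollectorState`, and `reportAnomaly` are hypothetical names, and the hook is an assumption about how this could be wired, not the code used in the apps.

```kotlin
import java.util.concurrent.atomic.AtomicReference

enum class CollectorState { IDLE, COLLECTING, STOPPED }

// `reportAnomaly` is a hypothetical hook. In an app it could forward the
// exception to FirebaseCrashlytics.getInstance().recordException(...).
class GuardedCollector<T : Any>(
    private val reportAnomaly: (IllegalStateException) -> Unit,
    private val process: (T) -> Unit
) {
    private val state = AtomicReference(CollectorState.IDLE)

    // Only IDLE -> COLLECTING is allowed; returns false if the transition was
    // not possible (already started, or already stopped).
    fun start(): Boolean =
        state.compareAndSet(CollectorState.IDLE, CollectorState.COLLECTING)

    fun stop() {
        state.set(CollectorState.STOPPED)
    }

    fun onUpdate(value: T?) {
        val current = state.get()
        if (current != CollectorState.COLLECTING) {
            // Report the unexpected call instead of crashing.
            reportAnomaly(IllegalStateException("update received in state $current"))
            return
        }
        if (value == null) return // explicit null guard
        process(value)
    }
}
```

The reported message carries the state the object was in when the unexpected call arrived, which is the piece of information a stack trace alone does not give you.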


The 22-App Version: What Changes at Scale

When you maintain 22 apps with identical architecture, crashes often occur across multiple apps simultaneously. An Android OS update can affect all of them.

Cross-app monitoring dashboard:

I built a simple monitoring view that pulls crash-free rates across all 22 apps from the Crashlytics API and shows them in one screen. When an OS update drops, I can see within 4 hours which apps are affected rather than checking each app's Firebase console separately.

python
# Pseudocode — actual impl uses Firebase Admin SDK
def get_crash_rates(app_ids: list[str]) -> dict:
    return {
        app_id: crashlytics_client.get_crash_free_rate(
            app_id=app_id,
            period_days=1
        )
        for app_id in app_ids
    }

# Alert if any app drops below threshold
for app_id, rate in get_crash_rates(ALL_APP_IDS).items():
    if rate < 0.995:
        send_alert(f"{app_id}: crash-free rate {rate:.1%}")

Shared root causes:

When multiple apps crash on the same day, the root cause is almost always:

  1. An Android/OEM OS update (check Build.DISPLAY clustering)
  2. A shared library I updated across all apps simultaneously
  3. A backend API change affecting all apps

The tagging system tells these apart: if the crashes cluster on one firmware build, it's #1. If they are spread evenly across firmware builds, it's #2 or #3.
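
That decision can be sketched as a small classifier over crash records that carry the custom keys. Everything here is illustrative: the `CrashRecord` type, the 0.8 clustering threshold, and the extra heuristic used to separate #2 from #3 (a shared library update ships inside a new app release, so its crashes should cluster on that app version, while a backend change hits every installed version) are assumptions, not rules from the text.

```kotlin
enum class SharedCause { OS_UPDATE, SHARED_LIBRARY, BACKEND_CHANGE, UNCLEAR }

// Hypothetical crash record carrying two of the custom keys set at startup.
data class CrashRecord(val firmware: String, val appVersion: String)

// Share of records held by the most common value of `key` (1.0 = fully clustered).
fun <K> topShare(records: List<CrashRecord>, key: (CrashRecord) -> K): Double {
    if (records.isEmpty()) return 0.0
    val largest = records.groupingBy(key).eachCount().values.maxOrNull() ?: 0
    return largest.toDouble() / records.size
}

fun classify(records: List<CrashRecord>, clusterThreshold: Double = 0.8): SharedCause = when {
    records.isEmpty() -> SharedCause.UNCLEAR
    // Crashes concentrated on one firmware build: OS or OEM update.
    topShare(records) { it.firmware } >= clusterThreshold -> SharedCause.OS_UPDATE
    // Spread across firmware but concentrated on one app version (assumed heuristic).
    topShare(records) { it.appVersion } >= clusterThreshold -> SharedCause.SHARED_LIBRARY
    // Spread across both: a backend change is the best remaining guess.
    else -> SharedCause.BACKEND_CHANGE
}
```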


The Metric That Matters: Time to Root Cause

Crash rate is a lagging indicator. Time to root cause is what you can improve.

Before the tagging and breadcrumb system: average 4-6 hours to understand intermittent production crashes.

After: average 90 minutes to root cause for common crash types, 4 hours for novel issues.
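
Tracking this metric only needs two timestamps per incident. A minimal sketch, assuming a hypothetical `Incident` record with the time the alert arrived and the time the root cause was identified:

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical incident log entry.
data class Incident(val alertedAt: Instant, val rootCauseAt: Instant)

// Mean time from alert to root cause, or null when there are no incidents.
fun averageTimeToRootCause(incidents: List<Incident>): Duration? {
    if (incidents.isEmpty()) return null
    val totalMillis = incidents.sumOf {
        Duration.between(it.alertedAt, it.rootCauseAt).toMillis()
    }
    return Duration.ofMillis(totalMillis / incidents.size)
}
```

Computing it separately for common and novel crash types keeps the two averages from hiding each other.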

The investment is in the logging infrastructure, not in being clever after a crash happens. When the crash arrives, the data is already there.

The best debugging strategy is the one you set up before you need it. Every crash that's hard to debug is a signal to add more structured logging before the next one.

Sudarshan Chaudhari

AI Systems Builder / Product Engineer

Bangkok, Thailand

Solo Android developer with 13+ years in QA, building Android apps, AI automation systems, and developer tools at SudarshanTechLabs.
