
fix: prevent binder buffer exhaustion causing crash loop#652

Merged
torlando-tech merged 2 commits into main from fix/binder-buffer-exhaustion-647 on Mar 11, 2026

Conversation

@torlando-tech
Owner

Summary

  • Catch RemoteException on provideAlternativeRelay() binder callbacks to prevent unhandled DeadObjectException fatal crashes (COLUMBA-14: 95 events/27 users, COLUMBA-30: 4 events/4 users)
  • Cache getPathTableHashes() results for 30s in RoutingManager to avoid blocking binder threads for 39+ seconds on large path tables, which saturates the 1MB per-process binder buffer
  • Reduce identity restore batch size from 500→200 and add 100ms delay between batches to let the binder buffer drain during initialization
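The first fix can be sketched roughly as follows. This is a minimal stand-alone illustration, not the PR's actual code: `RelayCallback` and `notifyAlternativeRelay` are hypothetical names, and the `RemoteException`/`DeadObjectException` classes here are plain-JVM stand-ins for `android.os.RemoteException` and `android.os.DeadObjectException` so the sketch runs outside Android. The key point is that `DeadObjectException` subclasses `RemoteException`, so one catch covers both:

```kotlin
// Plain-JVM stand-ins for android.os.RemoteException / DeadObjectException.
open class RemoteException(message: String) : Exception(message)
class DeadObjectException : RemoteException("binder target died")

// Hypothetical stand-in for the AIDL callback interface.
interface RelayCallback {
    fun provideAlternativeRelay(relayHash: String?)
}

// Guarded binder callback: if the client process died, the binder call throws
// DeadObjectException (a RemoteException subclass). We log and swallow it
// instead of letting it crash the service process.
fun notifyAlternativeRelay(callback: RelayCallback, relayHash: String?): Boolean =
    try {
        callback.provideAlternativeRelay(relayHash)
        true
    } catch (e: RemoteException) {
        println("Client died before provideAlternativeRelay completed: ${e.message}")
        false
    }

fun main() {
    val deadClient = object : RelayCallback {
        override fun provideAlternativeRelay(relayHash: String?) =
            throw DeadObjectException()
    }
    check(!notifyAlternativeRelay(deadClient, "abc123")) // survives a dead binder
    val liveClient = object : RelayCallback {
        override fun provideAlternativeRelay(relayHash: String?) {}
    }
    check(notifyAlternativeRelay(liveClient, null))
    println("ok")
}
```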

Root Cause

Binder transactions to IReticulumService (particularly getPathTableHashes, code 19) were taking 39-44 seconds, holding binder threads and filling the 1MB per-process binder buffer. Once saturated, Room's enableMultiInstanceInvalidation() binder IPC also fails with DeadObjectException, causing a crash loop that doesn't self-recover.
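The caching fix boils down to TTL memoization of the slow call. A minimal single-threaded sketch (the real `RoutingManager` additionally serializes callers with `@Synchronized`; `fetch` is a hypothetical stand-in for the Python `get_path_table` bridge call):

```kotlin
// 30s TTL cache around a slow call. On fetch failure, fall back to the stale
// cached value rather than caching the failure, so a transient error does not
// poison the cache for a full TTL window.
class TtlCache(
    private val ttlMs: Long,
    private val clock: () -> Long = System::currentTimeMillis,
    private val fetch: () -> String?, // stand-in for the slow Python bridge call
) {
    private var cached: String? = null
    private var cachedAt = 0L

    fun get(): String {
        val now = clock()
        cached?.let { if (now - cachedAt < ttlMs) return it } // fresh hit: skip slow call
        val result = fetch()
        return if (result != null) {
            cached = result
            cachedAt = now
            result
        } else {
            cached ?: "[]" // error path: stale cache if available, else empty array
        }
    }
}

fun main() {
    var calls = 0
    var fakeNow = 0L
    val cache = TtlCache(ttlMs = 30_000, clock = { fakeNow }, fetch = { calls++; "[\"hash\"]" })
    cache.get(); cache.get() // second call within the TTL is served from cache
    check(calls == 1)
    fakeNow = 31_000         // advance past the TTL
    cache.get()
    check(calls == 2)
    println("ok")
}
```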

Test plan

  • compileNoSentryDebugKotlin builds clean
  • AnnounceStreamViewModelTest passes (exercises getPathTableHashes path)
  • Install on device with large path table (1000+ entries) and verify no binder warnings in logcat
  • Verify identity restoration completes during cold start without ANR

Fixes #647
Fixes COLUMBA-14
Fixes COLUMBA-30

🤖 Generated with Claude Code

Addresses binder IPC saturation that leads to cascading
DeadObjectException crashes in Room's multi-instance invalidation.

1. Catch RemoteException on provideAlternativeRelay() binder callbacks
   to prevent unhandled DeadObjectException fatal crashes (COLUMBA-14,
   COLUMBA-30)

2. Cache getPathTableHashes() results for 30s in RoutingManager to
   avoid blocking binder threads for 39+ seconds on large path tables,
   which saturates the 1MB per-process binder buffer

3. Reduce identity restore batch size from 500 to 200 and add 100ms
   delay between batches to let the binder buffer drain during
   initialization
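Fix 3 can be sketched as a generic batched loop with an inter-batch drain delay (requires kotlinx-coroutines; `restoreInBatches` and `restoreBatch` are illustrative names, not the PR's actual API):

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// Process items in small batches, pausing between batches so the binder
// buffer can drain. restoreBatch stands in for the per-batch restore work.
suspend fun <T> restoreInBatches(
    items: List<T>,
    batchSize: Int = 200,
    drainDelayMs: Long = 100,
    restoreBatch: suspend (List<T>) -> Unit,
) {
    val batches = items.chunked(batchSize)
    batches.forEachIndexed { index, batch ->
        restoreBatch(batch)
        if (index < batches.lastIndex) delay(drainDelayMs) // let the buffer drain
    }
}

fun main() = runBlocking {
    val restored = mutableListOf<Int>()
    restoreInBatches((1..450).toList(), batchSize = 200, drainDelayMs = 1) { batch ->
        restored += batch // 450 items → batches of 200, 200, 50
    }
    check(restored.size == 450)
    println("ok")
}
```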

Fixes #647
Fixes COLUMBA-14
Fixes COLUMBA-30

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR addresses a binder-buffer-exhaustion crash loop (COLUMBA-14, COLUMBA-30) through three coordinated changes: catching RemoteException on the provideAlternativeRelay binder callback, caching getPathTableHashes() results for 30 s to avoid repeated 39-second Python calls, and reducing the identity-restore batch size to 200 with a 100ms inter-batch drain delay. The two issues flagged in earlier review rounds (@Synchronized thread-safety and error-result caching) have both been addressed in this revision.

  • RoutingManager.getPathTableHashes(): @Synchronized correctly serializes concurrent callers and the error path now falls back to stale cache rather than writing an empty result. The residual concern is that on a cold cache miss, waiting coroutines pin their dispatcher threads at the JVM monitor for up to ~39 s; a kotlinx.coroutines.sync.Mutex would suspend rather than block those threads.
  • InterfaceConfigManager.restoreInBatches(): delay(100) is called with its fully-qualified name (kotlinx.coroutines.delay) despite delay already being imported; and the yield() immediately after it is redundant since delay is itself a suspension point that gives the scheduler an opportunity to run other coroutines.
  • ServiceReticulumProtocol: RemoteException handling is correctly ordered before the general Exception catch, and the import of android.os.RemoteException was already present — the change is clean and correct.

Confidence Score: 4/5

  • Safe to merge with minor style improvements recommended; core logic is sound and both previously-raised critical issues are resolved.
  • All three root-cause fixes are logically correct: RemoteException guards prevent the fatal crash loop, the cache avoids repeated binder-exhausting Python calls, and smaller batches with drainage delays reduce peak binder pressure. The only remaining concerns are stylistic — redundant yield() / fully-qualified delay in InterfaceConfigManager, and the preference for a coroutine Mutex over @Synchronized in RoutingManager to avoid blocking IO-dispatcher threads on a cache miss.
  • RoutingManager.kt deserves a second look if getPathTableHashes() is ever called from coroutines on Dispatchers.Default or a constrained thread pool, due to the blocking nature of @Synchronized.

Important Files Changed

- `app/src/main/java/com/lxmf/messenger/service/manager/RoutingManager.kt`: Added 30-second TTL cache for getPathTableHashes() with @Synchronized to prevent repeated 39-second Python calls from saturating the binder buffer; the error path now falls back to stale cache. @Synchronized correctly addresses the previously-flagged race condition but can pin IO-dispatcher threads on a cache miss; a coroutine Mutex would be friendlier to the thread pool.
- `app/src/main/java/com/lxmf/messenger/reticulum/protocol/ServiceReticulumProtocol.kt`: Added RemoteException catch before the general Exception handler in the provideAlternativeRelay callback path, and wrapped the error-recovery provideAlternativeRelay(null) call in its own RemoteException guard. Ordering and logic are correct; android.os.RemoteException is already imported.
- `app/src/main/java/com/lxmf/messenger/service/InterfaceConfigManager.kt`: Default batch size reduced from 500 to 200 and a 100ms inter-batch delay added to let the binder buffer drain. The new delay(100) is referenced by its fully-qualified name despite delay already being imported, and the yield() immediately following it is redundant since delay() is already a suspension point.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller as Caller (ViewModel/Coroutine)
    participant RM as RoutingManager
    participant Cache as In-Memory Cache
    participant PY as Python (Chaquopy)
    participant Binder as Android Binder (1MB buffer)

    Note over Caller,Binder: Cache HIT path (within 30s TTL)
    Caller->>RM: getPathTableHashes()
    RM->>Cache: read cachedPathTableJson
    Cache-->>RM: non-null & fresh
    RM-->>Caller: return cached JSON (fast)

    Note over Caller,Binder: Cache MISS path (cold start / expired)
    Caller->>RM: getPathTableHashes() [@Synchronized — blocks thread]
    RM->>Cache: read cachedPathTableJson
    Cache-->>RM: null or stale
    RM->>PY: get_path_table() [up to 39s]
    PY-->>Binder: large result over IPC
    Binder-->>PY: ok
    PY-->>RM: JSONArray string
    RM->>Cache: write result + timestamp
    RM-->>Caller: return fresh JSON

    Note over Caller,Binder: Error / wrapper null path
    RM->>PY: get_path_table()
    PY-->>RM: Exception / null
    RM->>Cache: read stale cachedPathTableJson
    Cache-->>RM: stale or null
    RM-->>Caller: return stale JSON or "[]"

    Note over Caller,Binder: provideAlternativeRelay binder callback
    Caller->>RM: alternativeRelayHandler(excludeHashes)
    RM-->>Caller: alternativeRelay
    Caller->>Binder: provideAlternativeRelay(alternativeRelay)
    alt Service alive
        Binder-->>Caller: ok
    else DeadObjectException / RemoteException
        Binder-->>Caller: RemoteException (caught, logged, swallowed)
    end
```

Last reviewed commit: 67c3466

- Add @Synchronized to prevent concurrent callers from both bypassing
  the cache and invoking the slow Python call simultaneously
- Only cache successful results; on error return stale cache if
  available, otherwise "[]" — prevents 30s lockout on transient failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@torlando-tech
Owner Author

@greptileai

```kotlin
} else {
    // Brief delay lets the binder buffer drain between batches,
    // preventing buffer saturation on devices with slow Python execution.
    kotlinx.coroutines.delay(100)
```
Contributor


Fully-qualified delay call is redundant given existing import

InterfaceConfigManager.kt already imports kotlinx.coroutines.delay at the top of the file (line 22), so the fully-qualified reference kotlinx.coroutines.delay(100) is unnecessary and inconsistent with how yield() is used just below it.

Suggested change:

```diff
- kotlinx.coroutines.delay(100)
+ delay(100)
```

Comment on lines +428 to 429:

```kotlin
kotlinx.coroutines.delay(100)
yield() // Let GC reclaim previous batch's bridge objects
```
Contributor


yield() is redundant after delay()

kotlinx.coroutines.delay(100) is already a suspension point — it suspends the current coroutine for 100ms, during which the scheduler is free to run any other pending coroutines. Calling yield() immediately after provides no additional benefit; the scheduler already had 100ms to work with. The original purpose of yield() (letting GC reclaim previous batch bridge objects) is already served by the suspension inside delay.

Suggested change:

```diff
- kotlinx.coroutines.delay(100)
- yield() // Let GC reclaim previous batch's bridge objects
+ delay(100) // Let binder buffer drain and GC reclaim previous batch's bridge objects
```

Comment on lines +112 to 113:

```kotlin
@Synchronized
fun getPathTableHashes(): String {
```
Contributor


@Synchronized can starve coroutine dispatcher threads on a cache miss

@Synchronized makes the JVM monitor the RoutingManager instance. If getPathTableHashes() is called concurrently from coroutines running on Dispatchers.IO while the cache is cold/expired, every waiting coroutine pins a thread pool thread at the lock boundary for the full duration of the Python call (up to ~39 s). With enough concurrent callers this can exhaust the Dispatchers.IO thread pool (64 threads by default on Android) and stall other IO operations.

A coroutine-friendly alternative is a kotlinx.coroutines.sync.Mutex, which suspends rather than blocks the thread:

```kotlin
private val cacheMutex = Mutex()

suspend fun getPathTableHashes(): String {
    val now = System.currentTimeMillis()
    val cached = cachedPathTableJson
    if (cached != null && (now - cachedPathTableTimestamp) < PATH_TABLE_CACHE_TTL_MS) {
        return cached
    }
    return cacheMutex.withLock {
        // Re-check inside the lock in case another coroutine already refreshed
        val rechecked = cachedPathTableJson
        if (rechecked != null && (now - cachedPathTableTimestamp) < PATH_TABLE_CACHE_TTL_MS) {
            return@withLock rechecked
        }
        val result = wrapperManager.withWrapper { wrapper ->
            try {
                val pyResult = wrapper.callAttr("get_path_table")
                val hashes = pyResult.asList().map { it.toString() }
                JSONArray(hashes).toString()
            } catch (e: Exception) {
                Log.e(TAG, "Error getting path table hashes", e)
                null
            }
        }
        if (result != null) {
            cachedPathTableJson = result
            cachedPathTableTimestamp = System.currentTimeMillis()
            result
        } else {
            cachedPathTableJson ?: "[]"
        }
    }
}
```

If making the function suspend is not feasible right now, at minimum ensuring it is always dispatched to Dispatchers.IO at the call sites will contain the thread-blocking to the IO pool.


@torlando-tech torlando-tech merged commit 1c78f87 into main Mar 11, 2026
13 checks passed
@torlando-tech torlando-tech deleted the fix/binder-buffer-exhaustion-647 branch March 11, 2026 01:43
@sentry
Contributor

sentry bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



Development

Successfully merging this pull request may close these issues.

libbinder javabinder error:-28(No space left on device) and it does not recover on it own
