
fix: prevent binder buffer exhaustion causing crash loop#652

Merged
torlando-tech merged 2 commits into main from fix/binder-buffer-exhaustion-647 on Mar 11, 2026

Conversation

@torlando-tech
Owner

Summary

  • Catch RemoteException on provideAlternativeRelay() binder callbacks to prevent unhandled DeadObjectException fatal crashes (COLUMBA-14: 95 events/27 users, COLUMBA-30: 4 events/4 users)
  • Cache getPathTableHashes() results for 30s in RoutingManager to avoid blocking binder threads for 39+ seconds on large path tables, which saturates the 1MB per-process binder buffer
  • Reduce identity restore batch size from 500→200 and add 100ms delay between batches to let the binder buffer drain during initialization
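The first fix can be sketched roughly as follows. This is a minimal stand-alone illustration, not the PR's actual code: `RelayCallback` and `notifyAlternativeRelay` are hypothetical names, and the `RemoteException`/`DeadObjectException` classes here are plain-JVM stand-ins for `android.os.RemoteException` and `android.os.DeadObjectException` so the sketch runs outside Android. The key point is that `DeadObjectException` subclasses `RemoteException`, so one catch covers both:

```kotlin
// Plain-JVM stand-ins for android.os.RemoteException / DeadObjectException.
open class RemoteException(message: String) : Exception(message)
class DeadObjectException : RemoteException("binder target died")

// Hypothetical stand-in for the AIDL callback interface.
interface RelayCallback {
    fun provideAlternativeRelay(relayHash: String?)
}

// Guarded binder callback: if the client process died, the binder call throws
// DeadObjectException (a RemoteException subclass). We log and swallow it
// instead of letting it crash the service process.
fun notifyAlternativeRelay(callback: RelayCallback, relayHash: String?): Boolean =
    try {
        callback.provideAlternativeRelay(relayHash)
        true
    } catch (e: RemoteException) {
        println("Client died before provideAlternativeRelay completed: ${e.message}")
        false
    }

fun main() {
    val deadClient = object : RelayCallback {
        override fun provideAlternativeRelay(relayHash: String?) =
            throw DeadObjectException()
    }
    check(!notifyAlternativeRelay(deadClient, "abc123")) // survives a dead binder
    val liveClient = object : RelayCallback {
        override fun provideAlternativeRelay(relayHash: String?) {}
    }
    check(notifyAlternativeRelay(liveClient, null))
    println("ok")
}
```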

Root Cause

Binder transactions to IReticulumService (particularly getPathTableHashes, code 19) were taking 39-44 seconds, holding binder threads and filling the 1MB per-process binder buffer. Once saturated, Room's enableMultiInstanceInvalidation() binder IPC also fails with DeadObjectException, causing a crash loop that doesn't self-recover.
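The caching fix boils down to TTL memoization of the slow call. A minimal single-threaded sketch (the real `RoutingManager` additionally serializes callers with `@Synchronized`; `fetch` is a hypothetical stand-in for the Python `get_path_table` bridge call):

```kotlin
// 30s TTL cache around a slow call. On fetch failure, fall back to the stale
// cached value rather than caching the failure, so a transient error does not
// poison the cache for a full TTL window.
class TtlCache(
    private val ttlMs: Long,
    private val clock: () -> Long = System::currentTimeMillis,
    private val fetch: () -> String?, // stand-in for the slow Python bridge call
) {
    private var cached: String? = null
    private var cachedAt = 0L

    fun get(): String {
        val now = clock()
        cached?.let { if (now - cachedAt < ttlMs) return it } // fresh hit: skip slow call
        val result = fetch()
        return if (result != null) {
            cached = result
            cachedAt = now
            result
        } else {
            cached ?: "[]" // error path: stale cache if available, else empty array
        }
    }
}

fun main() {
    var calls = 0
    var fakeNow = 0L
    val cache = TtlCache(ttlMs = 30_000, clock = { fakeNow }, fetch = { calls++; "[\"hash\"]" })
    cache.get(); cache.get() // second call within the TTL is served from cache
    check(calls == 1)
    fakeNow = 31_000         // advance past the TTL
    cache.get()
    check(calls == 2)
    println("ok")
}
```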

Test plan

  • compileNoSentryDebugKotlin builds clean
  • AnnounceStreamViewModelTest passes (exercises getPathTableHashes path)
  • Install on device with large path table (1000+ entries) and verify no binder warnings in logcat
  • Verify identity restoration completes during cold start without ANR

Fixes #647
Fixes COLUMBA-14
Fixes COLUMBA-30

🤖 Generated with Claude Code

Addresses binder IPC saturation that leads to cascading
DeadObjectException crashes in Room's multi-instance invalidation.

1. Catch RemoteException on provideAlternativeRelay() binder callbacks
   to prevent unhandled DeadObjectException fatal crashes (COLUMBA-14,
   COLUMBA-30)

2. Cache getPathTableHashes() results for 30s in RoutingManager to
   avoid blocking binder threads for 39+ seconds on large path tables,
   which saturates the 1MB per-process binder buffer

3. Reduce identity restore batch size from 500 to 200 and add 100ms
   delay between batches to let the binder buffer drain during
   initialization
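Fix 3 can be sketched as a generic batched loop with an inter-batch drain delay (requires kotlinx-coroutines; `restoreInBatches` and `restoreBatch` are illustrative names, not the PR's actual API):

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// Process items in small batches, pausing between batches so the binder
// buffer can drain. restoreBatch stands in for the per-batch restore work.
suspend fun <T> restoreInBatches(
    items: List<T>,
    batchSize: Int = 200,
    drainDelayMs: Long = 100,
    restoreBatch: suspend (List<T>) -> Unit,
) {
    val batches = items.chunked(batchSize)
    batches.forEachIndexed { index, batch ->
        restoreBatch(batch)
        if (index < batches.lastIndex) delay(drainDelayMs) // let the buffer drain
    }
}

fun main() = runBlocking {
    val restored = mutableListOf<Int>()
    restoreInBatches((1..450).toList(), batchSize = 200, drainDelayMs = 1) { batch ->
        restored += batch // 450 items → batches of 200, 200, 50
    }
    check(restored.size == 450)
    println("ok")
}
```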

Fixes #647
Fixes COLUMBA-14
Fixes COLUMBA-30

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR addresses a binder-buffer-exhaustion crash loop (COLUMBA-14, COLUMBA-30) through three coordinated changes: catching RemoteException on the provideAlternativeRelay binder callback, caching getPathTableHashes() results for 30 s to avoid repeated 39-second Python calls, and reducing the identity-restore batch size to 200 with a 100ms inter-batch drain delay. The two issues flagged in earlier review rounds (@Synchronized thread-safety and error-result caching) have both been addressed in this revision.

  • RoutingManager.getPathTableHashes(): @Synchronized correctly serializes concurrent callers and the error path now falls back to stale cache rather than writing an empty result. The residual concern is that on a cold cache miss, waiting coroutines pin their dispatcher threads at the JVM monitor for up to ~39 s; a kotlinx.coroutines.sync.Mutex would suspend rather than block those threads.
  • InterfaceConfigManager.restoreInBatches(): delay(100) is called with its fully-qualified name (kotlinx.coroutines.delay) despite delay already being imported; and the yield() immediately after it is redundant since delay is itself a suspension point that gives the scheduler an opportunity to run other coroutines.
  • ServiceReticulumProtocol: RemoteException handling is correctly ordered before the general Exception catch, and the import of android.os.RemoteException was already present — the change is clean and correct.

Confidence Score: 4/5

  • Safe to merge with minor style improvements recommended; core logic is sound and both previously-raised critical issues are resolved.
  • All three root-cause fixes are logically correct: RemoteException guards prevent the fatal crash loop, the cache avoids repeated binder-exhausting Python calls, and smaller batches with drainage delays reduce peak binder pressure. The only remaining concerns are stylistic — redundant yield() / fully-qualified delay in InterfaceConfigManager, and the preference for a coroutine Mutex over @Synchronized in RoutingManager to avoid blocking IO-dispatcher threads on a cache miss.
  • RoutingManager.kt deserves a second look if getPathTableHashes() is ever called from coroutines on Dispatchers.Default or a constrained thread pool, due to the blocking nature of @Synchronized.

Important Files Changed

- `app/src/main/java/com/lxmf/messenger/service/manager/RoutingManager.kt`: Added 30-second TTL cache for getPathTableHashes() with @Synchronized to prevent repeated 39-second Python calls from saturating the binder buffer; the error path now falls back to stale cache. @Synchronized correctly addresses the previously-flagged race condition but can pin IO-dispatcher threads on a cache miss; a coroutine Mutex would be friendlier to the thread pool.
- `app/src/main/java/com/lxmf/messenger/reticulum/protocol/ServiceReticulumProtocol.kt`: Added RemoteException catch before the general Exception handler in the provideAlternativeRelay callback path, and wrapped the error-recovery provideAlternativeRelay(null) call in its own RemoteException guard. Ordering and logic are correct; android.os.RemoteException is already imported.
- `app/src/main/java/com/lxmf/messenger/service/InterfaceConfigManager.kt`: Default batch size reduced from 500 to 200 and a 100ms inter-batch delay added to let the binder buffer drain. The new delay(100) is referenced by its fully-qualified name despite delay already being imported, and the yield() immediately following it is redundant since delay() is already a suspension point.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller as Caller (ViewModel/Coroutine)
    participant RM as RoutingManager
    participant Cache as In-Memory Cache
    participant PY as Python (Chaquopy)
    participant Binder as Android Binder (1MB buffer)

    Note over Caller,Binder: Cache HIT path (within 30s TTL)
    Caller->>RM: getPathTableHashes()
    RM->>Cache: read cachedPathTableJson
    Cache-->>RM: non-null & fresh
    RM-->>Caller: return cached JSON (fast)

    Note over Caller,Binder: Cache MISS path (cold start / expired)
    Caller->>RM: getPathTableHashes() [@Synchronized — blocks thread]
    RM->>Cache: read cachedPathTableJson
    Cache-->>RM: null or stale
    RM->>PY: get_path_table() [up to 39s]
    PY-->>Binder: large result over IPC
    Binder-->>PY: ok
    PY-->>RM: JSONArray string
    RM->>Cache: write result + timestamp
    RM-->>Caller: return fresh JSON

    Note over Caller,Binder: Error / wrapper null path
    RM->>PY: get_path_table()
    PY-->>RM: Exception / null
    RM->>Cache: read stale cachedPathTableJson
    Cache-->>RM: stale or null
    RM-->>Caller: return stale JSON or "[]"

    Note over Caller,Binder: provideAlternativeRelay binder callback
    Caller->>RM: alternativeRelayHandler(excludeHashes)
    RM-->>Caller: alternativeRelay
    Caller->>Binder: provideAlternativeRelay(alternativeRelay)
    alt Service alive
        Binder-->>Caller: ok
    else DeadObjectException / RemoteException
        Binder-->>Caller: RemoteException (caught, logged, swallowed)
    end
```

Last reviewed commit: 67c3466

- Add @Synchronized to prevent concurrent callers from both bypassing
  the cache and invoking the slow Python call simultaneously
- Only cache successful results; on error return stale cache if
  available, otherwise "[]" — prevents 30s lockout on transient failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@torlando-tech
Owner Author

@greptileai

```kotlin
} else {
    // Brief delay lets the binder buffer drain between batches,
    // preventing buffer saturation on devices with slow Python execution.
    kotlinx.coroutines.delay(100)
```
Contributor


Fully-qualified delay call is redundant given existing import

InterfaceConfigManager.kt already imports kotlinx.coroutines.delay at the top of the file (line 22), so the fully-qualified reference kotlinx.coroutines.delay(100) is unnecessary and inconsistent with how yield() is used just below it.

Suggested change:

```diff
- kotlinx.coroutines.delay(100)
+ delay(100)
```

Comment on lines +428 to 429:

```kotlin
kotlinx.coroutines.delay(100)
yield() // Let GC reclaim previous batch's bridge objects
```
Contributor


yield() is redundant after delay()

kotlinx.coroutines.delay(100) is already a suspension point — it suspends the current coroutine for 100ms, during which the scheduler is free to run any other pending coroutines. Calling yield() immediately after provides no additional benefit; the scheduler already had 100ms to work with. The original purpose of yield() (letting GC reclaim previous batch bridge objects) is already served by the suspension inside delay.

Suggested change:

```diff
- kotlinx.coroutines.delay(100)
- yield() // Let GC reclaim previous batch's bridge objects
+ delay(100) // Let binder buffer drain and GC reclaim previous batch's bridge objects
```

Comment on lines +112 to 113:

```kotlin
@Synchronized
fun getPathTableHashes(): String {
```
Contributor


@Synchronized can starve coroutine dispatcher threads on a cache miss

@Synchronized makes the JVM monitor the RoutingManager instance. If getPathTableHashes() is called concurrently from coroutines running on Dispatchers.IO while the cache is cold/expired, every waiting coroutine pins a thread pool thread at the lock boundary for the full duration of the Python call (up to ~39 s). With enough concurrent callers this can exhaust the Dispatchers.IO thread pool (64 threads by default on Android) and stall other IO operations.

A coroutine-friendly alternative is a kotlinx.coroutines.sync.Mutex, which suspends rather than blocks the thread:

```kotlin
private val cacheMutex = Mutex()

suspend fun getPathTableHashes(): String {
    val now = System.currentTimeMillis()
    val cached = cachedPathTableJson
    if (cached != null && (now - cachedPathTableTimestamp) < PATH_TABLE_CACHE_TTL_MS) {
        return cached
    }
    return cacheMutex.withLock {
        // Re-check inside the lock in case another coroutine already refreshed
        val rechecked = cachedPathTableJson
        if (rechecked != null && (now - cachedPathTableTimestamp) < PATH_TABLE_CACHE_TTL_MS) {
            return@withLock rechecked
        }
        val result = wrapperManager.withWrapper { wrapper ->
            try {
                val pyResult = wrapper.callAttr("get_path_table")
                val hashes = pyResult.asList().map { it.toString() }
                JSONArray(hashes).toString()
            } catch (e: Exception) {
                Log.e(TAG, "Error getting path table hashes", e)
                null
            }
        }
        if (result != null) {
            cachedPathTableJson = result
            cachedPathTableTimestamp = System.currentTimeMillis()
            result
        } else {
            cachedPathTableJson ?: "[]"
        }
    }
}
```

If making the function suspend is not feasible right now, at minimum ensuring it is always dispatched to Dispatchers.IO at the call sites will contain the thread-blocking to the IO pool.


@torlando-tech torlando-tech merged commit 1c78f87 into main Mar 11, 2026
13 checks passed
@torlando-tech torlando-tech deleted the fix/binder-buffer-exhaustion-647 branch March 11, 2026 01:43
@sentry
Contributor

sentry bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



Development

Successfully merging this pull request may close these issues.

libbinder javabinder error:-28(No space left on device) and it does not recover on it own
