
fix: prevent binder buffer exhaustion causing crash loop #654

Merged
torlando-tech merged 2 commits into release/v0.9.x from fix/binder-buffer-exhaustion-647-release
Mar 11, 2026

Conversation

@torlando-tech (Owner)

Summary

  • Catch RemoteException on provideAlternativeRelay() binder callbacks to prevent unhandled DeadObjectException fatal crashes (COLUMBA-14: 95 events/27 users, COLUMBA-30: 4 events/4 users)
  • Cache getPathTableHashes() results for 30s in RoutingManager (thread-safe, errors not cached) to avoid blocking binder threads for 39+ seconds on large path tables, which saturates the 1MB per-process binder buffer
  • Reduce identity restore batch size from 500→200 and add 100ms delay between batches to let the binder buffer drain during initialization
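
The guard in the first bullet follows the usual pattern for binder callbacks into a process that may have died. A minimal, self-contained sketch of the idea (`RelayService` and `deliverAlternativeRelay` are illustrative stand-ins, since `android.os.RemoteException` and `IReticulumService` are not available off-device):

```kotlin
// Stand-in for android.os.RemoteException; DeadObjectException is a
// subclass of it, so catching RemoteException covers the dead-service case.
class RemoteException(msg: String) : Exception(msg)

// Stand-in for the IReticulumService binder interface.
interface RelayService {
    fun provideAlternativeRelay(relay: String)
}

// Returns false (instead of crashing) when the binder target is dead.
fun deliverAlternativeRelay(service: RelayService, relay: String): Boolean =
    try {
        service.provideAlternativeRelay(relay)
        true
    } catch (e: RemoteException) {
        // Service process died mid-call: log and drop the callback.
        println("Binder target dead, dropping relay callback: ${e.message}")
        false
    }

val liveService: RelayService = object : RelayService {
    override fun provideAlternativeRelay(relay: String) { /* delivered */ }
}

val deadService: RelayService = object : RelayService {
    override fun provideAlternativeRelay(relay: String) =
        throw RemoteException("binder died")
}

fun main() {
    check(deliverAlternativeRelay(liveService, "relay-1"))
    check(!deliverAlternativeRelay(deadService, "relay-2"))
    println("ok")
}
```

The key point is that the catch must come before any generic `Exception` handler that rethrows, so a dead binder target degrades to a logged warning rather than a fatal crash.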

Root Cause

Binder transactions to IReticulumService (particularly getPathTableHashes, code 19) were taking 39-44 seconds, holding binder threads and filling the 1MB per-process binder buffer. Once saturated, Room's enableMultiInstanceInvalidation() binder IPC also fails with DeadObjectException, causing a crash loop that doesn't self-recover.
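
The batching change gives the saturated buffer time to drain between transactions. A small sketch of the pattern, with hypothetical names (the real logic lives in InterfaceConfigManager and suspends with kotlinx.coroutines.delay rather than blocking a thread):

```kotlin
// Illustrative sketch of batched restore with an inter-batch pause.
// restoreInBatches and its parameters are hypothetical; the production
// code uses kotlinx.coroutines.delay(100) inside a coroutine.
fun <T> restoreInBatches(
    items: List<T>,
    batchSize: Int = 200,   // reduced from 500
    pauseMs: Long = 100L,   // lets the 1 MB binder buffer drain
    restore: (List<T>) -> Unit,
) {
    var offset = 0
    while (offset < items.size) {
        val batch = items.subList(offset, minOf(offset + batchSize, items.size))
        restore(batch)
        offset += batch.size
        // Pause between batches (but not after the last one).
        if (offset < items.size) Thread.sleep(pauseMs)
    }
}

fun main() {
    val batchSizes = mutableListOf<Int>()
    restoreInBatches((1..450).toList(), batchSize = 200, pauseMs = 0L) {
        batchSizes += it.size
    }
    check(batchSizes == listOf(200, 200, 50))
    println("ok")
}
```

Smaller batches mean each binder transaction carries less payload, and the pause bounds how fast transactions can stack up in the per-process buffer during cold-start initialization.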

Cherry-pick of ba43d47, 67c3466 from fix/binder-buffer-exhaustion-647 (PR #652).

Test plan

  • compileNoSentryDebugKotlin builds clean
  • AnnounceStreamViewModelTest passes
  • Install on device with large path table (1000+ entries) and verify no binder warnings in logcat
  • Verify identity restoration completes during cold start without ANR

Fixes #647
Fixes COLUMBA-14
Fixes COLUMBA-30

🤖 Generated with Claude Code

torlando-tech and others added 2 commits March 10, 2026 21:39
Addresses binder IPC saturation that leads to cascading
DeadObjectException crashes in Room's multi-instance invalidation.

1. Catch RemoteException on provideAlternativeRelay() binder callbacks
   to prevent unhandled DeadObjectException fatal crashes (COLUMBA-14,
   COLUMBA-30)

2. Cache getPathTableHashes() results for 30s in RoutingManager to
   avoid blocking binder threads for 39+ seconds on large path tables,
   which saturates the 1MB per-process binder buffer

3. Reduce identity restore batch size from 500 to 200 and add 100ms
   delay between batches to let the binder buffer drain during
   initialization

Fixes #647
Fixes COLUMBA-14
Fixes COLUMBA-30

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add @synchronized to prevent concurrent callers from both bypassing
  the cache and invoking the slow Python call simultaneously
- Only cache successful results; on error return stale cache if
  available, otherwise "[]" — prevents 30s lockout on transient failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greptile-apps (Contributor)

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR addresses a binder buffer exhaustion crash loop by: (1) catching RemoteException on binder callbacks in ServiceReticulumProtocol, (2) caching slow getPathTableHashes() Python calls for 30 s in RoutingManager, and (3) reducing identity-restore batch size and adding inter-batch delays in InterfaceConfigManager. The fixes are well-targeted and directly address the crash scenario reported in COLUMBA-14 and COLUMBA-30.

Key changes:

  • ServiceReticulumProtocol: RemoteException is now caught before the generic Exception handler in the provideAlternativeRelay callback, and the error-recovery binder call is also guarded — cleanly breaking the crash loop.
  • RoutingManager: A 30-second TTL cache is added to getPathTableHashes(). The cache logic is correct, but @Synchronized holds the lock for the entire Python call (up to 39 s), meaning concurrent callers on a cold-cache event will all block on the lock — potentially re-introducing binder-thread exhaustion at TTL-expiry boundaries. A double-checked pattern that releases the lock before the slow Python call would be safer.
  • InterfaceConfigManager: Batch size reduction (500→200) and the 100 ms inter-batch delay() are straightforward and low-risk; delay() and yield() order is worth revisiting to get the GC hint out earlier.

Confidence Score: 3/5

  • Safe to merge for the crash-loop fix, but the @Synchronized-over-slow-call pattern in RoutingManager may not fully prevent binder thread exhaustion under concurrent cache-miss conditions.
  • The RemoteException guard in ServiceReticulumProtocol is correct and directly fixes the crash. The batch-size / delay changes in InterfaceConfigManager are low-risk. The main concern is RoutingManager.getPathTableHashes(): @Synchronized ensures only one Python call at a time but keeps every concurrent caller blocked on the lock for the full 39-second Python call duration. In a binder-thread context this could still saturate the 16-thread binder pool on the first call after each 30-second TTL window, though far less severely than before. The fix is a significant improvement, but the synchronization strategy warrants attention before treating the binder exhaustion as fully resolved.
  • RoutingManager.kt — the @Synchronized locking strategy for getPathTableHashes() deserves a second look to ensure lock is not held during the slow Python call.

Important Files Changed

  • app/src/main/java/com/lxmf/messenger/service/manager/RoutingManager.kt: Adds a 30-second cache with a @Synchronized guard to getPathTableHashes(). The caching logic is sound, but @Synchronized holds the lock for the entire duration of the slow Python call, so concurrent callers still block for up to 39 s on a cold-cache event, risking binder thread pool exhaustion. @Volatile on the private fields is also redundant given the method-level synchronization.
  • app/src/main/java/com/lxmf/messenger/reticulum/protocol/ServiceReticulumProtocol.kt: Correctly catches RemoteException before the generic Exception handler in the provideAlternativeRelay callback, and wraps the error-recovery binder call in its own try-catch, cleanly preventing the unhandled DeadObjectException crash loop described in COLUMBA-14 / COLUMBA-30.
  • app/src/main/java/com/lxmf/messenger/service/InterfaceConfigManager.kt: Reduces the restoreInBatches default batch size from 500 to 200 and adds a 100 ms inter-batch delay to let the binder buffer drain. The change is well-motivated and safe; the only minor concern is that delay() precedes yield(), slightly delaying the GC hint for Chaquopy bridge objects.

Sequence Diagram

```mermaid
sequenceDiagram
    participant BT as Binder Thread
    participant RM as RoutingManager
    participant Cache as Path Table Cache
    participant PY as Python (get_path_table)

    Note over BT,PY: Before fix — every call blocks binder thread 39+ s
    BT->>RM: getPathTableHashes()
    RM->>PY: callAttr("get_path_table")
    PY-->>RM: result (39 s later)
    RM-->>BT: JSON string

    Note over BT,PY: After fix — cache hit (TTL 30 s)
    BT->>RM: getPathTableHashes()
    RM->>Cache: check (now - ts < 30 s?)
    Cache-->>RM: hit → return cached JSON
    RM-->>BT: JSON string (fast)

    Note over BT,PY: After fix — cache miss (first call / TTL expired)
    BT->>RM: getPathTableHashes() [@Synchronized]
    RM->>Cache: check → miss
    RM->>PY: callAttr("get_path_table")
    Note right of BT: Other callers BLOCK on lock<br/>for full Python call duration
    PY-->>RM: result
    RM->>Cache: update cachedPathTableJson + timestamp
    RM-->>BT: JSON string

    Note over BT,PY: ServiceReticulumProtocol — RemoteException guard
    BT->>ServiceReticulumProtocol: onAlternativeRelayRequested()
    ServiceReticulumProtocol->>Handler: handler(excludeHashes)
    Handler-->>ServiceReticulumProtocol: alternativeRelay
    ServiceReticulumProtocol->>IReticulumService: provideAlternativeRelay(relay)
    IReticulumService--xServiceReticulumProtocol: RemoteException (service dead)
    ServiceReticulumProtocol->>ServiceReticulumProtocol: catch RemoteException → log + return (no crash)
```

Last reviewed commit: fe689ec

Comment on lines +112 to +138
```diff
+    @Synchronized
     fun getPathTableHashes(): String {
-        return wrapperManager.withWrapper { wrapper ->
-            try {
-                val result = wrapper.callAttr("get_path_table")
-                // Convert Python list to JSON array
-                val hashes = result.asList().map { it.toString() }
-                JSONArray(hashes).toString()
-            } catch (e: Exception) {
-                Log.e(TAG, "Error getting path table hashes", e)
-                "[]"
-            }
-        } ?: run {
-            Log.w(TAG, "getPathTableHashes called but wrapper is null")
-            "[]"
-        }
+        val now = System.currentTimeMillis()
+        val cached = cachedPathTableJson
+        if (cached != null && (now - cachedPathTableTimestamp) < PATH_TABLE_CACHE_TTL_MS) {
+            return cached
+        }
+
+        val result =
+            wrapperManager.withWrapper { wrapper ->
+                try {
+                    val pyResult = wrapper.callAttr("get_path_table")
+                    val hashes = pyResult.asList().map { it.toString() }
+                    JSONArray(hashes).toString()
+                } catch (e: Exception) {
+                    Log.e(TAG, "Error getting path table hashes", e)
+                    null
+                }
+            }
+
+        if (result != null) {
+            cachedPathTableJson = result
+            cachedPathTableTimestamp = System.currentTimeMillis()
+            return result
+        }
+        // Return stale cache on error if available, otherwise empty
+        return cached ?: "[]"
     }
```

@Synchronized holds lock for entire Python call duration

@Synchronized acquires the lock on this for the entire body of getPathTableHashes(), including the wrapperManager.withWrapper Python call that can take 39+ seconds. If multiple binder threads call this method concurrently after the TTL expires, every caller serializes on the lock and blocks for the full duration of the Python call. With a 16-thread binder pool, all 16 threads could be stuck waiting here — reproducing the very binder-thread exhaustion the cache is meant to prevent, albeit less frequently (once every 30 s instead of on every call).

A safer pattern is to perform the slow call outside any lock, and only hold the lock for the cheap cache read/write:

```kotlin
fun getPathTableHashes(): String {
    // Fast path: read cache without holding lock long
    val now = System.currentTimeMillis()
    synchronized(this) {
        val c = cachedPathTableJson
        if (c != null && (now - cachedPathTableTimestamp) < PATH_TABLE_CACHE_TTL_MS) return c
    }

    // Slow path: Python call with NO lock held — threads may race here, but
    // at worst two concurrent callers do one Python call each (not 16).
    val fresh = wrapperManager.withWrapper { wrapper ->
        try {
            JSONArray(wrapper.callAttr("get_path_table").asList().map { it.toString() }).toString()
        } catch (e: Exception) {
            Log.e(TAG, "Error getting path table hashes", e)
            null
        }
    }

    return synchronized(this) {
        if (fresh != null) {
            cachedPathTableJson = fresh
            cachedPathTableTimestamp = System.currentTimeMillis()
            fresh
        } else {
            cachedPathTableJson ?: "[]"
        }
    }
}
```

This allows at most a small burst of duplicate Python calls on a cold-cache event, but no thread waits more than microseconds to acquire the lock.


Comment on lines +33 to +35
```kotlin
@Volatile private var cachedPathTableJson: String? = null

@Volatile private var cachedPathTableTimestamp: Long = 0L
```

@Volatile is redundant given @Synchronized

cachedPathTableJson and cachedPathTableTimestamp are only ever read and written inside getPathTableHashes(), which is annotated @Synchronized. The synchronized block already provides the happens-before relationship and full memory visibility that @Volatile would add, so the annotations are redundant. Consider removing @Volatile to reduce visual noise and avoid implying these fields are safe to access unsynchronized from other code paths.

Suggested change

```diff
-@Volatile private var cachedPathTableJson: String? = null
-
-@Volatile private var cachedPathTableTimestamp: Long = 0L
+// Cached path table result and timestamp to avoid repeated slow Python calls
+private var cachedPathTableJson: String? = null
+
+private var cachedPathTableTimestamp: Long = 0L
```

Comment on lines +426 to 430
```kotlin
// Brief delay lets the binder buffer drain between batches,
// preventing buffer saturation on devices with slow Python execution.
kotlinx.coroutines.delay(100)
yield() // Let GC reclaim previous batch's bridge objects
fetchBatch(batchSize, offset)
```

yield() should precede delay() for GC timeliness

The comment directly above yield() says it exists to "let GC reclaim previous batch's bridge objects," but the 100 ms delay() runs first, keeping those Chaquopy bridge objects live for an additional 100 ms before the GC hint is issued. While delay() does suspend the coroutine and allows other work to run, it does not emit a GC hint. Swapping the order ensures the GC suggestion fires immediately after the batch is processed, and the buffer-drain pause follows:

Suggested change

```diff
-// Brief delay lets the binder buffer drain between batches,
-// preventing buffer saturation on devices with slow Python execution.
-kotlinx.coroutines.delay(100)
-yield() // Let GC reclaim previous batch's bridge objects
+yield() // Let GC reclaim previous batch's bridge objects
+// Brief delay lets the binder buffer drain between batches,
+// preventing buffer saturation on devices with slow Python execution.
+kotlinx.coroutines.delay(100)
 fetchBatch(batchSize, offset)
```

@torlando-tech torlando-tech merged commit 8e81e15 into release/v0.9.x Mar 11, 2026
2 checks passed
@torlando-tech torlando-tech deleted the fix/binder-buffer-exhaustion-647-release branch March 11, 2026 01:48