
fix: prevent binder buffer exhaustion causing crash loop #654

Merged
torlando-tech merged 2 commits into release/v0.9.x from fix/binder-buffer-exhaustion-647-release
Mar 11, 2026

Conversation

@torlando-tech (Owner)

Summary

  • Catch RemoteException on provideAlternativeRelay() binder callbacks to prevent unhandled DeadObjectException fatal crashes (COLUMBA-14: 95 events/27 users, COLUMBA-30: 4 events/4 users)
  • Cache getPathTableHashes() results for 30s in RoutingManager (thread-safe, errors not cached) to avoid blocking binder threads for 39+ seconds on large path tables, which saturates the 1MB per-process binder buffer
  • Reduce identity restore batch size from 500→200 and add 100ms delay between batches to let the binder buffer drain during initialization
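
The guard in the first bullet follows the usual pattern for binder callbacks into a process that may have died. A minimal, self-contained sketch of the idea (`RelayService` and `deliverAlternativeRelay` are illustrative stand-ins, since `android.os.RemoteException` and `IReticulumService` are not available off-device):

```kotlin
// Stand-in for android.os.RemoteException; DeadObjectException is a
// subclass of it, so catching RemoteException covers the dead-service case.
class RemoteException(msg: String) : Exception(msg)

// Stand-in for the IReticulumService binder interface.
interface RelayService {
    fun provideAlternativeRelay(relay: String)
}

// Returns false (instead of crashing) when the binder target is dead.
fun deliverAlternativeRelay(service: RelayService, relay: String): Boolean =
    try {
        service.provideAlternativeRelay(relay)
        true
    } catch (e: RemoteException) {
        // Service process died mid-call: log and drop the callback.
        println("Binder target dead, dropping relay callback: ${e.message}")
        false
    }

val liveService: RelayService = object : RelayService {
    override fun provideAlternativeRelay(relay: String) { /* delivered */ }
}

val deadService: RelayService = object : RelayService {
    override fun provideAlternativeRelay(relay: String) =
        throw RemoteException("binder died")
}

fun main() {
    check(deliverAlternativeRelay(liveService, "relay-1"))
    check(!deliverAlternativeRelay(deadService, "relay-2"))
    println("ok")
}
```

The key point is that the catch must come before any generic `Exception` handler that rethrows, so a dead binder target degrades to a logged warning rather than a fatal crash.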

Root Cause

Binder transactions to IReticulumService (particularly getPathTableHashes, code 19) were taking 39-44 seconds, holding binder threads and filling the 1MB per-process binder buffer. Once saturated, Room's enableMultiInstanceInvalidation() binder IPC also fails with DeadObjectException, causing a crash loop that doesn't self-recover.
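
The batching change gives the saturated buffer time to drain between transactions. A small sketch of the pattern, with hypothetical names (the real logic lives in InterfaceConfigManager and suspends with kotlinx.coroutines.delay rather than blocking a thread):

```kotlin
// Illustrative sketch of batched restore with an inter-batch pause.
// restoreInBatches and its parameters are hypothetical; the production
// code uses kotlinx.coroutines.delay(100) inside a coroutine.
fun <T> restoreInBatches(
    items: List<T>,
    batchSize: Int = 200,   // reduced from 500
    pauseMs: Long = 100L,   // lets the 1 MB binder buffer drain
    restore: (List<T>) -> Unit,
) {
    var offset = 0
    while (offset < items.size) {
        val batch = items.subList(offset, minOf(offset + batchSize, items.size))
        restore(batch)
        offset += batch.size
        // Pause between batches (but not after the last one).
        if (offset < items.size) Thread.sleep(pauseMs)
    }
}

fun main() {
    val batchSizes = mutableListOf<Int>()
    restoreInBatches((1..450).toList(), batchSize = 200, pauseMs = 0L) {
        batchSizes += it.size
    }
    check(batchSizes == listOf(200, 200, 50))
    println("ok")
}
```

Smaller batches mean each binder transaction carries less payload, and the pause bounds how fast transactions can stack up in the per-process buffer during cold-start initialization.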

Cherry-pick of ba43d47, 67c3466 from fix/binder-buffer-exhaustion-647 (PR #652).

Test plan

  • compileNoSentryDebugKotlin builds clean
  • AnnounceStreamViewModelTest passes
  • Install on device with large path table (1000+ entries) and verify no binder warnings in logcat
  • Verify identity restoration completes during cold start without ANR

Fixes #647
Fixes COLUMBA-14
Fixes COLUMBA-30

🤖 Generated with Claude Code

torlando-tech and others added 2 commits March 10, 2026 21:39
Addresses binder IPC saturation that leads to cascading
DeadObjectException crashes in Room's multi-instance invalidation.

1. Catch RemoteException on provideAlternativeRelay() binder callbacks
   to prevent unhandled DeadObjectException fatal crashes (COLUMBA-14,
   COLUMBA-30)

2. Cache getPathTableHashes() results for 30s in RoutingManager to
   avoid blocking binder threads for 39+ seconds on large path tables,
   which saturates the 1MB per-process binder buffer

3. Reduce identity restore batch size from 500 to 200 and add 100ms
   delay between batches to let the binder buffer drain during
   initialization

Fixes #647
Fixes COLUMBA-14
Fixes COLUMBA-30

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add @synchronized to prevent concurrent callers from both bypassing
  the cache and invoking the slow Python call simultaneously
- Only cache successful results; on error return stale cache if
  available, otherwise "[]" — prevents 30s lockout on transient failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greptile-apps (Contributor)

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR addresses a binder buffer exhaustion crash loop by: (1) catching RemoteException on binder callbacks in ServiceReticulumProtocol, (2) caching slow getPathTableHashes() Python calls for 30 s in RoutingManager, and (3) reducing identity-restore batch size and adding inter-batch delays in InterfaceConfigManager. The fixes are well-targeted and directly address the crash scenario reported in COLUMBA-14 and COLUMBA-30.

Key changes:

  • ServiceReticulumProtocol: RemoteException is now caught before the generic Exception handler in the provideAlternativeRelay callback, and the error-recovery binder call is also guarded — cleanly breaking the crash loop.
  • RoutingManager: A 30-second TTL cache is added to getPathTableHashes(). The cache logic is correct, but @Synchronized holds the lock for the entire Python call (up to 39 s), meaning concurrent callers on a cold-cache event will all block on the lock — potentially re-introducing binder-thread exhaustion at TTL-expiry boundaries. A double-checked pattern that releases the lock before the slow Python call would be safer.
  • InterfaceConfigManager: Batch size reduction (500→200) and the 100 ms inter-batch delay() are straightforward and low-risk; delay() and yield() order is worth revisiting to get the GC hint out earlier.

Confidence Score: 3/5

  • Safe to merge for the crash-loop fix, but the @Synchronized-over-slow-call pattern in RoutingManager may not fully prevent binder thread exhaustion under concurrent cache-miss conditions.
  • The RemoteException guard in ServiceReticulumProtocol is correct and directly fixes the crash. The batch-size / delay changes in InterfaceConfigManager are low-risk. The main concern is RoutingManager.getPathTableHashes(): @Synchronized ensures only one Python call at a time but keeps every concurrent caller blocked on the lock for the full 39-second Python call duration. In a binder-thread context this could still saturate the 16-thread binder pool on the first call after each 30-second TTL window, though far less severely than before. The fix is a significant improvement, but the synchronization strategy warrants attention before treating the binder exhaustion as fully resolved.
  • RoutingManager.kt — the @Synchronized locking strategy for getPathTableHashes() deserves a second look to ensure lock is not held during the slow Python call.

Important Files Changed

  • app/src/main/java/com/lxmf/messenger/service/manager/RoutingManager.kt: Adds a 30-second cache with a @Synchronized guard to getPathTableHashes(). The caching logic is sound, but @Synchronized holds the lock for the entire duration of the slow Python call, so concurrent callers still block for up to 39 s on a cold-cache event, risking binder thread pool exhaustion. @Volatile on the private fields is also redundant given the method-level synchronization.
  • app/src/main/java/com/lxmf/messenger/reticulum/protocol/ServiceReticulumProtocol.kt: Correctly catches RemoteException before the generic Exception handler in the provideAlternativeRelay callback, and wraps the error-recovery binder call in its own try-catch, cleanly preventing the unhandled DeadObjectException crash loop described in COLUMBA-14 / COLUMBA-30.
  • app/src/main/java/com/lxmf/messenger/service/InterfaceConfigManager.kt: Reduces the restoreInBatches default batch size from 500 to 200 and adds a 100 ms inter-batch delay to let the binder buffer drain. The change is well-motivated and safe; the only minor concern is that delay() precedes yield(), slightly delaying the GC hint for Chaquopy bridge objects.

Sequence Diagram

```mermaid
sequenceDiagram
    participant BT as Binder Thread
    participant RM as RoutingManager
    participant Cache as Path Table Cache
    participant PY as Python (get_path_table)

    Note over BT,PY: Before fix — every call blocks binder thread 39+ s
    BT->>RM: getPathTableHashes()
    RM->>PY: callAttr("get_path_table")
    PY-->>RM: result (39 s later)
    RM-->>BT: JSON string

    Note over BT,PY: After fix — cache hit (TTL 30 s)
    BT->>RM: getPathTableHashes()
    RM->>Cache: check (now - ts < 30 s?)
    Cache-->>RM: hit → return cached JSON
    RM-->>BT: JSON string (fast)

    Note over BT,PY: After fix — cache miss (first call / TTL expired)
    BT->>RM: getPathTableHashes() [@Synchronized]
    RM->>Cache: check → miss
    RM->>PY: callAttr("get_path_table")
    Note right of BT: Other callers BLOCK on lock<br/>for full Python call duration
    PY-->>RM: result
    RM->>Cache: update cachedPathTableJson + timestamp
    RM-->>BT: JSON string

    Note over BT,PY: ServiceReticulumProtocol — RemoteException guard
    BT->>ServiceReticulumProtocol: onAlternativeRelayRequested()
    ServiceReticulumProtocol->>Handler: handler(excludeHashes)
    Handler-->>ServiceReticulumProtocol: alternativeRelay
    ServiceReticulumProtocol->>IReticulumService: provideAlternativeRelay(relay)
    IReticulumService--xServiceReticulumProtocol: RemoteException (service dead)
    ServiceReticulumProtocol->>ServiceReticulumProtocol: catch RemoteException → log + return (no crash)
```

Last reviewed commit: fe689ec

Comment on lines +112 to +138
```diff
+    @Synchronized
     fun getPathTableHashes(): String {
-        return wrapperManager.withWrapper { wrapper ->
-            try {
-                val result = wrapper.callAttr("get_path_table")
-                // Convert Python list to JSON array
-                val hashes = result.asList().map { it.toString() }
-                JSONArray(hashes).toString()
-            } catch (e: Exception) {
-                Log.e(TAG, "Error getting path table hashes", e)
-                "[]"
-            }
-        } ?: run {
-            Log.w(TAG, "getPathTableHashes called but wrapper is null")
-            "[]"
-        }
+        val now = System.currentTimeMillis()
+        val cached = cachedPathTableJson
+        if (cached != null && (now - cachedPathTableTimestamp) < PATH_TABLE_CACHE_TTL_MS) {
+            return cached
+        }
+
+        val result =
+            wrapperManager.withWrapper { wrapper ->
+                try {
+                    val pyResult = wrapper.callAttr("get_path_table")
+                    val hashes = pyResult.asList().map { it.toString() }
+                    JSONArray(hashes).toString()
+                } catch (e: Exception) {
+                    Log.e(TAG, "Error getting path table hashes", e)
+                    null
+                }
+            }
+
+        if (result != null) {
+            cachedPathTableJson = result
+            cachedPathTableTimestamp = System.currentTimeMillis()
+            return result
+        }
+        // Return stale cache on error if available, otherwise empty
+        return cached ?: "[]"
     }
```

@Synchronized holds lock for entire Python call duration

@Synchronized acquires the lock on this for the entire body of getPathTableHashes(), including the wrapperManager.withWrapper Python call that can take 39+ seconds. If multiple binder threads call this method concurrently after the TTL expires, every caller serializes on the lock and blocks for the full duration of the Python call. With a 16-thread binder pool, all 16 threads could be stuck waiting here — reproducing the very binder-thread exhaustion the cache is meant to prevent, albeit less frequently (once every 30 s instead of on every call).

A safer pattern is to perform the slow call outside any lock, and only hold the lock for the cheap cache read/write:

```kotlin
fun getPathTableHashes(): String {
    // Fast path: read cache without holding lock long
    val now = System.currentTimeMillis()
    synchronized(this) {
        val c = cachedPathTableJson
        if (c != null && (now - cachedPathTableTimestamp) < PATH_TABLE_CACHE_TTL_MS) return c
    }

    // Slow path: Python call with NO lock held — threads may race here, but
    // at worst two concurrent callers do one Python call each (not 16).
    val fresh = wrapperManager.withWrapper { wrapper ->
        try {
            JSONArray(wrapper.callAttr("get_path_table").asList().map { it.toString() }).toString()
        } catch (e: Exception) {
            Log.e(TAG, "Error getting path table hashes", e)
            null
        }
    }

    return synchronized(this) {
        if (fresh != null) {
            cachedPathTableJson = fresh
            cachedPathTableTimestamp = System.currentTimeMillis()
            fresh
        } else {
            cachedPathTableJson ?: "[]"
        }
    }
}
```

This allows at most a small burst of duplicate Python calls on a cold-cache event, but no thread waits more than microseconds to acquire the lock.


Comment on lines +33 to +35
```kotlin
@Volatile private var cachedPathTableJson: String? = null

@Volatile private var cachedPathTableTimestamp: Long = 0L
```

@Volatile is redundant given @Synchronized

cachedPathTableJson and cachedPathTableTimestamp are only ever read and written inside getPathTableHashes(), which is annotated @Synchronized. The synchronized block already provides the happens-before relationship and full memory visibility that @Volatile would add, so the annotations are redundant. Consider removing @Volatile to reduce visual noise and avoid implying these fields are safe to access unsynchronized from other code paths.

Suggested change

```diff
-@Volatile private var cachedPathTableJson: String? = null
-
-@Volatile private var cachedPathTableTimestamp: Long = 0L
+// Cached path table result and timestamp to avoid repeated slow Python calls
+private var cachedPathTableJson: String? = null
+
+private var cachedPathTableTimestamp: Long = 0L
```

Comment on lines +426 to 430
```kotlin
// Brief delay lets the binder buffer drain between batches,
// preventing buffer saturation on devices with slow Python execution.
kotlinx.coroutines.delay(100)
yield() // Let GC reclaim previous batch's bridge objects
fetchBatch(batchSize, offset)
```

yield() should precede delay() for GC timeliness

The comment directly above yield() says it exists to "let GC reclaim previous batch's bridge objects," but the 100 ms delay() runs first, keeping those Chaquopy bridge objects live for an additional 100 ms before the GC hint is issued. While delay() does suspend the coroutine and allows other work to run, it does not emit a GC hint. Swapping the order ensures the GC suggestion fires immediately after the batch is processed, and the buffer-drain pause follows:

Suggested change

```diff
-// Brief delay lets the binder buffer drain between batches,
-// preventing buffer saturation on devices with slow Python execution.
-kotlinx.coroutines.delay(100)
-yield() // Let GC reclaim previous batch's bridge objects
+yield() // Let GC reclaim previous batch's bridge objects
+// Brief delay lets the binder buffer drain between batches,
+// preventing buffer saturation on devices with slow Python execution.
+kotlinx.coroutines.delay(100)
 fetchBatch(batchSize, offset)
```

@torlando-tech torlando-tech merged commit 8e81e15 into release/v0.9.x Mar 11, 2026
2 checks passed
@torlando-tech torlando-tech deleted the fix/binder-buffer-exhaustion-647-release branch March 11, 2026 01:48