fix: prevent binder buffer exhaustion causing crash loop #654

torlando-tech merged 2 commits into release/v0.9.x from
Conversation
Addresses binder IPC saturation that leads to cascading `DeadObjectException` crashes in Room's multi-instance invalidation.

1. Catch `RemoteException` on `provideAlternativeRelay()` binder callbacks to prevent unhandled `DeadObjectException` fatal crashes (COLUMBA-14, COLUMBA-30)
2. Cache `getPathTableHashes()` results for 30s in `RoutingManager` to avoid blocking binder threads for 39+ seconds on large path tables, which saturates the 1MB per-process binder buffer
3. Reduce identity restore batch size from 500 to 200 and add 100ms delay between batches to let the binder buffer drain during initialization

Fixes #647
Fixes COLUMBA-14
Fixes COLUMBA-30

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
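As an editor's sketch, the batch throttling described in item 3 might look like the loop below. `restoreIdentities` and the `fetchBatch` parameter are hypothetical names for illustration, not the PR's actual API; only the batch size, delay value, and drain-between-batches idea come from the PR.

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// Hypothetical sketch of the throttled restore loop: smaller batches plus a
// short pause let the binder buffer drain between transactions instead of
// queueing 500 identities at once.
const val BATCH_SIZE = 200          // reduced from 500 in the PR
const val DRAIN_DELAY_MS = 100L     // pause between batches

suspend fun restoreIdentities(
    total: Int,
    fetchBatch: (size: Int, offset: Int) -> List<String>,
) {
    var offset = 0
    while (offset < total) {
        val batch = fetchBatch(BATCH_SIZE, offset)
        if (batch.isEmpty()) break
        // ... process batch (send over binder, insert into Room, etc.) ...
        offset += batch.size
        if (offset < total) delay(DRAIN_DELAY_MS)  // let the binder buffer drain
    }
}

fun main() = runBlocking {
    val data = List(450) { "id$it" }
    var calls = 0
    restoreIdentities(data.size) { size, off ->
        calls++
        data.drop(off).take(size)
    }
    println(calls)  // 3 batches: 200 + 200 + 50
}
```

The delay trades a few hundred milliseconds of total restore time for bounded binder buffer occupancy during initialization.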
- Add `@Synchronized` to prevent concurrent callers from both bypassing the cache and invoking the slow Python call simultaneously
- Only cache successful results; on error return stale cache if available, otherwise `"[]"` — prevents 30s lockout on transient failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile Summary

This PR addresses a binder buffer exhaustion crash loop by: (1) catching RemoteException on binder callbacks, (2) caching slow getPathTableHashes() results, and (3) throttling identity restore batches.

Key changes:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant BT as Binder Thread
participant RM as RoutingManager
participant Cache as Path Table Cache
participant PY as Python (get_path_table)
Note over BT,PY: Before fix — every call blocks binder thread 39+ s
BT->>RM: getPathTableHashes()
RM->>PY: callAttr("get_path_table")
PY-->>RM: result (39 s later)
RM-->>BT: JSON string
Note over BT,PY: After fix — cache hit (TTL 30 s)
BT->>RM: getPathTableHashes()
RM->>Cache: check (now - ts < 30 s?)
Cache-->>RM: hit → return cached JSON
RM-->>BT: JSON string (fast)
Note over BT,PY: After fix — cache miss (first call / TTL expired)
BT->>RM: getPathTableHashes() [@Synchronized]
RM->>Cache: check → miss
RM->>PY: callAttr("get_path_table")
Note right of BT: Other callers BLOCK on lock<br/>for full Python call duration
PY-->>RM: result
RM->>Cache: update cachedPathTableJson + timestamp
RM-->>BT: JSON string
Note over BT,PY: ServiceReticulumProtocol — RemoteException guard
BT->>ServiceReticulumProtocol: onAlternativeRelayRequested()
ServiceReticulumProtocol->>Handler: handler(excludeHashes)
Handler-->>ServiceReticulumProtocol: alternativeRelay
ServiceReticulumProtocol->>IReticulumService: provideAlternativeRelay(relay)
IReticulumService--xServiceReticulumProtocol: RemoteException (service dead)
ServiceReticulumProtocol->>ServiceReticulumProtocol: catch RemoteException → log + return (no crash)
Last reviewed commit: fe689ec
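The RemoteException guard at the bottom of the diagram can be sketched as follows. This is a minimal off-device sketch: the exception classes are stubbed so it runs outside Android (on-device they come from `android.os`), and `deliverRelay` plus the single-method `IReticulumService` shape are assumptions based on the diagram, not the project's real AIDL interface.

```kotlin
// Stub exception hierarchy mirroring android.os: DeadObjectException is a
// subclass of RemoteException, so one catch covers both.
open class RemoteException(msg: String) : Exception(msg)
class DeadObjectException : RemoteException("binder target died")

// Hypothetical slice of the binder interface used in the diagram.
interface IReticulumService {
    fun provideAlternativeRelay(relay: String)
}

// The guard: a dead service process surfaces as RemoteException on the
// outgoing binder call; log-and-return instead of letting it crash the app.
fun deliverRelay(service: IReticulumService, relay: String): Boolean =
    try {
        service.provideAlternativeRelay(relay)
        true
    } catch (e: RemoteException) {
        false // service died mid-callback; caller treats relay as undelivered
    }

fun main() {
    val dead = object : IReticulumService {
        override fun provideAlternativeRelay(relay: String) =
            throw DeadObjectException()
    }
    println(deliverRelay(dead, "relay-1"))  // false (exception swallowed, no crash)
}
```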
```kotlin
@Synchronized
fun getPathTableHashes(): String {
    val now = System.currentTimeMillis()
    val cached = cachedPathTableJson
    if (cached != null && (now - cachedPathTableTimestamp) < PATH_TABLE_CACHE_TTL_MS) {
        return cached
    }

    val result: String? =
        wrapperManager.withWrapper { wrapper ->
            try {
                val pyResult = wrapper.callAttr("get_path_table")
                // Convert Python list to JSON array
                val hashes = pyResult.asList().map { it.toString() }
                JSONArray(hashes).toString()
            } catch (e: Exception) {
                Log.e(TAG, "Error getting path table hashes", e)
                null
            }
        } ?: run {
            Log.w(TAG, "getPathTableHashes called but wrapper is null")
            null
        }

    if (result != null) {
        cachedPathTableJson = result
        cachedPathTableTimestamp = System.currentTimeMillis()
        return result
    }
    // Return stale cache on error if available, otherwise empty
    return cached ?: "[]"
}
```
@Synchronized holds lock for entire Python call duration
@Synchronized acquires the lock on this for the entire body of getPathTableHashes(), including the wrapperManager.withWrapper Python call that can take 39+ seconds. If multiple binder threads call this method concurrently after the TTL expires, every caller serializes on the lock and blocks for the full duration of the Python call. With a 16-thread binder pool, all 16 threads could be stuck waiting here — reproducing the very binder-thread exhaustion the cache is meant to prevent, albeit less frequently (once every 30 s instead of on every call).
A safer pattern is to perform the slow call outside any lock, and only hold the lock for the cheap cache read/write:

```kotlin
fun getPathTableHashes(): String {
    // Fast path: read cache without holding lock long
    val now = System.currentTimeMillis()
    synchronized(this) {
        val c = cachedPathTableJson
        if (c != null && (now - cachedPathTableTimestamp) < PATH_TABLE_CACHE_TTL_MS) return c
    }
    // Slow path: Python call with NO lock held — threads may race here, but
    // at worst two concurrent callers do one Python call each (not 16).
    val fresh = wrapperManager.withWrapper { wrapper ->
        try {
            JSONArray(wrapper.callAttr("get_path_table").asList().map { it.toString() }).toString()
        } catch (e: Exception) {
            Log.e(TAG, "Error getting path table hashes", e)
            null
        }
    }
    return synchronized(this) {
        if (fresh != null) {
            cachedPathTableJson = fresh
            cachedPathTableTimestamp = System.currentTimeMillis()
            fresh
        } else {
            cachedPathTableJson ?: "[]"
        }
    }
}
```

This allows at most a small burst of duplicate Python calls on a cold-cache event, but no thread waits more than microseconds to acquire the lock.
Path: app/src/main/java/com/lxmf/messenger/service/manager/RoutingManager.kt, lines 112-138
```kotlin
@Volatile private var cachedPathTableJson: String? = null
@Volatile private var cachedPathTableTimestamp: Long = 0L
```
@Volatile is redundant given @Synchronized
cachedPathTableJson and cachedPathTableTimestamp are only ever read and written inside getPathTableHashes(), which is annotated @Synchronized. The synchronized block already provides the happens-before relationship and full memory visibility that @Volatile would add, so the annotations are redundant. Consider removing @Volatile to reduce visual noise and avoid implying these fields are safe to access unsynchronized from other code paths.
```suggestion
// Cached path table result and timestamp to avoid repeated slow Python calls
private var cachedPathTableJson: String? = null
private var cachedPathTableTimestamp: Long = 0L
```
Path: app/src/main/java/com/lxmf/messenger/service/manager/RoutingManager.kt, lines 33-35
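A minimal runnable sketch of the visibility argument above. The `Cache` class and `getOrPut` are hypothetical; only the locking pattern (plain fields read and written exclusively inside one `@Synchronized` member, no `@Volatile`) mirrors the code under review.

```kotlin
import kotlin.concurrent.thread

// Fields touched only inside a @Synchronized member get their happens-before
// edges from the monitor, so @Volatile adds nothing here.
class Cache {
    private var cachedJson: String? = null   // no @Volatile needed
    private var cachedAt: Long = 0L

    @Synchronized
    fun getOrPut(compute: () -> String): String {
        cachedJson?.let { return it }        // read under the lock
        val fresh = compute()
        cachedJson = fresh                   // write under the same lock
        cachedAt = System.currentTimeMillis()
        return fresh
    }
}

fun main() {
    val cache = Cache()
    var computes = 0
    // 8 threads hammer the cache; the monitor both serializes the first
    // computation and publishes the cached value to every later reader.
    val workers = List(8) {
        thread { repeat(1000) { cache.getOrPut { computes++; "[]" } } }
    }
    workers.forEach { it.join() }
    println(computes)  // 1: only the first caller computes
}
```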
```kotlin
// Brief delay lets the binder buffer drain between batches,
// preventing buffer saturation on devices with slow Python execution.
kotlinx.coroutines.delay(100)
yield() // Let GC reclaim previous batch's bridge objects
fetchBatch(batchSize, offset)
```
yield() should precede delay() for GC timeliness
The comment directly above yield() says it exists to "let GC reclaim previous batch's bridge objects," but the 100 ms delay() runs first, keeping those Chaquopy bridge objects live for an additional 100 ms before the GC hint is issued. While delay() does suspend the coroutine and allows other work to run, it does not emit a GC hint. Swapping the order ensures the GC suggestion fires immediately after the batch is processed, and the buffer-drain pause follows:
```suggestion
yield() // Let GC reclaim previous batch's bridge objects
// Brief delay lets the binder buffer drain between batches,
// preventing buffer saturation on devices with slow Python execution.
kotlinx.coroutines.delay(100)
```
Path: app/src/main/java/com/lxmf/messenger/service/InterfaceConfigManager.kt, lines 426-430
Summary
- Catch `RemoteException` on `provideAlternativeRelay()` binder callbacks to prevent unhandled `DeadObjectException` fatal crashes (COLUMBA-14: 95 events/27 users; COLUMBA-30: 4 events/4 users)
- Cache `getPathTableHashes()` results for 30s in `RoutingManager` (thread-safe, errors not cached) to avoid blocking binder threads for 39+ seconds on large path tables, which saturates the 1MB per-process binder buffer
- Reduce identity restore batch size from 500 to 200 with a 100ms delay between batches so the binder buffer can drain during initialization

Root Cause

Binder transactions to `IReticulumService` (particularly `getPathTableHashes`, code 19) were taking 39-44 seconds, holding binder threads and filling the 1MB per-process binder buffer. Once saturated, Room's `enableMultiInstanceInvalidation()` binder IPC also fails with `DeadObjectException`, causing a crash loop that doesn't self-recover.

Cherry-pick of ba43d47, 67c3466 from `fix/binder-buffer-exhaustion-647` (PR #652).

Test plan

- `compileNoSentryDebugKotlin` builds clean
- `AnnounceStreamViewModelTest` passes

Fixes #647
Fixes COLUMBA-14
Fixes COLUMBA-30
🤖 Generated with Claude Code