fix: prevent binder buffer exhaustion causing crash loop #652

torlando-tech merged 2 commits into main
Conversation
Addresses binder IPC saturation that leads to cascading DeadObjectException crashes in Room's multi-instance invalidation.

1. Catch RemoteException on provideAlternativeRelay() binder callbacks to prevent unhandled DeadObjectException fatal crashes (COLUMBA-14, COLUMBA-30)
2. Cache getPathTableHashes() results for 30s in RoutingManager to avoid blocking binder threads for 39+ seconds on large path tables, which saturates the 1MB per-process binder buffer
3. Reduce identity restore batch size from 500 to 200 and add a 100ms delay between batches to let the binder buffer drain during initialization

Fixes #647
Fixes COLUMBA-14
Fixes COLUMBA-30

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
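The RemoteException-catching fix in item 1 can be sketched as follows. This is an illustrative, self-contained version: `RelayCallback`, `notifyClient`, and the stubbed exception classes are hypothetical stand-ins for the real Android binder types (`android.os.RemoteException` and its `DeadObjectException` subclass), so the pattern runs on a plain JVM.

```kotlin
// Stubs standing in for android.os.RemoteException / DeadObjectException,
// so this sketch runs outside Android. DeadObjectException extends
// RemoteException, which is why catching RemoteException covers both.
open class RemoteException(msg: String) : Exception(msg)
class DeadObjectException : RemoteException("binder target died")

// Hypothetical callback interface standing in for the AIDL-generated one.
fun interface RelayCallback {
    fun provideAlternativeRelay(relayHash: String)
}

// Invoke the callback defensively: a dead client process must not take
// the service down with it.
fun notifyClient(callback: RelayCallback, relayHash: String): Boolean =
    try {
        callback.provideAlternativeRelay(relayHash)
        true
    } catch (e: RemoteException) {
        // Caught and logged; the client is gone, nothing else to do.
        println("client dead, dropping callback: ${e.message}")
        false
    }

fun main() {
    val dead = RelayCallback { throw DeadObjectException() }
    println(notifyClient(dead, "abcd1234")) // prints false, no crash
}
```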
Greptile Summary

This PR addresses a binder-buffer-exhaustion crash loop (COLUMBA-14, COLUMBA-30) through three coordinated changes: catching RemoteException on binder callbacks, caching path-table results, and throttling identity-restore batches.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller as Caller (ViewModel/Coroutine)
    participant RM as RoutingManager
    participant Cache as In-Memory Cache
    participant PY as Python (Chaquopy)
    participant Binder as Android Binder (1MB buffer)
    Note over Caller,Binder: Cache HIT path (within 30s TTL)
    Caller->>RM: getPathTableHashes()
    RM->>Cache: read cachedPathTableJson
    Cache-->>RM: non-null & fresh
    RM-->>Caller: return cached JSON (fast)
    Note over Caller,Binder: Cache MISS path (cold start / expired)
    Caller->>RM: getPathTableHashes() [@Synchronized — blocks thread]
    RM->>Cache: read cachedPathTableJson
    Cache-->>RM: null or stale
    RM->>PY: get_path_table() [up to 39s]
    PY-->>Binder: large result over IPC
    Binder-->>PY: ok
    PY-->>RM: JSONArray string
    RM->>Cache: write result + timestamp
    RM-->>Caller: return fresh JSON
    Note over Caller,Binder: Error / wrapper null path
    RM->>PY: get_path_table()
    PY-->>RM: Exception / null
    RM->>Cache: read stale cachedPathTableJson
    Cache-->>RM: stale or null
    RM-->>Caller: return stale JSON or "[]"
    Note over Caller,Binder: provideAlternativeRelay binder callback
    Caller->>RM: alternativeRelayHandler(excludeHashes)
    RM-->>Caller: alternativeRelay
    Caller->>Binder: provideAlternativeRelay(alternativeRelay)
    alt Service alive
        Binder-->>Caller: ok
    else DeadObjectException / RemoteException
        Binder-->>Caller: RemoteException (caught, logged, swallowed)
    end
```

Last reviewed commit: 67c3466
app/src/main/java/com/lxmf/messenger/service/manager/RoutingManager.kt (two outdated review threads, resolved)
- Add @Synchronized to prevent concurrent callers from both bypassing the cache and invoking the slow Python call simultaneously
- Only cache successful results; on error return stale cache if available, otherwise "[]" — prevents a 30s lockout on transient failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
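The caching policy in the commit above can be sketched as a small self-contained class. `PathTableCache` and its `get` signature are illustrative names (the real logic lives inside `RoutingManager.getPathTableHashes()`); time is passed in explicitly so the TTL behaviour is easy to exercise.

```kotlin
// Minimal sketch of the caching policy: successful results are cached for
// a TTL; on failure the stale cache (or "[]") is returned instead of
// locking callers out for the whole TTL.
class PathTableCache(private val ttlMs: Long = 30_000) {
    private var cachedJson: String? = null
    private var cachedAt: Long = 0L

    @Synchronized
    fun get(now: Long, fetch: () -> String?): String {
        // Fresh hit: skip the slow fetch entirely.
        cachedJson?.let { if (now - cachedAt < ttlMs) return it }
        val fresh = fetch() // stands in for the slow Python call
        return if (fresh != null) {
            cachedJson = fresh // only successful results are cached
            cachedAt = now
            fresh
        } else {
            cachedJson ?: "[]" // stale fallback on transient failure
        }
    }
}

fun main() {
    val cache = PathTableCache()
    println(cache.get(0) { """["a","b"]""" })          // miss: fetches and caches
    println(cache.get(10_000) { error("not called") }) // hit within TTL, fetch skipped
    println(cache.get(40_000) { null })                // expired + failure: stale fallback
}
```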
```kotlin
} else {
    // Brief delay lets the binder buffer drain between batches,
    // preventing buffer saturation on devices with slow Python execution.
    kotlinx.coroutines.delay(100)
```
**Fully-qualified `delay` call is redundant given existing import**

`InterfaceConfigManager.kt` already imports `kotlinx.coroutines.delay` at the top of the file (line 22), so the fully-qualified reference `kotlinx.coroutines.delay(100)` is unnecessary and inconsistent with how `yield()` is used just below it.

```suggestion
delay(100)
```
```kotlin
kotlinx.coroutines.delay(100)
yield() // Let GC reclaim previous batch's bridge objects
```
**`yield()` is redundant after `delay()`**

`kotlinx.coroutines.delay(100)` is already a suspension point — it suspends the current coroutine for 100ms, during which the scheduler is free to run any other pending coroutines. Calling `yield()` immediately after provides no additional benefit; the scheduler already had 100ms to work with. The original purpose of `yield()` (letting GC reclaim the previous batch's bridge objects) is already served by the suspension inside `delay`.

```suggestion
delay(100) // Let binder buffer drain and GC reclaim previous batch's bridge objects
```
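The delay discussed in this thread throttles the batched identity restore. A runnable sketch of that batch-and-pause pattern follows; `restoreInBatches` and its parameters are illustrative names, not the real `InterfaceConfigManager` API, and only the batch size (200) and the 100ms pause come from the PR description.

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// Chunk the work and pause between batches so the binder buffer can drain.
suspend fun restoreInBatches(
    identities: List<String>,
    batchSize: Int = 200,
    restoreBatch: suspend (List<String>) -> Unit,
): Int {
    val batches = identities.chunked(batchSize)
    for ((i, batch) in batches.withIndex()) {
        restoreBatch(batch) // real code re-registers identities over binder
        if (i < batches.lastIndex) {
            delay(100) // let the binder buffer drain before the next batch
        }
    }
    return batches.size
}

fun main() = runBlocking {
    val n = restoreInBatches(List(450) { "identity-$it" }) { batch ->
        println("restored batch of ${batch.size}")
    }
    println("batches=$n") // 450 identities -> batches of 200, 200, 50
}
```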
```kotlin
@Synchronized
fun getPathTableHashes(): String {
```
**`@Synchronized` can starve coroutine dispatcher threads on a cache miss**

`@Synchronized` takes the JVM monitor on the `RoutingManager` instance. If `getPathTableHashes()` is called concurrently from coroutines running on `Dispatchers.IO` while the cache is cold/expired, every waiting coroutine pins a thread-pool thread at the lock boundary for the full duration of the Python call (up to ~39s). With enough concurrent callers this can exhaust the `Dispatchers.IO` thread pool (64 threads by default on Android) and stall other IO operations.

A coroutine-friendly alternative is a `kotlinx.coroutines.sync.Mutex`, which suspends rather than blocks the thread:

```kotlin
private val cacheMutex = Mutex()

suspend fun getPathTableHashes(): String {
    val now = System.currentTimeMillis()
    val cached = cachedPathTableJson
    if (cached != null && (now - cachedPathTableTimestamp) < PATH_TABLE_CACHE_TTL_MS) {
        return cached
    }
    return cacheMutex.withLock {
        // Re-check inside the lock in case another coroutine already refreshed
        val rechecked = cachedPathTableJson
        if (rechecked != null && (now - cachedPathTableTimestamp) < PATH_TABLE_CACHE_TTL_MS) {
            return@withLock rechecked
        }
        val result = wrapperManager.withWrapper { wrapper ->
            try {
                val pyResult = wrapper.callAttr("get_path_table")
                val hashes = pyResult.asList().map { it.toString() }
                JSONArray(hashes).toString()
            } catch (e: Exception) {
                Log.e(TAG, "Error getting path table hashes", e)
                null
            }
        }
        if (result != null) {
            cachedPathTableJson = result
            cachedPathTableTimestamp = System.currentTimeMillis()
            result
        } else {
            cachedPathTableJson ?: "[]"
        }
    }
}
```

If making the function `suspend` is not feasible right now, at minimum ensuring it is always dispatched to `Dispatchers.IO` at the call sites will contain the thread-blocking to the IO pool.
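A runnable toy version of this double-checked `Mutex` pattern can demonstrate that concurrent cold-cache callers trigger only one fetch. `SuspendingCache` is an illustrative reduction, not the proposed `RoutingManager` code:

```kotlin
import java.util.concurrent.atomic.AtomicInteger
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Double-checked suspending cache: the fast path takes no lock, and the
// re-check inside the Mutex ensures only one coroutine runs the slow fetch.
class SuspendingCache {
    private val mutex = Mutex()
    @Volatile private var cached: String? = null

    suspend fun get(fetch: suspend () -> String): String {
        cached?.let { return it } // fast path, lock-free
        return mutex.withLock {
            cached ?: fetch().also { cached = it } // re-check inside the lock
        }
    }
}

fun main() = runBlocking {
    val cache = SuspendingCache()
    val fetches = AtomicInteger(0)
    val results = (1..20).map {
        async {
            cache.get {
                fetches.incrementAndGet()
                delay(50) // stands in for the slow Python call
                """["hash"]"""
            }
        }
    }.awaitAll()
    // Only one fetch despite 20 concurrent callers.
    println("fetches=${fetches.get()} distinct=${results.toSet().size}")
}
```

Unlike `@Synchronized`, the 19 waiting coroutines here suspend at `withLock` without pinning dispatcher threads.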
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Summary

- Catch `RemoteException` on `provideAlternativeRelay()` binder callbacks to prevent unhandled `DeadObjectException` fatal crashes (COLUMBA-14: 95 events/27 users, COLUMBA-30: 4 events/4 users)
- Cache `getPathTableHashes()` results for 30s in `RoutingManager` to avoid blocking binder threads for 39+ seconds on large path tables, which saturates the 1MB per-process binder buffer

Root Cause

Binder transactions to `IReticulumService` (particularly `getPathTableHashes`, code 19) were taking 39-44 seconds, holding binder threads and filling the 1MB per-process binder buffer. Once saturated, Room's `enableMultiInstanceInvalidation()` binder IPC also fails with `DeadObjectException`, causing a crash loop that doesn't self-recover.

Test plan

- `compileNoSentryDebugKotlin` builds clean
- `AnnounceStreamViewModelTest` passes (exercises `getPathTableHashes` path)

Fixes #647
Fixes COLUMBA-14
Fixes COLUMBA-30
🤖 Generated with Claude Code