
[Matrix] Lowercased room ID in delivery-recovery causes sync loop crash and permanent message loss #57321

@dlardo

Summary

A failed delivery-recovery entry with a lowercased Matrix room ID permanently kills the Matrix sync loop on gateway restart. The gateway process stays up, but Matrix is dead: no inbound messages are processed. The poisoned delivery entry persists across restarts, so the sync loop crashes again every time the gateway starts.

Related: #19278 (same root cause — room ID case normalization)

Root cause

Session keys normalize Matrix room IDs to lowercase (e.g. !IEjZDNPucuFvKLrAQC:server → !iejzdnpucufvklraqc:server). During normal operation, the Matrix SDK sends messages using the correct-case room ID from sync state, so this is invisible. However, when a message is queued for delivery recovery, the lowercased room ID from the session key is stored as the "to" target. On retry, Synapse returns 403 because room IDs are case-sensitive and the lowercased ID does not match any room the bot has joined.
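To make the failure mode concrete, here is a minimal sketch of how the case information is lost and one way it could be recovered before a retry. Everything in it is an assumption for illustration: the function names, the `matrix:` key prefix, and the idea of resolving against a list of joined room IDs are hypothetical and not taken from the OpenClaw codebase.

```typescript
// Hypothetical sketch: none of these names or formats come from OpenClaw.

// Mimics the normalization described above: the session key lowercases the room ID.
// The "matrix:" prefix is an invented key format.
function sessionKeyFor(roomId: string): string {
  return `matrix:${roomId.toLowerCase()}`;
}

// Maps a possibly-lowercased room ID back to the canonical one by comparing
// case-insensitively against the rooms the client has actually joined.
// Returns undefined when nothing matches or the match is ambiguous.
function resolveCanonicalRoomId(
  candidate: string,
  joinedRoomIds: readonly string[],
): string | undefined {
  if (joinedRoomIds.includes(candidate)) return candidate; // already canonical
  const wanted = candidate.toLowerCase();
  const matches = joinedRoomIds.filter((id) => id.toLowerCase() === wanted);
  return matches.length === 1 ? matches[0] : undefined;
}

const joined = ["!IEjZDNPucuFvKLrAQC:server", "!otherRoom:server"];
const key = sessionKeyFor(joined[0]);

// What ends up as the "to" target if it is derived from the session key.
const storedTarget = key.slice("matrix:".length);

// Recovering the canonical ID before retrying the send.
const canonical = resolveCanonicalRoomId(storedTarget, joined);
```

Because two distinct rooms could in principle differ only by case, the lookup refuses to guess when more than one joined room matches. Storing the original room ID in the delivery entry avoids that ambiguity entirely and is likely the more robust fix.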

Steps to reproduce

  1. Configure Matrix channel with a room that has mixed-case characters in its room ID
  2. Trigger a gateway crash while a message send is in-flight (e.g. via Anthropic API overload → websocket reconnect failure)
  3. A delivery entry is persisted with the lowercased room ID
  4. Restart the gateway
  5. delivery-recovery retries the send with the lowercased room ID
  6. Synapse returns 403 M_FORBIDDEN: User @bot:server not in room !lowercase...

Observed behavior

Queue 'message' giving up on event ~!iejzdnpucufvklraqc:matrix.lucidpacket.com:m1774818664725.0
[delivery-recovery] Retry failed for delivery 9b8f01a2: MatrixError: [403] User @dax:matrix.lucidpacket.com not in room !iejzdnpucufvklraqc:matrix.lucidpacket.com
[MatrixClient.sync] Sync no longer running: exiting.
[MatrixClient] FetchHttpApi: <-- GET .../sync [76ms AbortError: This operation was aborted]

After this, the gateway is running but Matrix sync is permanently dead. The delivery entry persists in delivery-queue/, so every subsequent restart triggers the same crash.

Expected behavior

  1. Room IDs should preserve original case in delivery entries (or be mapped back to the canonical form before retry)
  2. A failed delivery should not crash the Matrix sync loop — it should be moved to failed/ and sync should continue
  3. Delivery entries should expire after N retries instead of retrying indefinitely across restarts
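Points 2 and 3 could be combined roughly as follows. This is a sketch under stated assumptions, not the actual delivery-recovery code: the `DeliveryEntry` shape, the `attempts` counter, the `recoverDeliveries` function, and the default of three attempts are all hypothetical. The send function is injected so that a rejected send is caught per entry and never reaches the sync loop.

```typescript
// Hypothetical sketch: types and names are illustrative, not OpenClaw's.

interface DeliveryEntry {
  id: string;
  to: string;       // target room ID
  attempts: number; // retries already made, persisted with the entry
}

type SendFn = (entry: DeliveryEntry) => Promise<void>;

interface RecoveryResult {
  delivered: DeliveryEntry[];
  pending: DeliveryEntry[]; // stay in the queue for a later pass
  failed: DeliveryEntry[];  // should be moved to failed/ and not retried
}

async function recoverDeliveries(
  entries: DeliveryEntry[],
  send: SendFn,
  maxAttempts = 3,
): Promise<RecoveryResult> {
  const result: RecoveryResult = { delivered: [], pending: [], failed: [] };
  for (const entry of entries) {
    try {
      await send(entry);
      result.delivered.push(entry);
    } catch {
      // The error is contained here, so one bad entry cannot take down
      // the caller (for example, the sync loop) or block other entries.
      const updated = { ...entry, attempts: entry.attempts + 1 };
      const bucket = updated.attempts >= maxAttempts ? result.failed : result.pending;
      bucket.push(updated);
    }
  }
  return result;
}
```

The caller would persist the updated `attempts` count for pending entries and move failed ones out of the queue, so the retry budget survives gateway restarts instead of resetting each time.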

Workaround

Manually move the poisoned delivery JSON out of ~/.openclaw/delivery-queue/ into its failed/ subdirectory (delivery-queue/failed/), then restart the gateway.
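For anyone who wants to script the workaround, a minimal sketch using only Node's built-in modules is shown below. The helper name `quarantineDelivery` is hypothetical, and the sketch assumes nothing about how delivery files are named: the file name of the poisoned entry has to be supplied by hand.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Queue location as given in the workaround above.
const defaultQueueDir = path.join(os.homedir(), ".openclaw", "delivery-queue");

// Hypothetical helper: moves one queued delivery file into the failed/
// subdirectory of the queue and returns the new path.
function quarantineDelivery(queueDir: string, fileName: string): string {
  const failedDir = path.join(queueDir, "failed");
  fs.mkdirSync(failedDir, { recursive: true }); // no-op if it already exists
  const target = path.join(failedDir, fileName);
  fs.renameSync(path.join(queueDir, fileName), target);
  return target;
}
```

Calling `quarantineDelivery(defaultQueueDir, fileName)` with the name of the poisoned JSON file, while the gateway is stopped, has the same effect as the manual move.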

Environment

  • OpenClaw: 2026.3.24
  • Homeserver: Synapse (self-hosted)
  • OS: Manjaro Linux (x64)
  • Install method: npm
