[Bug]: Matrix thread session key case-normalizes event IDs, causing duplicate stuck sessions and thread delivery failures #75670

@jarvis-mns1


Bug type

Crash (process/app exits or hangs)

Beta release blocker

No

Summary

OpenClaw lowercases Matrix event IDs when constructing thread session keys (sessionKey = ...thread:$<lowercased_event_id>), but also creates a second session with the original mixed-case event ID. This causes two compounding failures:

  1. Duplicate stuck sessions: Every Matrix thread spawns two sessions — one with the original event ID and one with the lowercased version. They deadlock each other (one reports active_embedded_run, the other no_active_work), and neither recovers.

  2. Thread reply delivery failures: When constructing m.relates_to relations for thread replies, the lowercased event ID is used. Synapse rejects these with [400] Can't send relation to unknown event because Matrix event IDs are case-sensitive per the spec.

Gateway restarts temporarily clear the stuck sessions, but any new thread reply immediately fails again because the case mismatch persists.
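To make the collision concrete, here is a minimal sketch of the two key-building behaviours described above. The function names and the `example-prefix` key prefix are hypothetical: the real OpenClaw key builder and the full key layout (elided as `...` in the logs) are not shown in this report.

```typescript
// Hypothetical helper, not OpenClaw's actual code.
// Matrix event IDs are opaque, case-sensitive strings, so the ID is used verbatim.
function buildThreadSessionKey(prefix: string, threadRootEventId: string): string {
  return `${prefix}:thread:${threadRootEventId}`;
}

// A case-folding variant, shown only to illustrate the suspected bug.
function buildThreadSessionKeyLowercased(prefix: string, threadRootEventId: string): string {
  return `${prefix}:thread:${threadRootEventId.toLowerCase()}`;
}

// Event ID taken from the stuck-session logs in this report.
const eventId = "$lSTsAlYrc_KOmteNbX6zqQxY5ZKMlYa79A7EArC4Jrg";

// "example-prefix" is a placeholder for the elided part of the real key.
const preservedKey = buildThreadSessionKey("example-prefix", eventId);
const foldedKey = buildThreadSessionKeyLowercased("example-prefix", eventId);

// The two keys differ, so any code path that mixes both builders
// ends up tracking two sessions for a single thread.
console.log(preservedKey === foldedKey); // false
```

If both code paths went through a single case-preserving builder, only one key per thread would exist.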

Steps to reproduce

  1. Configure OpenClaw with a Matrix channel and threadReplies: "always"
  2. Have a user send a message in a Matrix room thread
  3. Observe two session keys created for the same thread (e.g., thread:$lSTsAlY... and thread:$lstsaly...)
  4. Thread delivery fails with MatrixError: [400] Can't send relation to unknown event
  5. Both sessions enter state=processing and never recover

Expected behavior

  • Matrix event IDs should be treated as case-sensitive throughout the pipeline (per the Matrix spec)
  • Only one session should be created per thread, using the original event ID
  • Thread replies should use the original event ID in m.relates_to relations
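For reference, a sketch of the thread reply content that the expected behaviour implies. The `m.relates_to` field names follow the Matrix threading spec; the function name and its parameters are hypothetical and not taken from OpenClaw.

```typescript
// Shape of a threaded relation as defined by the Matrix spec (rel_type "m.thread").
interface ThreadRelation {
  rel_type: "m.thread";
  event_id: string; // thread root event ID, case-sensitive
  is_falling_back: boolean;
  "m.in_reply_to": { event_id: string };
}

interface ThreadReplyContent {
  msgtype: "m.text";
  body: string;
  "m.relates_to": ThreadRelation;
}

// Hypothetical builder: passes both event IDs through unchanged.
function buildThreadReplyContent(
  body: string,
  threadRootEventId: string,
  latestEventIdInThread: string,
): ThreadReplyContent {
  return {
    msgtype: "m.text",
    body,
    "m.relates_to": {
      rel_type: "m.thread",
      event_id: threadRootEventId,
      // true marks m.in_reply_to as a fallback for clients without thread support
      is_falling_back: true,
      "m.in_reply_to": { event_id: latestEventIdInThread },
    },
  };
}

const rootId = "$lSTsAlYrc_KOmteNbX6zqQxY5ZKMlYa79A7EArC4Jrg";
const replyContent = buildThreadReplyContent("reply text", rootId, rootId);
console.log(replyContent["m.relates_to"].event_id === rootId); // true
```

A lowercased `event_id` here refers to an event the homeserver has never seen, which matches the `[400] Can't send relation to unknown event` response.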

Actual behavior

  • Two sessions created per thread: one with original case, one lowercased
  • Both sessions deadlock (diagnostic logs show alternating active_embedded_run / no_active_work)
  • Thread replies fail: MatrixError: [400] Can't send relation to unknown event
  • 443 delivery failures logged across 3 rooms over 10 days
  • 490 unique thread event IDs (compared case-sensitively) collapse to 249 when compared case-insensitively, so nearly every thread is affected

Example stuck session pair from logs:

[diagnostic] stuck session: sessionId=unknown sessionKey=...thread:$lSTsAlYrc_KOmteNbX6zqQxY5ZKMlYa79A7EArC4Jrg state=processing age=144s
[diagnostic] stuck session: sessionId=main sessionKey=...thread:$lstsalyrc_komtenbx6zqqxy5zkmlya79a7earc4jrg state=processing age=134s
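One way to find such pairs is to group session keys by their lowercased form and report groups with more than one spelling. The sketch below is a hypothetical helper, not part of OpenClaw, and assumes the key appears as `sessionKey=<value>` with no whitespace inside the value, as in the lines above.

```typescript
// Hypothetical log-analysis helper.
// Groups session keys by lowercased form and returns only groups
// that contain more than one distinct original spelling.
function findCaseCollisions(logLines: string[]): Map<string, Set<string>> {
  const byFolded = new Map<string, Set<string>>();
  const pattern = /sessionKey=(\S+)/;

  for (const line of logLines) {
    const match = pattern.exec(line);
    const key = match ? match[1] : undefined;
    if (key === undefined) continue;

    const folded = key.toLowerCase();
    let variants = byFolded.get(folded);
    if (variants === undefined) {
      variants = new Set<string>();
      byFolded.set(folded, variants);
    }
    variants.add(key);
  }

  const collisions = new Map<string, Set<string>>();
  byFolded.forEach((variants, folded) => {
    if (variants.size > 1) collisions.set(folded, variants);
  });
  return collisions;
}

// The two diagnostic lines quoted in this report.
const sampleLines = [
  "[diagnostic] stuck session: sessionId=unknown sessionKey=...thread:$lSTsAlYrc_KOmteNbX6zqQxY5ZKMlYa79A7EArC4Jrg state=processing age=144s",
  "[diagnostic] stuck session: sessionId=main sessionKey=...thread:$lstsalyrc_komtenbx6zqqxy5zkmlya79a7earc4jrg state=processing age=134s",
];

const collisions = findCaseCollisions(sampleLines);
console.log(collisions.size); // 1
```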

Workaround

Setting threadReplies: "off" in the Matrix channel config stops both the duplicate sessions and delivery failures. All messages route to the room session instead.

OpenClaw version

2026.4.29 (a448042)

Operating system

macOS 26.2 (Darwin 25.2.0, arm64)

Install method

npm global (/opt/homebrew/lib/node_modules/openclaw), Node v25.8.1, launched via launchd (ai.openclaw.gateway)

Model

anthropic/claude-opus-4-6

Provider / routing chain

openclaw -> anthropic (direct)

Additional provider/model setup details

  • Matrix homeserver: Synapse (self-hosted, private network)
  • Active plugins: lossless-claw
  • The issue affects all Matrix rooms with threads, not specific to any room or thread

Logs, screenshots, and evidence

Stuck session diagnostic pairs (case collision visible):

2026-05-01T07:28:49.423 stuck session: sessionKey=...thread:$lSTsAlYrc_KOmteNbX6zqQxY5ZKMlYa79A7EArC4Jrg state=processing age=144s queueDepth=1
2026-05-01T07:28:49.424 stuck session: sessionKey=...thread:$lstsalyrc_komtenbx6zqqxy5zkmlya79a7earc4jrg state=processing age=134s queueDepth=0

Thread delivery failures:

2026-05-01T08:23:30.711 [delivery-recovery] Retry failed: MatrixError: [400] Can't send relation to unknown event
2026-05-01T08:38:07.270 [restart-sentinel] outbound delivery failed: MatrixError: [400] Can't send relation to unknown event

Scale: 443 "unknown event" failures logged from 2026-04-21 to 2026-05-01 across rooms !CtQaaSFRhaLfsgIJFh (196), !ZatHbfixtvTOjbQoYr (196), !dYPXGBxGPiWXDPdUnz (50).

Related issues

Partially related to #71127 (stuck sessions not auto-aborted), but this is a distinct root cause — the case normalization creates the stuck condition in the first place, and thread delivery fails independently of session state.
