sessions.list latency around 10s and fixed 10s pi-trajectory-flush timeout under moderate session load #75839

@BomBastikDE

Summary

We are observing consistent performance issues in OpenClaw related to session handling.

There are two related symptoms:

  1. sessions.list consistently takes about 10 to 16 seconds under moderate session load.
  2. pi-trajectory-flush regularly times out at exactly 10000 ms.

Local cleanup and pruning improve disk usage and general stability, but do not resolve the core latency.

Environment

  • OpenClaw: current local Docker deployment
  • Image: openclaw:local
  • Platform: Debian on ARM64 / Raspberry Pi
  • Storage: local filesystem
  • Deployment mode: Docker
  • Session store: agents/main/sessions/sessions.json

Observed state before local cleanup and pruning:

  • sessions.json: about 4.1 MB
  • Session entries: about 153
  • Active session directory had many trajectory and session artifact files
  • Trajectory files totaled several hundred MB

After cleanup and limiting active trajectories, sessions.list still took around 10 seconds.

Problem 1: sessions.list latency

The sessions.list command consistently takes around 10 to 16 seconds.

Observed log examples:

⇄ res ✓ sessions.list 10186ms
⇄ res ✓ sessions.list 10230ms
⇄ res ✓ sessions.list 15978ms
⇄ res ✓ sessions.list 16379ms
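
To quantify the spread, the response lines above can be summarized with a few lines of TypeScript. This is a minimal sketch: `summarizeLatencies` is a hypothetical helper (not part of OpenClaw), and the regular expression assumes the exact line shape shown in the log excerpt.

```typescript
interface LatencySummary {
  count: number;
  minMs: number;
  maxMs: number;
  meanMs: number;
}

// Hypothetical helper: collect "<method> <N>ms" durations from response log lines.
function summarizeLatencies(lines: string[], method: string): LatencySummary | null {
  // Escape regex metacharacters in the method name (e.g. the dot in "sessions.list").
  const escaped = method.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp("\\b" + escaped + " (\\d+)ms\\b");
  const values: number[] = [];
  for (const line of lines) {
    const match = pattern.exec(line);
    if (match) values.push(Number(match[1]));
  }
  if (values.length === 0) return null;
  const total = values.reduce((sum, v) => sum + v, 0);
  return {
    count: values.length,
    minMs: Math.min(...values),
    maxMs: Math.max(...values),
    meanMs: total / values.length,
  };
}

const sampleLines = [
  "⇄ res ✓ sessions.list 10186ms",
  "⇄ res ✓ sessions.list 10230ms",
  "⇄ res ✓ sessions.list 15978ms",
  "⇄ res ✓ sessions.list 16379ms",
];
const summary = summarizeLatencies(sampleLines, "sessions.list");
console.log(summary); // min 10186, max 16379, mean 13193.25 for the four lines above
```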

This continued even after:

  • removing stale plugin-runtime-deps
  • deleting stale SQLite temporary files
  • deleting stale session temporary files
  • archiving large trajectory files
  • limiting active trajectory files to about 200
  • archiving session artifacts such as .reset, .bak, .deleted, .checkpoint

The current evidence suggests that the latency comes not just from raw disk usage but from the session store loading path itself.
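
One way to narrow this down is to time the parse and clone steps in isolation on the affected hardware. The sketch below is an assumption-heavy stand-in: it builds a synthetic store of about the reported size (153 entries, roughly 4.1 MB), the entry shape is invented for illustration, and it does not touch the real sessions.json or any OpenClaw code.

```typescript
interface SyntheticEntry {
  id: string;
  updatedAt: number;
  payload: string;
}

// Synthetic stand-in for sessions.json; the real schema is not known here.
function buildSyntheticStore(entries: number, bytesPerEntry: number): Record<string, SyntheticEntry> {
  const store: Record<string, SyntheticEntry> = {};
  for (let i = 0; i < entries; i++) {
    const id = "session-" + i;
    store[id] = { id, updatedAt: i, payload: "x".repeat(bytesPerEntry) };
  }
  return store;
}

// Run a synchronous function and report the wall-clock time it took.
function timeMs<T>(fn: () => T): { result: T; elapsedMs: number } {
  const start = Date.now();
  const result = fn();
  return { result, elapsedMs: Date.now() - start };
}

const raw = JSON.stringify(buildSyntheticStore(153, 27000)); // roughly 4.1 MB
const parse = timeMs(() => JSON.parse(raw));
// Deep clone via JSON serialization, the pattern the suggestions below want to avoid.
const clone = timeMs(() => JSON.parse(JSON.stringify(parse.result)));
console.log("size=" + raw.length + " bytes, parse=" + parse.elapsedMs + "ms, clone=" + clone.elapsedMs + "ms");
```

If both steps turn out to be fast on the Raspberry Pi, the remaining time is more likely spent elsewhere in the sessions.list path, for example in per-session filesystem work.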

Problem 2: pi-trajectory-flush timeout

The agent cleanup step pi-trajectory-flush regularly times out at exactly 10000 ms.

Observed log examples:

agent cleanup timed out: runId=... sessionId=... step=pi-trajectory-flush timeoutMs=10000

Local code inspection suggests the timeout is hardcoded at (or defaults to) 10000 ms and cannot be configured externally.
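
A configurable timeout could look roughly like the following. This is a sketch under stated assumptions: the environment variable names are the ones proposed later in this issue and do not exist in OpenClaw today, and `resolveFlushTimeoutMs` and `withTimeout` are hypothetical helpers, not the current implementation.

```typescript
const DEFAULT_CLEANUP_TIMEOUT_MS = 10000; // the value seen in the timeout warnings

type Env = Record<string, string | undefined>;

// Hypothetical: prefer the step-specific variable, then the general one, then the default.
function resolveFlushTimeoutMs(env: Env): number {
  const candidates = [
    env["OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS"],
    env["OPENCLAW_AGENT_CLEANUP_TIMEOUT_MS"],
  ];
  for (const rawValue of candidates) {
    if (rawValue === undefined || rawValue.trim() === "") continue;
    const parsed = Number(rawValue);
    if (Number.isFinite(parsed) && parsed > 0) return Math.floor(parsed);
  }
  return DEFAULT_CLEANUP_TIMEOUT_MS;
}

// Reject if `work` does not settle within timeoutMs; always clear the timer afterwards.
function withTimeout<T>(work: Promise<T>, timeoutMs: number, step: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error("step=" + step + " timed out, timeoutMs=" + timeoutMs)),
      timeoutMs,
    );
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

const defaultTimeout = resolveFlushTimeoutMs({});
const tunedTimeout = resolveFlushTimeoutMs({ OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS: "30000" });
console.log(defaultTimeout, tunedTimeout);

// Simulated flush that finishes quickly; a real flush would write trajectory data.
const fakeFlush = new Promise<string>((resolve) => setTimeout(() => resolve("flushed"), 5));
withTimeout(fakeFlush, tunedTimeout, "pi-trajectory-flush")
  .then((result) => console.log(result))
  .catch((err) => console.error(err));
```

Invalid or non-positive values fall through to the next candidate, so a typo in the environment cannot disable the timeout entirely.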

Local workarounds applied

The following local mitigations were applied successfully to improve disk usage and general stability:

  • cleanup of stale plugin-runtime-deps
  • cleanup of main.sqlite.tmp-*
  • cleanup of sessions.json.*.tmp
  • archiving old and large *.trajectory.jsonl
  • limiting active trajectory files to 200
  • archiving session artifacts:
    • *.jsonl.reset.*
    • *.jsonl.bak-*
    • *.trajectory.jsonl.deleted.*
    • *.checkpoint.*.jsonl
  • local maintenance job for nightly cleanup
  • planned local configuration workaround:
    • session.maintenance.maxEntries
    • session.maintenance.pruneDays
    • OPENCLAW_SESSION_CACHE_TTL_MS=120000

These workarounds reduce pressure and improve stability, but they do not address the root cause of the observed sessions.list latency.
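
The selection logic of such a nightly cleanup job can be sketched as a pure function. Everything here is illustrative: `planArchive` and `FileInfo` are hypothetical names, the regular expressions are a translation of the glob patterns listed above, and the function only decides which files to archive without moving anything.

```typescript
// Hypothetical file descriptor; a real job would fill this from a directory listing.
interface FileInfo {
  name: string;
  mtimeMs: number;
}

// The artifact globs from the list above, expressed as regular expressions.
const ARTIFACT_PATTERNS: RegExp[] = [
  /\.jsonl\.reset\./,
  /\.jsonl\.bak-/,
  /\.trajectory\.jsonl\.deleted\./,
  /\.checkpoint\..*\.jsonl$/,
];

// Return the names of files to archive: all artifacts, plus every active
// trajectory beyond the newest `maxActiveTrajectories`.
function planArchive(files: FileInfo[], maxActiveTrajectories: number): string[] {
  const toArchive: string[] = [];
  const active: FileInfo[] = [];
  for (const file of files) {
    if (ARTIFACT_PATTERNS.some((pattern) => pattern.test(file.name))) {
      toArchive.push(file.name);
    } else if (file.name.endsWith(".trajectory.jsonl")) {
      active.push(file);
    }
  }
  active.sort((a, b) => b.mtimeMs - a.mtimeMs); // newest first
  for (const file of active.slice(maxActiveTrajectories)) {
    toArchive.push(file.name);
  }
  return toArchive;
}

const exampleFiles: FileInfo[] = [
  { name: "a.trajectory.jsonl", mtimeMs: 300 },
  { name: "b.trajectory.jsonl", mtimeMs: 100 },
  { name: "c.trajectory.jsonl", mtimeMs: 200 },
  { name: "a.jsonl.reset.1", mtimeMs: 50 },
  { name: "a.checkpoint.7.jsonl", mtimeMs: 60 },
  { name: "sessions.json", mtimeMs: 400 },
];
// A limit of 2 keeps the example small; the deployment described here uses 200.
const plan = planArchive(exampleFiles, 2);
console.log(plan);
```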

Expected behavior

  • sessions.list should remain responsive with about 100 to 200 sessions.
  • A sessions.json of a few MB should not cause consistent 10 to 16 second latency.
  • Session store cache misses should not cause long UI stalls.
  • pi-trajectory-flush timeout should be configurable or adaptive.

Suggested improvements

  1. Improve session store loading and indexing
    • avoid full parse/clone for each cache miss where possible
    • consider indexed or incremental session metadata loading
    • reduce synchronous filesystem work in the sessions.list path

  2. Improve session cache behavior
    • smarter invalidation
    • avoid unnecessary full deep clone via JSON serialization if possible
    • allow better tuning for dashboard polling patterns

  3. Make trajectory flush timeout configurable
    • for example via an environment variable such as OPENCLAW_AGENT_CLEANUP_TIMEOUT_MS
    • or allow a specific OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS

  4. Consider a lightweight sessions.list mode
    • no heavy per-session processing
    • explicit metadata depth flags
    • dashboard-oriented fast path
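
Suggestions 1, 2 and 4 could be combined along the following lines. This is only a sketch of the idea, not OpenClaw's session store: `SessionIndexCache` and its types are hypothetical, the store is assumed to be a JSON object keyed by session id with an optional `updatedAt` field, and file access is injected as plain functions so the example stays self-contained.

```typescript
interface SessionSummary {
  id: string;
  updatedAt: number;
}

interface FileStat {
  mtimeMs: number;
  size: number;
}

// Re-parses only when the file's mtime or size changes, and hands out a frozen,
// shared list of lightweight summaries instead of deep-cloning the whole store.
class SessionIndexCache {
  private signature = "";
  private summaries: readonly SessionSummary[] = [];
  private readonly stat: () => FileStat;
  private readonly read: () => string;
  parseCount = 0;

  constructor(stat: () => FileStat, read: () => string) {
    this.stat = stat;
    this.read = read;
  }

  list(): readonly SessionSummary[] {
    const current = this.stat();
    const signature = current.mtimeMs + ":" + current.size;
    if (signature !== this.signature) {
      // Assumed store shape: { [sessionId]: { updatedAt?: number, ... } }
      const parsed = JSON.parse(this.read()) as Record<string, { updatedAt?: number }>;
      this.summaries = Object.freeze(
        Object.entries(parsed).map(([id, entry]) =>
          Object.freeze({ id, updatedAt: entry.updatedAt ?? 0 }),
        ),
      );
      this.signature = signature;
      this.parseCount += 1;
    }
    return this.summaries;
  }
}

// In-memory stand-in for sessions.json and its stat result.
let fileContent = JSON.stringify({ s1: { updatedAt: 1, transcript: "..." }, s2: { updatedAt: 2 } });
let fileMtime = 1;
const cache = new SessionIndexCache(
  () => ({ mtimeMs: fileMtime, size: fileContent.length }),
  () => fileContent,
);

const first = cache.list();
const second = cache.list(); // unchanged file: served from cache, no re-parse
const parsesAfterTwoCalls = cache.parseCount;

fileContent = JSON.stringify({ s1: { updatedAt: 1 }, s2: { updatedAt: 2 }, s3: { updatedAt: 3 } });
fileMtime = 2;
const third = cache.list(); // changed file: parsed again
console.log(parsesAfterTwoCalls, cache.parseCount, third.length);
```

Returning frozen summaries trades mutability for speed: callers cannot modify the cached data, so no defensive clone is needed on each dashboard poll.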

Reproduction outline

  1. Run OpenClaw with around 100 to 200 sessions.
  2. Let agents/main/sessions/sessions.json grow to a few MB.
  3. Call sessions.list from the dashboard or API.
  4. Observe latency around 10 seconds, especially after cache expiry or cache invalidation.
  5. Run agent interactions and observe recurring pi-trajectory-flush timeout warnings at 10000 ms.

Notes

This issue was identified during operational maintenance on a Raspberry Pi-based OpenClaw installation. Filesystem cleanup and session artifact archiving significantly reduced disk usage and lowered general system pressure, but sessions.list remained slow. This suggests the remaining issue lies in the session loading and cleanup implementation rather than only in accumulated local artifacts.
