Skip to content

feat: add configurable trajectory flush timeout for reasoning traces#78133

Closed
stevenepalmer wants to merge 3 commits into
openclaw:mainfrom
stevenepalmer:feature/trajectory-flush-timeout
Closed

feat: add configurable trajectory flush timeout for reasoning traces#78133
stevenepalmer wants to merge 3 commits into
openclaw:mainfrom
stevenepalmer:feature/trajectory-flush-timeout

Conversation

@stevenepalmer

@stevenepalmer stevenepalmer commented May 5, 2026

Copy link
Copy Markdown
Contributor

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: trajectory runtime data only flushed on explicit boundaries/shutdown, so long reasoning sessions could leave recent events buffered too long.
  • Why it matters: operators lose timely on-disk reasoning trace visibility during long-running sessions and can miss state before an abrupt process exit.
  • What changed: added trajectory.flushTimeoutMs, wired it into config typing/schema, and validated periodic flush behavior with focused runtime tests.
  • What did NOT change (scope boundary): no change to trajectory storage format, retention policy, truncation budget, or default behavior when flushTimeoutMs is unset.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Real behavior proof (required for external PRs)

  • Behavior or issue addressed: periodic flushing of in-flight trajectory events during long reasoning sessions.
  • Real environment tested: local checkout of openclaw/openclaw on Linux with branch pr-78133.
  • Exact steps or command run after this patch:
    • git checkout pr-78133
    • node scripts/run-vitest.mjs run --config test/vitest/vitest.unit-fast.config.ts src/trajectory/runtime.test.ts --reporter=verbose
  • Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output):
✓ src/trajectory/runtime.test.ts > trajectory runtime > fires periodic flush after flushTimeoutMs
Test Files  1 passed (1)
Tests       9 passed (9)
  • Observed result after fix: the runtime test that specifically exercises periodic flushing passes, and the config is now registered in OpenClawConfig and the zod schema so the setting is actually usable end-to-end.
  • What was not tested: full live multi-hour reasoning run with forced crash/restart while periodic flushing is enabled.
  • Before evidence (optional but encouraged): before this branch, there was no registered trajectory.flushTimeoutMs config path and no periodic-flush runtime coverage.

Root Cause (if applicable)

  • Root cause: the trajectory runtime had no supported configuration path for periodic flushing, so in practice flushing happened only at explicit lifecycle boundaries.
  • Missing detection / guardrail: config typing/schema coverage did not fail when the new trajectory setting was absent from top-level config registration, and there was no focused runtime test covering periodic flush timing.
  • Contributing context (if known): review follow-up also tightened timer lifecycle behavior so the periodic timer runs independently and does not reschedule incorrectly after flush.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/trajectory/runtime.test.ts
  • Scenario the test should lock in: when flushTimeoutMs is configured, the runtime periodically flushes buffered events without waiting for shutdown or turn/session end.
  • Why this is the smallest reliable guardrail: the timing behavior belongs to the runtime recorder itself, and the focused runtime test covers timer creation, flush execution, and continued runtime semantics without requiring a full gateway boot.
  • Existing test that already covers this (if any): src/trajectory/runtime.test.ts now includes fires periodic flush after flushTimeoutMs.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Users/operators can now configure trajectory.flushTimeoutMs to periodically flush in-flight trajectory events to disk during long reasoning sessions. Default behavior remains unchanged when unset.

Diagram (if applicable)

Before:
[reasoning events buffered] -> [flush only on explicit boundary/shutdown]

After:
[reasoning events buffered] -> [periodic timer fires] -> [flush to disk]

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: local repo checkout
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): trajectory.flushTimeoutMs=<set in test scenario>

Steps

  1. Check out pr-78133.
  2. Run node scripts/run-vitest.mjs run --config test/vitest/vitest.unit-fast.config.ts src/trajectory/runtime.test.ts --reporter=verbose.
  3. Confirm the periodic flush test passes.

Expected

  • Runtime accepts configured periodic flush timeout.
  • Periodic flush test passes.

Actual

  • Passed: Test Files 1 passed, Tests 9 passed, including the periodic flush scenario.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: config path is present in typing/schema diff; focused runtime test passes on branch pr-78133.
  • Edge cases checked: existing runtime tests for sanitization, truncation, pointer writing, and disabled mode still pass.
  • What you did not verify: live production trajectory files during a real long-running reasoning session.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) Yes
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps: Optional new config trajectory.flushTimeoutMs may be added; otherwise no action required.

Risks and Mitigations

  • Risk: too-small flush interval could increase disk I/O.
    • Mitigation: setting is opt-in and defaults to current behavior when unset.
  • Risk: timer lifecycle bugs could cause duplicate or missing flushes.
    • Mitigation: focused runtime test coverage locks in periodic flush behavior.

Real behavior proof

$ node scripts/run-vitest.mjs run --config test/vitest/vitest.unit-fast.config.ts src/trajectory/runtime.test.ts --reporter=verbose

 RUN  v4.1.5 /tmp/pincer-openclaw

 ✓  unit-fast  src/trajectory/runtime.test.ts > trajectory runtime > resolves a session-adjacent trajectory file by default 2ms
 ✓  unit-fast  src/trajectory/runtime.test.ts > trajectory runtime > sanitizes session ids when resolving an override directory 1ms
 ✓  unit-fast  src/trajectory/runtime.test.ts > trajectory runtime > records sanitized runtime events by default 5ms
 ✓  unit-fast  src/trajectory/runtime.test.ts > trajectory runtime > bounds large runtime event fields before serialization 2ms
 ✓  unit-fast  src/trajectory/runtime.test.ts > trajectory runtime > stops runtime capture at the file budget and records a truncation event 3ms
 ✓  unit-fast  src/trajectory/runtime.test.ts > trajectory runtime > writes a session-adjacent pointer when using an override directory 1ms
 ✓  unit-fast  src/trajectory/runtime.test.ts > trajectory runtime > keeps pointer write flags usable when O_NOFOLLOW is unavailable 0ms
 ✓  unit-fast  src/trajectory/runtime.test.ts > trajectory runtime > does not record runtime events when explicitly disabled 0ms
 ✓  unit-fast  src/trajectory/runtime.test.ts > trajectory runtime > fires periodic flush after flushTimeoutMs 4ms

 Test Files  1 passed (1)
       Tests  9 passed (9)
    Start at  22:35:51
    Duration  379ms

The updated test advances fake timers through three consecutive 5-second intervals and asserts flushCalls.toHaveLength(3), proving the timer repeats for the lifetime of the recorder. Previously the test only checked one tick.

Behavior or issue addressed: Periodic trajectory flush timer was single-shot — only fired once per session.
Real environment tested: Node.js v24.14.0, vitest v4.1.5, /tmp/pincer-openclaw on Linux.
Exact steps or command run after this patch: node scripts/run-vitest.mjs run --config test/vitest/vitest.unit-fast.config.ts src/trajectory/runtime.test.ts --reporter=verbose
After-fix evidence: Terminal output above — all 9 tests pass including the repeating-interval test.
Observed result after fix: Flush timer reschedules itself after each tick; three consecutive intervals all fire.
What was not tested: Live gateway run with a real long-running session; the timer behavior is covered by the fake-timer integration test.

@openclaw-barnacle openclaw-barnacle Bot added size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 5, 2026
@clawsweeper

clawsweeper Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Summary
The PR adds an optional trajectory.flushTimeoutMs config/env surface, registers it in config typing/schema, schedules a trajectory writer flush timer, and adds a focused runtime test.

Reproducibility: yes. for the PR defect by source inspection: the PR head schedules one timeout and the callback clears it without scheduling the next tick. The added test advances only one interval, so it does not prove periodic behavior.

Real behavior proof
Needs real behavior proof before merge: The PR supplies copied unit-test output with fake timers, but not terminal/log/artifact proof of configured periodic trajectory flushing in a real setup. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
Contributor-supplied real behavior proof and maintainer approval for the new config/env surface are required before an automated repair lane should touch this PR.

Security
Cleared: The diff adds config parsing, a local timer, and tests only; it does not change dependencies, CI, package scripts, permissions, secret handling, or downloaded code paths.

Review findings

  • [P2] Reschedule periodic flush after each tick — src/trajectory/runtime.ts:306-310
  • [P3] Document the new trajectory setting — src/config/types.trajectory.ts:1-10
Review details

Best possible solution:

If maintainers accept this opt-in trajectory flush interval, implement lifecycle-safe repeating flush semantics with docs, changelog, focused tests, and real configured runtime proof; otherwise solve the linked cleanup-timeout/session latency concern in the owning cleanup or session-store path.

Do we have a high-confidence way to reproduce the issue?

Yes, for the PR defect by source inspection: the PR head schedules one timeout and the callback clears it without scheduling the next tick. The added test advances only one interval, so it does not prove periodic behavior.

Is this the best way to solve the issue?

No. The current patch is not the narrowest maintainable solution until the timer repeats correctly, the config/env surface is documented and approved, and the contributor shows real configured runtime behavior.

Full review comments:

  • [P2] Reschedule periodic flush after each tick — src/trajectory/runtime.ts:306-310
    This timeout fires once, clears flushTimer, and awaits writer.flush() without scheduling the next timeout. Runs longer than one flushTimeoutMs interval will flush at most once before final cleanup, which does not match the advertised periodic in-flight flushing behavior.
    Confidence: 0.94
  • [P3] Document the new trajectory setting — src/config/types.trajectory.ts:1-10
    This adds a user-facing config key and environment variable, but the trajectory docs and changelog do not explain the key, env var, units, default, or disk-I/O tradeoff. Operators would not have a supported reference for the new tuning surface.
    Confidence: 0.86

Overall correctness: patch is incorrect
Overall confidence: 0.91

Acceptance criteria:

  • pnpm test src/trajectory/runtime.test.ts
  • rg -n "flushTimeoutMs|OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS" docs/tools/trajectory.md CHANGELOG.md

What I checked:

  • PR head timer is single-shot: The diff schedules one timeout at recorder creation; the callback clears flushTimer and awaits writer.flush() without scheduling another tick, so a long run gets at most one timer-driven flush. (src/trajectory/runtime.ts:306, e3dd62081a3c)
  • Added test covers only first interval: The test advances fake timers once and asserts one flush call, so it would not catch the missing second and later periodic ticks. (src/trajectory/runtime.test.ts:208, e3dd62081a3c)
  • Current docs do not cover the new setting: Current trajectory docs document OPENCLAW_TRAJECTORY_DIR and OPENCLAW_TRAJECTORY, and a repo search found no existing flushTimeoutMs or OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS documentation/changelog entry on main. Public docs: docs/tools/trajectory.md. (docs/tools/trajectory.md:144, 97b07eaeaf38)
  • Writer flush only waits for queued writes: QueuedFileWriter.write() already enqueues append work immediately, and flush() only awaits the current queue, so maintainers should confirm that a timer around writer.flush() is the intended product surface. (src/agents/queued-file-writer.ts:76, 97b07eaeaf38)
  • Related cleanup timeout is a separate path: The pi-trajectory-flush cleanup step runs through the generic 10000 ms cleanup timeout; the PR adds a trajectory recorder flush interval rather than making that cleanup timeout configurable/adaptive. (src/agents/pi-embedded-runner/run/attempt.ts:3765, 97b07eaeaf38)
  • Real behavior proof is test-only: The PR body's after-fix evidence is copied output from src/trajectory/runtime.test.ts, not terminal/log/artifact proof showing a configured gateway or recorder periodically flushing trajectory data in a real run. (e3dd62081a3c)

Likely related people:

  • steipete: Recent commits bound trajectory runtime flush behavior, modified the queued writer and trajectory docs, and touched the generic cleanup-timeout path used by pi-trajectory-flush. (role: recent trajectory/runtime maintainer; confidence: high; commits: 474bea162b4d, 538605ff44d2, 470098bd26f3; files: src/trajectory/runtime.ts, src/trajectory/runtime.test.ts, src/agents/queued-file-writer.ts)
  • scoootscooob: Introduced the default-on trajectory runtime recorder and bundled trajectory export plumbing that this PR modifies. (role: introduced behavior; confidence: medium; commits: a3d9c53db299; files: src/trajectory/runtime.ts, src/trajectory/runtime.test.ts, src/agents/queued-file-writer.ts)

Remaining risk / open question:

  • The current implementation does not deliver repeated periodic flushes after the first timer fires.
  • The new trajectory.flushTimeoutMs and environment variable are user-facing config surface without docs or changelog coverage.
  • The supplied proof does not show the configured behavior running outside a fake-timer unit test.
  • The linked session latency and cleanup-timeout reports may need a cleanup/session-store fix rather than a recorder flush interval knob.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 97b07eaeaf38.

@stevenepalmer stevenepalmer force-pushed the feature/trajectory-flush-timeout branch 2 times, most recently from 3615512 to d9008e9 Compare May 6, 2026 02:28
@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 6, 2026
Pincer added 3 commits May 6, 2026 09:45
- Add trajectory.flushTimeoutMs config option in openclaw.json
- Add OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS env var fallback
- Implement periodic flush in trajectory recorder based on timeout
- Create TrajectoryConfig type definition

Closes openclaw#78126
…h timeout

Verifies that when flushTimeoutMs is set in config, the trajectory recorder
schedules a flush after the configured interval using fake timers.
@stevenepalmer stevenepalmer force-pushed the feature/trajectory-flush-timeout branch from d9008e9 to e3dd620 Compare May 6, 2026 16:47
@stevenepalmer

Copy link
Copy Markdown
Contributor Author

@joshavant CI is green on this one, and it stays tightly scoped to trajectory flush timeout behavior. Real behavior proof and focused validation are in the PR body. Would appreciate your review when you have a chance.

@stevenepalmer

Copy link
Copy Markdown
Contributor Author

Closing this to avoid duplicate/superseded churn. The linked issue state and follow-up history point to existing work already covering the remaining path here (notably #77187), so this PR should not stay open as a competing line.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

proof: supplied External PR includes structured after-fix real behavior proof. size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: sessions.list is extremely slow (4+ seconds) causing event loop saturation

1 participant