Skip to content

fix: bound trajectory runtime flush#77154

Merged
steipete merged 3 commits into
mainfrom
fix/trajectory-bounded-flush
May 4, 2026
Merged

fix: bound trajectory runtime flush#77154
steipete merged 3 commits into
mainfrom
fix/trajectory-bounded-flush

Conversation

@steipete

@steipete steipete commented May 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes #77124.

Verification

  • pnpm test src/agents/queued-file-writer.test.ts src/trajectory/runtime.test.ts src/trajectory/export.test.ts
  • Testbox pnpm check:changed: tbx_01kqs0dbsp3yy5k9cxgq5vc8jw, exit 0

@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation agents Agent runtime and tooling size: M maintainer Maintainer-authored PR labels May 4, 2026
@clawsweeper

clawsweeper Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs changes before merge.

Summary
The PR bounds trajectory runtime payloads, live capture size, queued sidecar writes, and trajectory docs/tests/changelog for the flush stall reported in #77124.

Reproducibility: yes. source-level. Current main serializes runtime events before queueing and lets cleanup continue after the 10s timeout, and the PR blocker is also source-reproducible from the writer max-file no-op path.

Next step before merge
A narrow automated repair can address the max-file/drop accounting bug and add the missing regression coverage on this PR branch.

Security
Cleared: The diff changes local diagnostic persistence limits and redaction flow but adds no dependency, workflow, package, permission, network, or secret-handling expansion.

Review findings

  • [P1] Treat max-file skips as dropped writes — src/trajectory/runtime.ts:285-292
Review details

Best possible solution:

Keep the bounded recorder/writer direction, but make the recorder or writer account for existing sidecar bytes and max-file skips before landing so the linked bug closes through this PR.

Do we have a high-confidence way to reproduce the issue?

Yes, source-level. Current main serializes runtime events before queueing and lets cleanup continue after the 10s timeout, and the PR blocker is also source-reproducible from the writer max-file no-op path.

Is this the best way to solve the issue?

No, not as written. The bounded capture approach is the right maintainable direction, but it needs to treat max-file skips or pre-existing sidecar size as dropped capture before merge.

Full review comments:

  • [P1] Treat max-file skips as dropped writes — src/trajectory/runtime.ts:285-292
    The recorder now relies on writer.write() to decide when capture stopped, but the queued writer returns "queued" before safeAppendFile() can skip the append because the existing sidecar is already at maxFileBytes. For a later turn on a near-cap or over-cap trajectory file, runtime capture can keep building and sanitizing events without emitting the intended truncation marker, leaving the long-lived-session stall path only partially fixed.
    Confidence: 0.91

Overall correctness: patch is incorrect
Overall confidence: 0.88

Acceptance criteria:

  • pnpm test src/agents/queued-file-writer.test.ts src/trajectory/runtime.test.ts src/trajectory/export.test.ts
  • pnpm check:changed in Testbox before handoff if promoted by a maintainer

What I checked:

  • Current cleanup timeout leaves cleanup running: Current main races cleanup against a 10s timeout and logs on timeout while keeping the original cleanup promise alive with a late rejection handler. (src/agents/run-cleanup-timeout.ts:20, f0537e93fbe6)
  • Current runtime serializes before queueing: Current main sanitizes/stringifies each runtime trajectory event before queueing it, then flush waits for the queued writer. (src/trajectory/runtime.ts:203, f0537e93fbe6)
  • Current queued writer can silently skip max-file appends: The writer's maxFileBytes guard returns from safeAppendFile without exposing that no append happened to callers. (src/agents/queued-file-writer.ts:92, f0537e93fbe6)
  • PR trusts writer.write as the acceptance signal: The PR increments acceptedRuntimeBytes after writer.write returns anything other than "dropped", but the queued writer still returns before the async max-file guard can skip the append. (src/trajectory/runtime.ts:285, ceb110ce1f12)
  • PR writer still returns queued before file-cap outcome: safeAppendFile can return without writing when maxFileBytes is exceeded, while write still returns "queued" immediately after scheduling the chain. (src/agents/queued-file-writer.ts:153, ceb110ce1f12)
  • Regression test misses existing sidecar size: The new capture-budget test uses an in-memory writer that always accepts writes, so it does not cover a real near-cap or over-cap existing trajectory sidecar. (src/trajectory/runtime.test.ts:120, ceb110ce1f12)

Likely related people:

  • steipete: Local blame and file history show Peter Steinberger introduced the current main trajectory runtime, queued writer, and cleanup timeout code now being modified, with earlier nearby queued-writer refactor history in the same area. (role: recent maintainer and adjacent owner; confidence: high; commits: 605e89468ebf, 817b5812e10a; files: src/trajectory/runtime.ts, src/agents/queued-file-writer.ts, src/agents/run-cleanup-timeout.ts)
  • scoootscooob: The related merged trajectory export/default-on capture PR in the provided context added the recorder/export surface whose runtime sidecar behavior is involved here. (role: introduced trajectory feature surface; confidence: medium; commits: a3d9c53db299; files: src/trajectory/runtime.ts, src/trajectory/paths.ts, src/agents/queued-file-writer.ts)

Remaining risk / open question:

  • The remaining blocker should be verified with a focused case where an existing runtime sidecar is already near or over the 10 MiB capture cap.
  • This read-only pass did not rerun the focused tests or a live 50 MiB Gateway reproduction.

Codex review notes: model gpt-5.5, reasoning high; reviewed against f0537e93fbe6.

@steipete steipete force-pushed the fix/trajectory-bounded-flush branch 2 times, most recently from c1550b6 to ae53d26 Compare May 4, 2026 06:37
@steipete steipete force-pushed the fix/trajectory-bounded-flush branch from 4a1add6 to ceb110c Compare May 4, 2026 08:06

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ceb110ce1f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/trajectory/runtime.ts
Comment on lines +285 to +289
const result = writer.write(jsonlLine);
if (result === "dropped") {
captureStopped = true;
droppedEvents += 1;
droppedEventBytes += lineBytes;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Treat max-file no-op writes as dropped during runtime capture

When a session’s runtime sidecar is already near maxRuntimeFileBytes (for example, later turns writing to the same session.trajectory.jsonl), writer.write can still return success even though the append is skipped by the writer’s max-file guard, and this branch counts it as accepted bytes. In that case captureStopped is never triggered by the real file-limit drop, so the recorder keeps sanitizing/serializing subsequent events and may never emit trace.truncated, which defeats the new “stop capture at 10 MiB” behavior under exactly the long-lived-session scenario this fix targets.

Useful? React with 👍 / 👎.

@steipete steipete merged commit 474bea1 into main May 4, 2026
94 checks passed
@steipete steipete deleted the fix/trajectory-bounded-flush branch May 4, 2026 08:48
lxe pushed a commit to lxe/openclaw that referenced this pull request May 6, 2026
* fix: bound trajectory runtime flush

* fix: keep trajectory export cap compatible

* test: keep followup delivery test pure
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
* fix: bound trajectory runtime flush

* fix: keep trajectory export cap compatible

* test: keep followup delivery test pure
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
* fix: bound trajectory runtime flush

* fix: keep trajectory export cap compatible

* test: keep followup delivery test pure
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
* fix: bound trajectory runtime flush

* fix: keep trajectory export cap compatible

* test: keep followup delivery test pure
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
* fix: bound trajectory runtime flush

* fix: keep trajectory export cap compatible

* test: keep followup delivery test pure
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling docs Improvements or additions to documentation maintainer Maintainer-authored PR size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pi-trajectory-flush: 50MB trajectory file blocks event loop for 25+ minutes after flush timeout

1 participant