Skip to content

fix(agents): make trajectory cleanup timeout configurable#81622

Merged
BunsDev merged 1 commit into
mainfrom
meow/issue-75839-trajectory-timeout
May 14, 2026
Merged

fix(agents): make trajectory cleanup timeout configurable#81622
BunsDev merged 1 commit into
mainfrom
meow/issue-75839-trajectory-timeout

Conversation

@BunsDev

@BunsDev BunsDev commented May 14, 2026

Copy link
Copy Markdown
Member

Summary

  • Problem: sessions.list latency around 10s and fixed 10s pi-trajectory-flush timeout under moderate session load #75839 still reports pi-trajectory-flush cleanup warnings at exactly 10000 ms under slow or large session stores, and origin/main still hard-codes the generic cleanup default when no per-call timeout is passed.
  • Why it matters: operators can already relocate or disable trajectory capture, but they cannot tune the cleanup warning threshold for slower disks or larger trajectory queues without patching source.
  • What changed: runAgentCleanupStep now resolves timeouts from explicit params first, then OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS for pi-trajectory-flush, then the general OPENCLAW_AGENT_CLEANUP_TIMEOUT_MS, then the existing 10s default.
  • What did NOT change (scope boundary): this does not close the broader sessions.list half of sessions.list latency around 10s and fixed 10s pi-trajectory-flush timeout under moderate session load #75839, does not change trajectory size caps, does not add config-schema surface, and does not alter queued writer flushing semantics already present on origin/main.

Duplicate and related work triage

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Real behavior proof (required for external PRs)

  • Behavior addressed: pi-trajectory-flush cleanup no longer has an unconfigurable fixed 10000 ms timeout; operators can set OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS while the default remains 10000 ms.
  • Real environment tested: local OpenClaw checkout, Node v24.13.0 for the direct tsx runtime proof, pnpm 10.33.2, no secrets or live channel credentials involved.
  • Exact steps or command run after this patch:
    1. Ran a direct tsx invocation of runAgentCleanupStep with step: "pi-trajectory-flush", a non-resolving cleanup promise, and env: { OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS: "25" }.
    2. Ran the focused Vitest file for cleanup timeout resolution.
    3. Ran format, docs, changed-lane, type/lint/import-cycle, and diff checks listed below.
  • Evidence after fix:
$ pnpm exec tsx -e '<invoke runAgentCleanupStep with OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS=25>'
{
  "elapsedMs": 27,
  "logs": [
    "agent cleanup timed out: runId=proof-run sessionId=proof-session step=pi-trajectory-flush timeoutMs=25"
  ]
}
$ pnpm test src/agents/run-cleanup-timeout.test.ts
Test Files  1 passed (1)
Tests  7 passed (7)
[test] passed 1 Vitest shard in 2.75s
  • Observed result after fix: the runtime helper uses the trajectory-specific environment value for the pi-trajectory-flush timeout, while focused tests cover trajectory-specific env, general cleanup env, explicit timeout precedence, explicit zero normalization, invalid env fallback, default timeout, and rejection logging.
  • What was not tested: full Raspberry Pi, Windows, or long-running production gateway soak. This PR is scoped to the cleanup timeout resolver; the broader sessions.list latency remains tracked separately by sessions.list latency around 10s and fixed 10s pi-trajectory-flush timeout under moderate session load #75839 and related sessions-list PRs.
  • Before evidence (optional but encouraged): origin/main still has AGENT_CLEANUP_STEP_TIMEOUT_MS = 10_000 and runAgentCleanupStep previously used params.timeoutMs ?? AGENT_CLEANUP_STEP_TIMEOUT_MS, so the pi-trajectory-flush call had no env/config-derived override.

Root Cause (if applicable)

  • Root cause: runAgentCleanupStep only accepted an explicit timeoutMs or fell back to the 10s constant, and the pi-trajectory-flush call site does not pass an override.
  • Missing detection / guardrail: existing coverage only asserted the hard-coded default timeout and cleanup rejection behavior; it did not cover operator-configurable cleanup timeouts.
  • Contributing context (if known): current origin/main already added trajectory writer event-loop yielding and 10 MiB live capture bounds, so the remaining source-reproducible timeout issue is the cleanup helper defaulting path.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/agents/run-cleanup-timeout.test.ts
  • Scenario the test should lock in: timeout resolution honors trajectory-specific env, falls back to general cleanup env, preserves explicit caller values, and ignores invalid env values.
  • Why this is the smallest reliable guardrail: the bug lives in the cleanup timeout resolver, not in gateway startup, channel transport, or trajectory serialization.
  • Existing test that already covers this (if any): the pre-existing test covered the default timeout warning only.
  • If no new test is added, why not: N/A.

User-visible / Behavior Changes

Operators can set OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS to tune the pi-trajectory-flush cleanup timeout warning threshold. Operators can set OPENCLAW_AGENT_CLEANUP_TIMEOUT_MS as a fallback for cleanup steps that do not pass an explicit timeout. Existing default behavior remains 10 seconds.

Diagram (if applicable)

Before:
pi-trajectory-flush -> runAgentCleanupStep -> fixed 10000 ms default

After:
pi-trajectory-flush -> explicit timeout? -> trajectory env? -> general cleanup env? -> 10000 ms default

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation: N/A. The new environment values are numeric timeout controls only; invalid values are ignored, and logs include only the resolved numeric timeout.

Repro + Verification

Environment

  • OS: macOS local checkout
  • Runtime/container: local Node/pnpm workspace; direct runtime proof used Node v24.13.0, repo supports Node 22+
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS=25 for direct runtime proof

Steps

  1. Confirm current main has no trajectory cleanup timeout env override.
  2. Apply this patch.
  3. Run the direct cleanup helper proof with the trajectory env override.
  4. Run focused tests and changed-surface gates.

Expected

  • pi-trajectory-flush uses the trajectory-specific env timeout when set.
  • Invalid or missing env values preserve the 10s default.
  • Explicit timeoutMs remains highest precedence.

Actual

  • Direct runtime proof logged timeoutMs=25 for pi-trajectory-flush.
  • Focused tests and changed checks passed.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Validation run after this patch:

pnpm install
pnpm docs:list
pnpm test src/agents/run-cleanup-timeout.test.ts
pnpm exec oxfmt --check --threads=1 src/agents/run-cleanup-timeout.ts src/agents/run-cleanup-timeout.test.ts
git diff --check
pnpm changed:lanes --json
pnpm check:changed
pnpm format:docs:check docs/tools/trajectory.md

Results:

pnpm changed:lanes --json -> lanes=core, coreTests, docs; all=false
pnpm check:changed -> lanes=core, coreTests, docs; passed
pnpm format:docs:check docs/tools/trajectory.md -> Docs formatting clean

Human Verification (required)

  • Verified scenarios: direct runtime proof of the env override, focused fake-timer coverage for timeout precedence/fallbacks, docs updated for the new operator knobs.
  • Edge cases checked: explicit timeout beats env, explicit zero still normalizes to 1 ms for backward compatibility, invalid env values fall back to default, general cleanup env applies to non-trajectory cleanup steps.
  • What you did not verify: live production deployment under the reported Raspberry Pi/Windows stores; this PR avoids claiming the broader sessions.list half is fixed.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

No bot review conversations exist on this PR yet.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) Yes
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps: Optional only. Set OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS=<milliseconds> before starting OpenClaw to tune trajectory flush cleanup warnings; otherwise no action required.

Risks and Mitigations

  • Risk: operators can set an excessively high timeout and delay timeout warnings for stuck cleanup work.
    • Mitigation: default remains 10000 ms; the knob is opt-in and explicit per-call timeouts still win.
  • Risk: invalid environment values could accidentally change cleanup behavior.
    • Mitigation: invalid, empty, zero, negative, or non-finite env values are ignored and fall back to the existing default.

@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation agents Agent runtime and tooling size: S maintainer Maintainer-authored PR labels May 14, 2026
@clawsweeper

clawsweeper Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge.

Summary
The PR adds env-based timeout resolution for agent cleanup, including a trajectory-specific override, focused tests, docs, and changelog coverage.

Reproducibility: yes. Source inspection on current main shows runAgentCleanupStep hard-codes the 10,000 ms default when no timeout is passed, and the pi-trajectory-flush call site passes no override; I did not run tests in this read-only review.

Real behavior proof
Not applicable: The external proof gate does not apply to this MEMBER-authored PR; the body nevertheless includes terminal output from a direct runtime invocation showing the trajectory timeout override in use.

Next step before merge
No repair lane is needed; the remaining action is maintainer acceptance of the new env configuration surface plus normal merge validation.

Security
Cleared: The diff adds numeric environment parsing, local timeout selection, tests, docs, and changelog text without changing dependencies, CI, permissions, secret handling, network calls, or command execution.

Review details

Best possible solution:

If maintainers accept the env knob, land this focused cleanup-timeout resolver with its tests/docs while keeping the broader sessions.list work on #75839 separate.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection on current main shows runAgentCleanupStep hard-codes the 10,000 ms default when no timeout is passed, and the pi-trajectory-flush call site passes no override; I did not run tests in this read-only review.

Is this the best way to solve the issue?

Yes, conditional on accepting the env surface. Resolving this in the cleanup helper is narrower than changing trajectory writer semantics or bumping the global constant, and it keeps the sessions.list work separate.

Acceptance criteria:

  • pnpm test src/agents/run-cleanup-timeout.test.ts
  • pnpm exec oxfmt --check --threads=1 src/agents/run-cleanup-timeout.ts src/agents/run-cleanup-timeout.test.ts
  • pnpm format:docs:check docs/tools/trajectory.md
  • pnpm check:changed

What I checked:

  • Current main has fixed cleanup default: Current main defines AGENT_CLEANUP_STEP_TIMEOUT_MS as 10,000 and runAgentCleanupStep resolves to that constant when no per-call timeoutMs is supplied. (src/agents/run-cleanup-timeout.ts:3, 78eb92e62277)
  • Trajectory cleanup call passes no override: The pi-trajectory-flush cleanup call invokes runAgentCleanupStep without timeoutMs, so current main inherits the fixed 10,000 ms helper default. (src/agents/pi-embedded-runner/run/attempt.ts:4134, 78eb92e62277)
  • PR diff is focused on timeout resolution: The patch adds OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS and OPENCLAW_AGENT_CLEANUP_TIMEOUT_MS parsing in the cleanup helper, adds precedence/fallback tests, and documents the trajectory knob. (src/agents/run-cleanup-timeout.ts:4, d9886bc8a2da)
  • Related report confirms the same remaining symptom: The canonical related issue remains open and includes repeated production reports of pi-trajectory-flush timing out at exactly 10000 ms plus requests for OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS or a cleanup timeout knob.
  • History points to current cleanup and trajectory owners: GitHub path history shows @steipete added the cleanup timeout helper in 470098b and later bounded trajectory runtime flush behavior in 474bea1. (src/agents/run-cleanup-timeout.ts:1, 470098bd26f3)
  • No relevant maintainer note override found: Only a Telegram maintainer note was present; this PR touches agents and trajectory docs, not Telegram behavior. (.agents/maintainer-notes/telegram.md, 78eb92e62277)

Likely related people:

  • @steipete: Path history shows the cleanup timeout helper was added in 470098b, and recent trajectory runtime bounding work landed in 474bea1. (role: introduced behavior and recent area contributor; confidence: high; commits: 470098bd26f3, 474bea162b4d, 2e8e9cd6ca18; files: src/agents/run-cleanup-timeout.ts, src/agents/pi-embedded-runner/run/attempt.ts, src/trajectory/runtime.ts)
  • @scoootscooob: GitHub path history identifies the trajectory bundle export and default-on runtime capture work as the original trajectory runtime feature surface this timeout now affects. (role: feature introducer; confidence: medium; commits: a3d9c53db299; files: src/trajectory/runtime.ts, docs/tools/trajectory.md, src/trajectory/paths.ts)

Remaining risk / open question:

  • Maintainers should confirm the new environment variable names and scope before landing because they become operator-facing configuration.
  • No tests were run in this read-only review, so CI or the focused test command should verify the branch before merge.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 78eb92e62277.

@BunsDev BunsDev self-assigned this May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling docs Improvements or additions to documentation maintainer Maintainer-authored PR size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant