fix(agents): make trajectory cleanup timeout configurable#81622
Conversation
|
Codex review: needs maintainer review before merge. Summary Reproducibility: yes. Source inspection on current main shows runAgentCleanupStep hard-codes the 10,000 ms default when no timeout is passed, and the pi-trajectory-flush call site passes no override; I did not run tests in this read-only review. Real behavior proof Next step before merge Security Review detailsBest possible solution: If maintainers accept the env knob, land this focused cleanup-timeout resolver with its tests/docs while keeping the broader sessions.list work on #75839 separate. Do we have a high-confidence way to reproduce the issue? Yes. Source inspection on current main shows runAgentCleanupStep hard-codes the 10,000 ms default when no timeout is passed, and the pi-trajectory-flush call site passes no override; I did not run tests in this read-only review. Is this the best way to solve the issue? Yes, conditional on accepting the env surface. Resolving this in the cleanup helper is narrower than changing trajectory writer semantics or bumping the global constant, and it keeps the sessions.list work separate. Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 78eb92e62277. |
d9886bc to
d330193
Compare
Summary
pi-trajectory-flushcleanup warnings at exactly 10000 ms under slow or large session stores, andorigin/mainstill hard-codes the generic cleanup default when no per-call timeout is passed.runAgentCleanupStepnow resolves timeouts from explicit params first, thenOPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MSforpi-trajectory-flush, then the generalOPENCLAW_AGENT_CLEANUP_TIMEOUT_MS, then the existing 10s default.sessions.listhalf of sessions.list latency around 10s and fixed 10s pi-trajectory-flush timeout under moderate session load #75839, does not change trajectory size caps, does not add config-schema surface, and does not alter queued writer flushing semantics already present onorigin/main.Duplicate and related work triage
origin/mainhas severalsessions.listmitigations plus trajectory writer yielding, but the cleanup timeout remains fixed at 10s by default.origin/mainalready has the writer-yield/docs-cap pieces, while this PR implements the missing configurable timeout as a narrower replacement for that part.OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MSask.pi-trajectory-flushcleanup timeout path.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Real behavior proof (required for external PRs)
pi-trajectory-flushcleanup no longer has an unconfigurable fixed 10000 ms timeout; operators can setOPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MSwhile the default remains 10000 ms.tsxruntime proof, pnpm 10.33.2, no secrets or live channel credentials involved.tsxinvocation ofrunAgentCleanupStepwithstep: "pi-trajectory-flush", a non-resolving cleanup promise, andenv: { OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS: "25" }.pi-trajectory-flushtimeout, while focused tests cover trajectory-specific env, general cleanup env, explicit timeout precedence, explicit zero normalization, invalid env fallback, default timeout, and rejection logging.sessions.listlatency remains tracked separately by sessions.list latency around 10s and fixed 10s pi-trajectory-flush timeout under moderate session load #75839 and related sessions-list PRs.origin/mainstill hasAGENT_CLEANUP_STEP_TIMEOUT_MS = 10_000andrunAgentCleanupSteppreviously usedparams.timeoutMs ?? AGENT_CLEANUP_STEP_TIMEOUT_MS, so thepi-trajectory-flushcall had no env/config-derived override.Root Cause (if applicable)
runAgentCleanupSteponly accepted an explicittimeoutMsor fell back to the 10s constant, and thepi-trajectory-flushcall site does not pass an override.origin/mainalready added trajectory writer event-loop yielding and 10 MiB live capture bounds, so the remaining source-reproducible timeout issue is the cleanup helper defaulting path.Regression Test Plan (if applicable)
src/agents/run-cleanup-timeout.test.tsUser-visible / Behavior Changes
Operators can set
OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MSto tune thepi-trajectory-flushcleanup timeout warning threshold. Operators can setOPENCLAW_AGENT_CLEANUP_TIMEOUT_MSas a fallback for cleanup steps that do not pass an explicit timeout. Existing default behavior remains 10 seconds.Diagram (if applicable)
Security Impact (required)
Yes/No) NoYes/No) NoYes/No) NoYes/No) NoYes/No) NoYes, explain risk + mitigation: N/A. The new environment values are numeric timeout controls only; invalid values are ignored, and logs include only the resolved numeric timeout.Repro + Verification
Environment
OPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS=25for direct runtime proofSteps
Expected
pi-trajectory-flushuses the trajectory-specific env timeout when set.timeoutMsremains highest precedence.Actual
timeoutMs=25forpi-trajectory-flush.Evidence
Validation run after this patch:
Results:
Human Verification (required)
sessions.listhalf is fixed.Review Conversations
No bot review conversations exist on this PR yet.
Compatibility / Migration
Yes/No) YesYes/No) YesYes/No) NoOPENCLAW_TRAJECTORY_FLUSH_TIMEOUT_MS=<milliseconds>before starting OpenClaw to tune trajectory flush cleanup warnings; otherwise no action required.Risks and Mitigations