fix(agents): prioritize manual session turns#82765
Conversation
|
Codex review: needs maintainer review before merge. Summary Reproducibility: yes. source-level: current main queues same-lane entries FIFO and the linked issue provides redacted queue logs showing a manual follow-up waiting while background/session work occupied the lane. I did not rerun the full live Control UI plus cron scenario. Real behavior proof Next step before merge Security Review detailsBest possible solution: Land this PR or an equivalent internal scheduler change after maintainer approval, preserving active-run non-preemption and FIFO within priority. Do we have a high-confidence way to reproduce the issue? Yes, source-level: current main queues same-lane entries FIFO and the linked issue provides redacted queue logs showing a manual follow-up waiting while background/session work occupied the lane. I did not rerun the full live Control UI plus cron scenario. Is this the best way to solve the issue? Likely yes: centralizing priority selection in Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against e98ebb5739c4. |
Stronger redacted log proofThe original incident is backed by both the session trace summary and raw gateway queue diagnostics. This is a before-fix proof of the reported behavior; it shows a real same-session queue ordering problem, not a permanent hang.
Redactions applied: exact session key, run/session ids, local session file path, private reviewed item URL, and setup-specific scheduled job name. |
|
@clawsweeper re-review Prepared the proof gate after adding the Azure Crabbox evidence to the PR body. Behavior addressed: manual/user embedded session work is selected ahead of queued cron/background work once the active session-lane task releases. Observed result after fix: with one active session-lane blocker, queued foreground work ran before normal and background work after release. The focused embedded-run test also verifies user-triggered session work is marked foreground and cron-triggered session work is marked background. Latest proof gate: Real behavior proof run 25975977185, job 76356055240, passed. |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
1efa62a to
cc80630
Compare
cc80630 to
e0d4f2b
Compare
|
Pre-merge verification for current head
Known proof gap: live Control UI/gateway plus scheduled cron repro was not rerun; verification covers the changed queue/runtime seam, PR checks, focused tests, and the after-fix queue ordering proof. |
* fix(gateway): clear CLI bindings on session reset * fix(gateway): preserve spawned sessions in configured lists * fix(channels): clear canonical stale routes * fix(telegram): preserve forum topic origin targets * fix(agents): skip fallback for session coordination errors * fix(agents): persist subagent registry before returning accepted (openclaw#83132) (openclaw#83238) * fix(memory): catch up stale sessions on startup (openclaw#82341) * fix(memory): preserve qmd lexical search for hyphenated queries (openclaw#81423) * fix(anthropic): preserve Claude image capability (openclaw#83756) * fix(agents): exclude tool result details from guard budget (openclaw#75525) * fix(provider): use Together video API endpoint * fix(telegram): preserve implicit default account (openclaw#82794) * fix(gateway): allow trusted-proxy local-direct password fallback (openclaw#82953) * fix(discord): return subagent thread delivery origin * fix: add missing prerequisites for upstream-ported fixes Add SessionWriteLockTimeoutError class and hasSessionWriteLockTimeout helper needed by the ported fix(agents) skip-fallback commit. Remove route property references from session-delivery.ts that don't exist in gemmaclaw's SessionEntry type. Add authorizePasswordAuth helper that was present in upstream but missing from gemmaclaw's auth.ts. * fix: remove route assertions incompatible with gemmaclaw SessionEntry Remove test assertions using .route property that exists in upstream's SessionEntry type but not in gemmaclaw's, restoring typecheck green. * fix(memory-core): yield event loop during fallback vector search (openclaw#81172) (openclaw#83758) Summary: - The branch changes memory-core fallback vector search to scan chunks in 256-row rowid batches with `setImmediate` yields, updates regression tests, and adds a changelog entry. - Reproducibility: yes. from source and supplied live output. Current main synchronously scans fallback vector ... and the PR body shows the before/after heartbeat behavior through the actual `searchVector` fallback path. Automerge notes: - PR branch already contained follow-up commit before automerge: test(memory-core): add boundary, parity, and concurrent-insert covera… - PR branch already contained follow-up commit before automerge: fix(memory-core): yield event loop during fallback vector search (#81… Validation: - ClawSweeper review passed for head 0ede3d7. - Required merge gates passed before the squash merge. Prepared head SHA: 0ede3d7 Review: openclaw#83758 (comment) Co-authored-by: NW <nitinwadhawan66@gmail.com> Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com> Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com> Approved-by: takhoffman Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com> * fix(subagents): collect unresolved announce batches (openclaw#83701) Summary: - The PR changes collect-mode follow-up queue routing so unresolved-origin items can batch with a single resolved route and later compatible items can resume batching after a true cross-channel drain. - Reproducibility: yes. at source level: current main treats unkeyed-plus-same-keyed queue items as cross-chan ... failing path is directly visible in `src/utils/queue-helpers.ts` and `src/auto-reply/reply/queue/drain.ts`. Automerge notes: - PR branch already contained follow-up commit before automerge: Merge remote-tracking branch 'origin/main' into maint-83701-20260518 Validation: - ClawSweeper review passed for head e6ad029. - Required merge gates passed before the squash merge. Prepared head SHA: e6ad029 Review: openclaw#83701 (comment) Co-authored-by: Andy Ye <35905412+TurboTheTurtle@users.noreply.github.com> Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com> Approved-by: takhoffman Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com> * fix(config): accept gateway remote port * fix: restore Array<{}> closing bracket in manager-search.ts Cherry-pick 68b3729 accidentally dropped the '>' from '}>', producing a syntax error. Restore '}>;' as it was in origin/main. * fix: add remotePort to GatewayRemoteConfig and GatewayRemoteConfigSchema * fix(agents): prioritize manual session turns (openclaw#82765) * fix(agents): prioritize manual session turns * docs: update changelog for session priority --------- Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com> * revert: fix(agents): prioritize manual session turns (openclaw#82765) - upstream deps not in gemmaclaw * fix: resolve undefined variable errors in cherry-picked extension code * fix(tui): preserve draft while chat is busy * fix(tui): add pendingChatRunId to TuiStateAccess for cherry-picked tui commit * fix(memory-wiki): make wiki_lint tool output path-safe (openclaw#83687) * fix(ui): render session-scoped tool events (openclaw#83734) * chore: regenerate base config schema after upstream cherry-picks * fix(agents): add persistSubagentRunsToDiskOrThrow to subagent-registry test mock New export added to subagent-registry-state.ts was missing from the vi.mock definition, causing all tests in the suite to skip and the module to fail to load. * fix(telegram): wire buildTelegramInboundOriginTarget into session context Cherry-pick 675e053 added the helper and the test assertion but did not update bot-message-context.session.ts to use it. OriginatingTo now correctly includes :topic:<id> for forum groups. * fix(memory): correct session path format in startup-catchup test sessionPathForFile returns sessions/<basename> (no agent dir), but the cherry-picked test used sessions/main/<basename>. The clean-file test always failed because the path mismatch made every file look unindexed. * fix(together): update video generation test URL from v1 to v2 The source uses TOGETHER_VIDEO_BASE_URL = https://api.together.xyz/v2 but the cherry-picked test still asserted the old v1 URL. --------- Co-authored-by: nitinjwadhawan <nitinwadhawan66@gmail.com> Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com> Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com> Co-authored-by: Andy Ye <35905412+TurboTheTurtle@users.noreply.github.com> Co-authored-by: Peter Steinberger <steipete@gmail.com> Co-authored-by: Galin Iliev <iliev@galcho.com> Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com> Co-authored-by: Harry Xie <harryhsieh963@yahoo.com>
* fix(agents): prioritize manual session turns * docs: update changelog for session priority --------- Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
* fix(agents): prioritize manual session turns * docs: update changelog for session priority --------- Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
* fix(agents): prioritize manual session turns * docs: update changelog for session priority --------- Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
* fix(agents): prioritize manual session turns * docs: update changelog for session priority --------- Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
* fix(agents): prioritize manual session turns * docs: update changelog for session priority --------- Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
* fix(agents): prioritize manual session turns * docs: update changelog for session priority --------- Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
* fix(agents): prioritize manual session turns * docs: update changelog for session priority --------- Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
* fix(agents): prioritize manual session turns * docs: update changelog for session priority --------- Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
* fix(agents): prioritize manual session turns * docs: update changelog for session priority --------- Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
* fix(agents): prioritize manual session turns * docs: update changelog for session priority --------- Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
* fix(agents): prioritize manual session turns * docs: update changelog for session priority --------- Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
* fix(agents): prioritize manual session turns * docs: update changelog for session priority --------- Co-authored-by: Galin Iliev <Galin.Iliev@microsoft.com>
Summary
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Real behavior proof (required for external PRs)
Behavior addressed: Manual/user embedded turns are selected ahead of already queued cron/maintenance turns in the same session lane once the active turn releases the lane.
Real environment tested: Azure Crabbox Linux lease
swift-hermit/cbx_92ae6d7312ee, remote workdir/work/crabbox/cbx_92ae6d7312ee/openclaw-pr-82765, evidence head501421bdfacc41b46150b2844a6b770552ebee29; current prepared heade0d4f2b4b792772cc29adbd67a236799047d8930is a rebase/changelog refresh with the focused regression tests rerun locally. Fix code path unchanged except rebase SHA.Exact steps or command run after this patch:
/home/galini/.codex/skills/crabbox-oc-evidence/scripts/collect-openclaw-evidence.sh --id swift-hermit --name pr-82765-manual-session-priority-rerun --skip-bootstrap --skip-install --test src/process/command-queue.test.ts --test src/agents/pi-embedded-runner/run.overflow-compaction.test.ts --command <queue-priority-probe>Evidence after fix: Console output copied from the Azure Crabbox after-fix run:
Observed result after fix: The copied console output shows one active session-lane blocker completing first, then queued work draining in priority order after release: foreground first, normal second, background last. The focused embedded-run regression also verified user-triggered session work is marked foreground and cron-triggered session work is marked background.
What was not tested: A live Control UI/gateway plus scheduled cron repro was not rerun; this proof exercises the changed queue/runtime seam and focused regression tests on the PR head.
Before evidence (optional but encouraged): Redacted local evidence in #82764 and the raw gateway logs proves the ordering/timing:
Queued (1)while the same session showedCompacting context....[assistant turn failed before producing content]with an over-context error; line 1149 immediately recorded compaction.sessions_yieldprogress withdisplay=false.gateway-dev-1.log:34179recordedmessage queued ... source=pi-embedded-runner queueDepth=2 sessionState=processing.gateway-dev-1.log:34189recorded the active run completing withqueueDepth=1;gateway-dev-1.log:34204recorded a new run starting withqueueDepth=1.gateway-dev-1.log:34196recordedlane wait exceeded ... waitedMs=20967 queueAhead=1;gateway-dev-1.log:34235recordedlane wait exceeded ... waitedMs=35984 queueAhead=0.gateway-dev-1.log:34334recordedcompactionSummary:1;gateway-dev-1.log:34411andgateway-dev-1.log:34417recorded the later drain toqueueDepth=0and lanequeued=0.This establishes the before-fix behavior: a human-visible same-session follow-up could be queued during compaction while background/session work still occupied the lane, then the queue eventually drained, making the UI state misleading rather than permanently stuck.
Root Cause (if applicable)
sessions_yieldprogress made the visible UI state look idle/stuck while queued work continued internally.Regression Test Plan (if applicable)
src/process/command-queue.test.ts;src/agents/pi-embedded-runner/run.overflow-compaction.test.tsUser-visible / Behavior Changes
Manual/user follow-ups in a session are favored over background cron/maintenance turns that are queued but not yet active on the same session lane.
Diagram (if applicable)
Security Impact (required)
Yes/No) NoYes/No) NoYes/No) NoYes/No) NoYes/No) NoYes, explain risk + mitigation: N/ARepro + Verification
Environment
Steps
Expected
Actual
Evidence
Attach at least one:
Stronger redacted log proof
The original incident is backed by both the session trace summary and raw gateway queue diagnostics. This is a before-fix proof of the reported behavior; it shows a real same-session queue ordering problem, not a permanent hang.
Queued (1)while the same session showedCompacting context...[assistant turn failed before producing content]with an over-context error; line 1149 immediately recorded a compaction entrysessions_yieldprogress withdisplay=falsegateway-dev-1.log:34179:message queued ... source=pi-embedded-runner queueDepth=2 sessionState=processinggateway-dev-1.log:34189:run_completed queueDepth=1;gateway-dev-1.log:34204:run_started queueDepth=1gateway-dev-1.log:34196:lane wait exceeded ... waitedMs=20967 queueAhead=1;gateway-dev-1.log:34235:lane wait exceeded ... waitedMs=35984 queueAhead=0gateway-dev-1.log:34334: pre-prompt context diagnostic includedcompactionSummary:1;gateway-dev-1.log:34411andgateway-dev-1.log:34417:queueDepth=0and lanequeued=0Redactions applied: exact session key, run/session ids, local session file path, private reviewed item URL, and setup-specific scheduled job name.
Human Verification (required)
What you personally verified (not just CI), and how:
git diff --checkpassed.Review Conversations
If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.
Compatibility / Migration
Yes/No) YesYes/No) NoYes/No) NoRisks and Mitigations