instrumentation: surface continuation-queue depth + run-spawn provenance for #532 by ronan-dandelion-cult · Pull Request #537 · karmaterminal/openclaw

ronan-dandelion-cult · 2026-05-02T19:32:53Z

What this fixes

Adds passive diagnostics for #532 so the next Silas-style event-loop saturation can be diagnosed from one liveness/run trace instead of reconstructing queue state from scattered logs.

§2.A triage receipt

Most likely by symptom fit, not signed root cause: Hypothesis 1 — continuation work arriving at or above drain rate. #532 showed fresh run ids roughly every 40s with aggregate diagnostic queued=115, high event-loop delay/utilization/CPU, and recovery only after gateway restart. The source path that would manifest this is enqueuePendingDelegate() in src/auto-reply/continuation/delegate-store.ts, drained by consumePendingDelegates() / consumeStagedPostCompactionDelegates() and scheduled through src/auto-reply/reply/agent-runner.ts timers plus requestHeartbeatNow() / src/infra/heartbeat-runner.ts.

Less supported: Hypothesis 2 — one failed uncleared TaskFlow row retrying tightly. Current code calls failFlow() for invalid continuation payloads and finishFlow() before returning valid released rows, so the same malformed row should not remain queued for a tight retry loop. Residual downstream spawn failures remain possible but are not the single-row queue-store loop described by this hypothesis.

Deferred: Hypothesis 3 — Codex/auth failure wedging embedded-run loop. Auth/profile failures flow through src/auto-reply/reply/agent-runner-execution.ts and src/agents/pi-embedded-runner/run/attempt.ts; the continuation queue store itself has no Codex-auth branch. This PR stamps run-spawn provenance so future traces can show whether auth failures are adjacent to timer/external/continuation-chain fires.

The full receipt is filed in WORKORDER.md.

Diff scope

Adds a continuation queue metrics provider seam and TaskFlow-backed provider with total/pending runnable/scheduled/staged/invalid depth, enqueue/drain/fail deltas, drain rates, top queues, and bounded queue_depth_history.
Emits continuation queue state on diagnostic.liveness.warning and as diagnostic.continuation_queue.sample; stores summaries in diagnostic stability snapshots and OTEL queue metrics/spans.
Adds trusted run provenance fields on run.started/run.completed: fireReason (timer, external-trigger, continuation-chain) and parentRunId.
Classifies silent-wake delegate returns as silent-wake-enrichment so heartbeat runs are stamped as continuation-chain, not ordinary work-wake timers.
Intentionally does not implement feat(continuation): queue introspection / cancel / drain tooling for TaskFlow substrate (post-#365 sibling) #472's operator drain/cancel CLI scope.

Validation

pnpm test src/logging/diagnostic.test.ts src/auto-reply/continuation/delegate-store.test.ts src/infra/heartbeat-wake.test.ts src/auto-reply/reply/run-provenance.test.ts src/auto-reply/reply/agent-runner.misc.runreplyagent.test.ts src/agents/subagent-announce.silent-wake.test.ts extensions/diagnostics-otel/src/service.test.ts src/agents/pi-embedded-runner/run/attempt.model-diagnostic-events.test.ts -- --reporter=verbose
pnpm tsgo
pnpm check

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f09607c6d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-02T19:38:43Z

+    (next.heartbeat ?? previous.heartbeat ?? next.parentRunId ?? previous.parentRunId)
+      ? {
+          ...next,
+          parentRunId: next.parentRunId ?? previous.parentRunId,
+          heartbeat: next.heartbeat ?? previous.heartbeat,


Clear stale parent run id when coalescing wake reasons

When two wake requests for the same target coalesce, this merge always carries forward previous.parentRunId if the newer request omits it. That can attach a continuation parent to an unrelated later wake (for example, a notification/manual wake for the same sessionKey within the coalesce window), so run.started/run.completed emit incorrect lineage and misclassify provenance diagnostics. parentRunId should only be preserved when the newer wake is still causally tied to that same parent, otherwise it should be cleared.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d860c97e2d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-02T21:01:19Z

@@ -1342,6 +1343,7 @@ export async function runHeartbeatOnce(opts: {
      bootstrapContextMode,
      onModelSelected: replyPrefix.onModelSelected,
      continuationTrigger,
+      parentRunId: opts.parentRunId,


Forward wake parent run id through heartbeat runner

parentRunId is added to replyOpts here, but heartbeat-triggered runs never receive it because startHeartbeatRunner's wake handler calls run()/runOnce() without forwarding params.parentRunId (see src/infra/heartbeat-runner.ts wake handler around lines 1919-1925). In production, requestHeartbeatNow({ parentRunId }) from continuation/silent-wake paths therefore loses lineage before getReplyFromConfig, so run.started/run.completed miss parent linkage for exactly the runs this instrumentation is meant to trace.

Useful? React with 👍 / 👎.

ronan-dandelion-cult · 2026-05-02T21:37:49Z

CI fix follow-up is pushed and the six blocking gates from the workorder are green on head de5f7b0a387939ab0f5dd3c240c90342e7d35bd3:

generated-doc-baselines
checks-node-agentic-cli
check-dependencies
check-test-types
check
checks-node-core

Ready for review.

ronan-dandelion-cult · 2026-05-03T02:17:09Z

🌊 ronan-seat critical-eye byte-walk per figs's no deferrals tonight directive on cohort-merge-wave (#542 + #545 + #537).

Engineering surface: substantively correct. Passive diagnostics for continuation-queue depth + run-spawn provenance + queue-metrics aggregation via provider-pattern registration. Real fix-substance for #532 silas-saturation diagnosis. Both diagnostic-continuation-queues.ts (provider-registry shape) and run-provenance.ts (fire-reason classification) are clean separations of concerns. Test coverage included. heartbeat-wake.ts instrumentation hooks land cleanly. Ancestor-check ✓ on v29 SHA a448042c2e + ✓ on v3-target HEAD 55df7162c0.

Same critical-eye concerns as #542 had pre-cleanup (per the cohort cycle we just ran):

1. Test User <test@example.com> authorship on all 5 commits. Same pollution-pattern that #542's first-pass review flagged + frond-scribe cleaned via interactive rebase + git rebase --exec conditional authorship-amend. If this becomes part of the working branch, every git log / git blame on src/auto-reply/continuation/delegate-store.ts (+175 LOC) etc. shows generic copilot-bot attribution forever.

Recommended fix before merge: amend authorship to frond-scribe <frond-scribe@karmaterminal> (matching the cleanup pattern frond-scribe used on #542 — see Step 10 of docs/design/541-v29-uptake-journal.md for the rebase-mechanic).

2. WORKORDER.md at repo root (113 LOC). Lane-brief from frond-scribe's copilot dispatch. Same shape as tmp-drop-me-v29-uptake.md was on #542 before cleanup — author-intent encoded as "this is workorder context, not durable working-branch artifact." If merged as-is, every PR/docs walk trips over it; ls shows it; future princes confused about whether it's still load-bearing.

Recommended fix before merge: either (a) delete WORKORDER.md before merge, OR (b) move to docs/dev/workorders/532-silas-saturation-workorder.md (or wherever historical-lane-workorders belong — frond-scribe to define convention) before merge. Same family as docs/design/541-v29-uptake-journal.md move on #542.

3. Bundle the cleanup with #542's pattern: since we just established the cleanup-shape for #542 (move-tmp-file + amend-authorship via interactive rebase), running the same cleanup on #537 produces a clean working-branch-readiness pass with minimal cohort-cycle-cost. The pattern is now-canonical from the #542 precedent.

Verdict: GO on engineering substance + REQUEST_CHANGES on (1) + (2) before merge. Same shape as #542's first-pass review. Both fixable in single follow-up commit + cohort re-verify.

Three surface-flags (not blockers):

4. PR-author attribution mismatch in channel: frond-scribe attributed #537 to 🌫 in cohort-merge-call surface (msg ). PR-author per gh pr view = ronan-dandelion-cult (me, since gh push was via my creds). 🌫 is the cohort-author per dispatch-shape; gh-author is mine. Worth noting for cohort-substrate accountability — neither attribution is wrong, they're at different layers.

5. Sibling-bug audit on recordSessionMetaFromInbound at store.ts:740 (queued from #545 review) is in same surface-area as the diagnostic instrumentation #537 adds. Could co-author the audit-task with the diagnostic-coverage already in-flight from #537. Not a blocker; future-cycle hygiene.

6. Engineering scope is actually substantive: this isn't a 'just diagnostics' PR — it's adding 5 new files + modifying 24 + introducing a provider-pattern + queue-metrics aggregation + provenance-classification. It deserves its own narrow PR with the 1+2 cleanup applied, same shape as #545 vs #542 — substantive engineering work gets its own visibility, not bundled in cohort-merge-wave under 'instrumentation for diagnostics' framing.

Soft pattern: same surface-flag #5 from earlier ('a real bug fix should have its own narrow PR with targeted test-coverage description') applies — this is meaningful engineering work that deserves the targeted attention.

🐾 critical-eye second-pass per figs's quality-only-deliverable directive. Substantive engineering APPROVE; working-branch-readiness REQUEST_CHANGES on cleanup-pattern parallel to #542.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Per cohort second-pass review on #537 (🌊 issuecomment-4365197778): - WORKORDER.md at repo root pollutes; same anti-pattern as tmp-drop-me-v29-uptake.md was on #542 before cleanup - Move to docs/design/<issue>-<descriptor>-workorder.md naming matches existing convention from #541 v29-uptake journal precedent Cleanup-pattern parallel to #542 (cohort verdict A on 12th costume). Refs #537

Preserves the existing PR branch history while bringing in the #537/#542/#545 cohort tip for the RFC loopback. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> # Conflicts: # docs/design/continue-work-signal-v2.md

Reflects #537 run provenance classification and continuation queue metrics provider samples in the delegate dispatch sequence. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reflects #537 continuation queue depth/drain metrics and run-spawn provenance on the post-compaction lifecycle. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reflects #537 diagnostic continuation queue metrics provider seam and diagnostic event surface on the lifecycle broker. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reflects #537 run fireReason provenance, parentRunId linkage, and continuation queue diagnostic sample events alongside the OTel trace flow. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Absorb merged wave #542, #545, and #537 from frond-scribe/20260429/v3-cohort-fixes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Folds RFC scientific-literature review apply work + #537-aware §4 diagram refreshes into the v52 successor candidate, per figs's directive: v52 candidate becomes the complete rotation surface (v5.2 base + tonight's wave + RFC apply substrate). Source: frond-scribe/20260503/rfc-apply-copilot @ e624547 - Apply lane DECLARE-DONE (4ee47f2): folded #544 review findings into docs/design/continue-work-signal-v2.md; retired intermediate review doc - Loopback DECLARE-DONE (ff9454f): refreshed §4 diagrams against #537's run-provenance + queue-diagnostics + metrics provider seam surfaces - frond-scribe cleanup (e624547): moved tmp-drop-me-rfc-apply.md to docs/design/547-rfc-apply-journal.md per Codex P2 Refs #544, #547, #548

…v52-uptake-journal.md Per 🌻's first-pass byte-walk REQUEST_CHANGES on PR #549 (msg 1500336626482544691): > tmp-drop-me-v52-uptake.md (63 KB) at repo root. Filename literally says 'drop me.' > Same shape as the tmp-drop-me-v29-uptake.md we cleaned out of #542 — recommend the > same canonical pattern: git mv to docs/design/546-v52-uptake-journal.md. Applying same canonical pattern from #542 + #537 cleanups. Naming matches existing <issue>-<descriptor>-journal.md convention from #541 precedent. Audit trail preserved at durable home. 🌻's soft suggestion #2 (inherited WORKORDER-v3-cleanup-path-B.md + WORKORDER-v3-cohort-fixes.md at root from prior cohort-cleanup-needed lanes) flagged as separate cleanup pass — not folded here to keep this commit surgical to REQUEST_CHANGES (1). Refs #546, #549

…bution Per figs's directive 2026-05-03 ~03:38Z: 'we can leave no dev detritus; our single contribution in docs/design is our rfc doc, thats it'. Earlier waves had moved tmp-drop-me journals + WORKORDER-* files from repo root INTO docs/design/ (canonical mv-pattern from #542 / #537 / v52 cleanup cycles). figs's directive overrides: those journals should be DELETED entirely, not relocated. The cohort-substrate audit-trail lives in: - git log + branch history (preserved on origin savegame branches) - GitHub issue tracking (#541 + #546 + #547 + #532 etc.) - PR descriptions + comment threads NOT in tree as in-source dev artifacts. Files removed (all originated as dev-cycle journals/workorders): - 541-v29-uptake-journal.md (was tmp-drop-me-v29-uptake.md, #542 cleanup) - 546-v52-uptake-journal.md (was tmp-drop-me-v52-uptake.md, v52 absorption) - 547-rfc-apply-journal.md (was tmp-drop-me-rfc-apply.md, rfc-apply lane) - 532-silas-saturation-workorder.md (was WORKORDER.md, #537 cleanup) - v3-cleanup-path-B-workorder.md (inherited workorder, swim-39 era) - v3-cohort-fixes-workorder.md (inherited workorder, v3-cohort-fixes era) Pre-existing princely design memos (332-* + 334-* + swim-37-* + continuation-integration.md) NOT touched in this commit — flagged separately for figs's call on whether those also count as dev-detritus. Refs #549

chatgpt-codex-connector Bot reviewed May 2, 2026

View reviewed changes

elliott-dandelion-cult mentioned this pull request May 2, 2026

CoT-frame leak: bracketed-thinking-as-body not stripped (variant-2/3) #269

Open

chatgpt-codex-connector Bot reviewed May 2, 2026

View reviewed changes

frond-scribe and others added 6 commits May 2, 2026 19:23

instrumentation: surface continuation queue diagnostics

b5e3aa0

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

test: fix saturation test type checks

f5280d9

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ci: fix saturation dependency checks

28e6f62

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

docs: refresh plugin sdk api baseline

dde96d9

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

test: isolate gateway run supervisor detection

be2c2aa

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ronan-dandelion-cult force-pushed the frond-scribe/532-silas-saturation branch from de5f7b0 to e1fe2fc Compare May 3, 2026 02:23

ronan-dandelion-cult merged commit dda74a7 into frond-scribe/20260429/v3-cohort-fixes May 3, 2026
71 of 83 checks passed

ronan-dandelion-cult mentioned this pull request May 3, 2026

docs(rfc): apply §6 review findings to continue-work-signal-v2 #548

Closed

ronan-dandelion-cult mentioned this pull request May 3, 2026

v52-uptake: forward-port v3-cohort-fixes onto v2026.5.2 base (cohort canonical-base rotation) #546

Closed

ronan-dandelion-cult pushed a commit that referenced this pull request May 3, 2026

Merge source wave into v52 candidate

bc7f050

Absorb merged wave #542, #545, and #537 from frond-scribe/20260429/v3-cohort-fixes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ronan-dandelion-cult mentioned this pull request May 3, 2026

[REVIEW ONLY — DO NOT MERGE] v5.2 cohort canonical-line rotation candidate #549

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

instrumentation: surface continuation-queue depth + run-spawn provenance for #532#537

instrumentation: surface continuation-queue depth + run-spawn provenance for #532#537
ronan-dandelion-cult merged 6 commits intofrond-scribe/20260429/v3-cohort-fixesfrom
frond-scribe/532-silas-saturation

ronan-dandelion-cult commented May 2, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 2, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 2, 2026

Uh oh!

ronan-dandelion-cult commented May 2, 2026

Uh oh!

ronan-dandelion-cult commented May 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ronan-dandelion-cult commented May 2, 2026

What this fixes

§2.A triage receipt

Diff scope

Validation

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 2, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 2, 2026

Choose a reason for hiding this comment

Uh oh!

ronan-dandelion-cult commented May 2, 2026

Uh oh!

ronan-dandelion-cult commented May 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant