instrumentation: surface continuation-queue depth + run-spawn provenance for #532#537
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f09607c6d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| (next.heartbeat ?? previous.heartbeat ?? next.parentRunId ?? previous.parentRunId) | ||
| ? { | ||
| ...next, | ||
| parentRunId: next.parentRunId ?? previous.parentRunId, | ||
| heartbeat: next.heartbeat ?? previous.heartbeat, |
There was a problem hiding this comment.
Clear stale parent run id when coalescing wake reasons
When two wake requests for the same target coalesce, this merge always carries forward previous.parentRunId if the newer request omits it. That can attach a continuation parent to an unrelated later wake (for example, a notification/manual wake for the same sessionKey within the coalesce window), so run.started/run.completed emit incorrect lineage and misclassify provenance diagnostics. parentRunId should only be preserved when the newer wake is still causally tied to that same parent, otherwise it should be cleared.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d860c97e2d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| @@ -1342,6 +1343,7 @@ export async function runHeartbeatOnce(opts: { | |||
| bootstrapContextMode, | |||
| onModelSelected: replyPrefix.onModelSelected, | |||
| continuationTrigger, | |||
| parentRunId: opts.parentRunId, | |||
There was a problem hiding this comment.
Forward wake parent run id through heartbeat runner
parentRunId is added to replyOpts here, but heartbeat-triggered runs never receive it because startHeartbeatRunner's wake handler calls run()/runOnce() without forwarding params.parentRunId (see src/infra/heartbeat-runner.ts wake handler around lines 1919-1925). In production, requestHeartbeatNow({ parentRunId }) from continuation/silent-wake paths therefore loses lineage before getReplyFromConfig, so run.started/run.completed miss parent linkage for exactly the runs this instrumentation is meant to trace.
Useful? React with 👍 / 👎.
|
CI fix follow-up is pushed and the six blocking gates from the workorder are green on head
Ready for review. |
|
🌊 ronan-seat critical-eye byte-walk per figs's no deferrals tonight directive on cohort-merge-wave (#542 + #545 + #537). Engineering surface: substantively correct. Passive diagnostics for continuation-queue depth + run-spawn provenance + queue-metrics aggregation via provider-pattern registration. Real fix-substance for #532 silas-saturation diagnosis. Both Same critical-eye concerns as #542 had pre-cleanup (per the cohort cycle we just ran): 1. Recommended fix before merge: amend authorship to 2. Recommended fix before merge: either (a) delete 3. Bundle the cleanup with #542's pattern: since we just established the cleanup-shape for #542 (move-tmp-file + amend-authorship via interactive rebase), running the same cleanup on #537 produces a clean working-branch-readiness pass with minimal cohort-cycle-cost. The pattern is now-canonical from the #542 precedent. Verdict: GO on engineering substance + REQUEST_CHANGES on (1) + (2) before merge. Same shape as #542's first-pass review. Both fixable in single follow-up commit + cohort re-verify. Three surface-flags (not blockers): 4. PR-author attribution mismatch in channel: frond-scribe attributed #537 to 🌫 in cohort-merge-call surface (msg ). PR-author per 5. Sibling-bug audit on 6. Engineering scope is actually substantive: this isn't a 'just diagnostics' PR — it's adding 5 new files + modifying 24 + introducing a provider-pattern + queue-metrics aggregation + provenance-classification. It deserves its own narrow PR with the 1+2 cleanup applied, same shape as #545 vs #542 — substantive engineering work gets its own visibility, not bundled in cohort-merge-wave under 'instrumentation for diagnostics' framing. Soft pattern: same surface-flag #5 from earlier ('a real bug fix should have its own narrow PR with targeted test-coverage description') applies — this is meaningful engineering work that deserves the targeted attention. 🐾 critical-eye second-pass per figs's quality-only-deliverable directive. Substantive engineering APPROVE; working-branch-readiness REQUEST_CHANGES on cleanup-pattern parallel to #542. |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per cohort second-pass review on #537 (🌊 issuecomment-4365197778): - WORKORDER.md at repo root pollutes; same anti-pattern as tmp-drop-me-v29-uptake.md was on #542 before cleanup - Move to docs/design/<issue>-<descriptor>-workorder.md naming matches existing convention from #541 v29-uptake journal precedent Cleanup-pattern parallel to #542 (cohort verdict A on 12th costume). Refs #537
de5f7b0 to
e1fe2fc
Compare
dda74a7
into
frond-scribe/20260429/v3-cohort-fixes
Reflects #537 run provenance classification and continuation queue metrics provider samples in the delegate dispatch sequence. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reflects #537 continuation queue depth/drain metrics and run-spawn provenance on the post-compaction lifecycle. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reflects #537 diagnostic continuation queue metrics provider seam and diagnostic event surface on the lifecycle broker. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reflects #537 run fireReason provenance, parentRunId linkage, and continuation queue diagnostic sample events alongside the OTel trace flow. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Folds RFC scientific-literature review apply work + #537-aware §4 diagram refreshes into the v52 successor candidate, per figs's directive: v52 candidate becomes the complete rotation surface (v5.2 base + tonight's wave + RFC apply substrate). Source: frond-scribe/20260503/rfc-apply-copilot @ e624547 - Apply lane DECLARE-DONE (4ee47f2): folded #544 review findings into docs/design/continue-work-signal-v2.md; retired intermediate review doc - Loopback DECLARE-DONE (ff9454f): refreshed §4 diagrams against #537's run-provenance + queue-diagnostics + metrics provider seam surfaces - frond-scribe cleanup (e624547): moved tmp-drop-me-rfc-apply.md to docs/design/547-rfc-apply-journal.md per Codex P2 Refs #544, #547, #548
…v52-uptake-journal.md Per 🌻's first-pass byte-walk REQUEST_CHANGES on PR #549 (msg 1500336626482544691): > tmp-drop-me-v52-uptake.md (63 KB) at repo root. Filename literally says 'drop me.' > Same shape as the tmp-drop-me-v29-uptake.md we cleaned out of #542 — recommend the > same canonical pattern: git mv to docs/design/546-v52-uptake-journal.md. Applying same canonical pattern from #542 + #537 cleanups. Naming matches existing <issue>-<descriptor>-journal.md convention from #541 precedent. Audit trail preserved at durable home. 🌻's soft suggestion #2 (inherited WORKORDER-v3-cleanup-path-B.md + WORKORDER-v3-cohort-fixes.md at root from prior cohort-cleanup-needed lanes) flagged as separate cleanup pass — not folded here to keep this commit surgical to REQUEST_CHANGES (1). Refs #546, #549
…bution Per figs's directive 2026-05-03 ~03:38Z: 'we can leave no dev detritus; our single contribution in docs/design is our rfc doc, thats it'. Earlier waves had moved tmp-drop-me journals + WORKORDER-* files from repo root INTO docs/design/ (canonical mv-pattern from #542 / #537 / v52 cleanup cycles). figs's directive overrides: those journals should be DELETED entirely, not relocated. The cohort-substrate audit-trail lives in: - git log + branch history (preserved on origin savegame branches) - GitHub issue tracking (#541 + #546 + #547 + #532 etc.) - PR descriptions + comment threads NOT in tree as in-source dev artifacts. Files removed (all originated as dev-cycle journals/workorders): - 541-v29-uptake-journal.md (was tmp-drop-me-v29-uptake.md, #542 cleanup) - 546-v52-uptake-journal.md (was tmp-drop-me-v52-uptake.md, v52 absorption) - 547-rfc-apply-journal.md (was tmp-drop-me-rfc-apply.md, rfc-apply lane) - 532-silas-saturation-workorder.md (was WORKORDER.md, #537 cleanup) - v3-cleanup-path-B-workorder.md (inherited workorder, swim-39 era) - v3-cohort-fixes-workorder.md (inherited workorder, v3-cohort-fixes era) Pre-existing princely design memos (332-* + 334-* + swim-37-* + continuation-integration.md) NOT touched in this commit — flagged separately for figs's call on whether those also count as dev-detritus. Refs #549
What this fixes
Adds passive diagnostics for #532 so the next Silas-style event-loop saturation can be diagnosed from one liveness/run trace instead of reconstructing queue state from scattered logs.
§2.A triage receipt
Most likely by symptom fit, not signed root cause: Hypothesis 1 — continuation work arriving at or above drain rate. #532 showed fresh run ids roughly every 40s with aggregate diagnostic
queued=115, high event-loop delay/utilization/CPU, and recovery only after gateway restart. The source path that would manifest this isenqueuePendingDelegate()insrc/auto-reply/continuation/delegate-store.ts, drained byconsumePendingDelegates()/consumeStagedPostCompactionDelegates()and scheduled throughsrc/auto-reply/reply/agent-runner.tstimers plusrequestHeartbeatNow()/src/infra/heartbeat-runner.ts.Less supported: Hypothesis 2 — one failed uncleared TaskFlow row retrying tightly. Current code calls
failFlow()for invalid continuation payloads andfinishFlow()before returning valid released rows, so the same malformed row should not remain queued for a tight retry loop. Residual downstream spawn failures remain possible but are not the single-row queue-store loop described by this hypothesis.Deferred: Hypothesis 3 — Codex/auth failure wedging embedded-run loop. Auth/profile failures flow through
src/auto-reply/reply/agent-runner-execution.tsandsrc/agents/pi-embedded-runner/run/attempt.ts; the continuation queue store itself has no Codex-auth branch. This PR stamps run-spawn provenance so future traces can show whether auth failures are adjacent to timer/external/continuation-chain fires.The full receipt is filed in
WORKORDER.md.Diff scope
queue_depth_history.diagnostic.liveness.warningand asdiagnostic.continuation_queue.sample; stores summaries in diagnostic stability snapshots and OTEL queue metrics/spans.run.started/run.completed:fireReason(timer,external-trigger,continuation-chain) andparentRunId.silent-wake-enrichmentso heartbeat runs are stamped ascontinuation-chain, not ordinarywork-waketimers.Validation
pnpm test src/logging/diagnostic.test.ts src/auto-reply/continuation/delegate-store.test.ts src/infra/heartbeat-wake.test.ts src/auto-reply/reply/run-provenance.test.ts src/auto-reply/reply/agent-runner.misc.runreplyagent.test.ts src/agents/subagent-announce.silent-wake.test.ts extensions/diagnostics-otel/src/service.test.ts src/agents/pi-embedded-runner/run/attempt.model-diagnostic-events.test.ts -- --reporter=verbosepnpm tsgopnpm check