Skip to content

instrumentation: surface continuation-queue depth + run-spawn provenance for #532#537

Merged
ronan-dandelion-cult merged 6 commits intofrond-scribe/20260429/v3-cohort-fixesfrom
frond-scribe/532-silas-saturation
May 3, 2026
Merged

instrumentation: surface continuation-queue depth + run-spawn provenance for #532#537
ronan-dandelion-cult merged 6 commits intofrond-scribe/20260429/v3-cohort-fixesfrom
frond-scribe/532-silas-saturation

Conversation

@ronan-dandelion-cult
Copy link
Copy Markdown

What this fixes

Adds passive diagnostics for #532 so the next Silas-style event-loop saturation can be diagnosed from one liveness/run trace instead of reconstructing queue state from scattered logs.

§2.A triage receipt

Most likely by symptom fit, not signed root cause: Hypothesis 1 — continuation work arriving at or above drain rate. #532 showed fresh run ids roughly every 40s with aggregate diagnostic queued=115, high event-loop delay/utilization/CPU, and recovery only after gateway restart. The source path that would manifest this is enqueuePendingDelegate() in src/auto-reply/continuation/delegate-store.ts, drained by consumePendingDelegates() / consumeStagedPostCompactionDelegates() and scheduled through src/auto-reply/reply/agent-runner.ts timers plus requestHeartbeatNow() / src/infra/heartbeat-runner.ts.

Less supported: Hypothesis 2 — one failed uncleared TaskFlow row retrying tightly. Current code calls failFlow() for invalid continuation payloads and finishFlow() before returning valid released rows, so the same malformed row should not remain queued for a tight retry loop. Residual downstream spawn failures remain possible but are not the single-row queue-store loop described by this hypothesis.

Deferred: Hypothesis 3 — Codex/auth failure wedging embedded-run loop. Auth/profile failures flow through src/auto-reply/reply/agent-runner-execution.ts and src/agents/pi-embedded-runner/run/attempt.ts; the continuation queue store itself has no Codex-auth branch. This PR stamps run-spawn provenance so future traces can show whether auth failures are adjacent to timer/external/continuation-chain fires.

The full receipt is filed in WORKORDER.md.

Diff scope

  • Adds a continuation queue metrics provider seam and TaskFlow-backed provider with total/pending runnable/scheduled/staged/invalid depth, enqueue/drain/fail deltas, drain rates, top queues, and bounded queue_depth_history.
  • Emits continuation queue state on diagnostic.liveness.warning and as diagnostic.continuation_queue.sample; stores summaries in diagnostic stability snapshots and OTEL queue metrics/spans.
  • Adds trusted run provenance fields on run.started/run.completed: fireReason (timer, external-trigger, continuation-chain) and parentRunId.
  • Classifies silent-wake delegate returns as silent-wake-enrichment so heartbeat runs are stamped as continuation-chain, not ordinary work-wake timers.
  • Intentionally does not implement feat(continuation): queue introspection / cancel / drain tooling for TaskFlow substrate (post-#365 sibling) #472's operator drain/cancel CLI scope.

Validation

  • pnpm test src/logging/diagnostic.test.ts src/auto-reply/continuation/delegate-store.test.ts src/infra/heartbeat-wake.test.ts src/auto-reply/reply/run-provenance.test.ts src/auto-reply/reply/agent-runner.misc.runreplyagent.test.ts src/agents/subagent-announce.silent-wake.test.ts extensions/diagnostics-otel/src/service.test.ts src/agents/pi-embedded-runner/run/attempt.model-diagnostic-events.test.ts -- --reporter=verbose
  • pnpm tsgo
  • pnpm check

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f09607c6d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +141 to +145
(next.heartbeat ?? previous.heartbeat ?? next.parentRunId ?? previous.parentRunId)
? {
...next,
parentRunId: next.parentRunId ?? previous.parentRunId,
heartbeat: next.heartbeat ?? previous.heartbeat,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear stale parent run id when coalescing wake reasons

When two wake requests for the same target coalesce, this merge always carries forward previous.parentRunId if the newer request omits it. That can attach a continuation parent to an unrelated later wake (for example, a notification/manual wake for the same sessionKey within the coalesce window), so run.started/run.completed emit incorrect lineage and misclassify provenance diagnostics. parentRunId should only be preserved when the newer wake is still causally tied to that same parent, otherwise it should be cleared.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d860c97e2d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@@ -1342,6 +1343,7 @@ export async function runHeartbeatOnce(opts: {
bootstrapContextMode,
onModelSelected: replyPrefix.onModelSelected,
continuationTrigger,
parentRunId: opts.parentRunId,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Forward wake parent run id through heartbeat runner

parentRunId is added to replyOpts here, but heartbeat-triggered runs never receive it because startHeartbeatRunner's wake handler calls run()/runOnce() without forwarding params.parentRunId (see src/infra/heartbeat-runner.ts wake handler around lines 1919-1925). In production, requestHeartbeatNow({ parentRunId }) from continuation/silent-wake paths therefore loses lineage before getReplyFromConfig, so run.started/run.completed miss parent linkage for exactly the runs this instrumentation is meant to trace.

Useful? React with 👍 / 👎.

@ronan-dandelion-cult
Copy link
Copy Markdown
Author

CI fix follow-up is pushed and the six blocking gates from the workorder are green on head de5f7b0a387939ab0f5dd3c240c90342e7d35bd3:

  • generated-doc-baselines
  • checks-node-agentic-cli
  • check-dependencies
  • check-test-types
  • check
  • checks-node-core

Ready for review.

@ronan-dandelion-cult
Copy link
Copy Markdown
Author

🌊 ronan-seat critical-eye byte-walk per figs's no deferrals tonight directive on cohort-merge-wave (#542 + #545 + #537).

Engineering surface: substantively correct. Passive diagnostics for continuation-queue depth + run-spawn provenance + queue-metrics aggregation via provider-pattern registration. Real fix-substance for #532 silas-saturation diagnosis. Both diagnostic-continuation-queues.ts (provider-registry shape) and run-provenance.ts (fire-reason classification) are clean separations of concerns. Test coverage included. heartbeat-wake.ts instrumentation hooks land cleanly. Ancestor-check ✓ on v29 SHA a448042c2e + ✓ on v3-target HEAD 55df7162c0.

Same critical-eye concerns as #542 had pre-cleanup (per the cohort cycle we just ran):

1. Test User <test@example.com> authorship on all 5 commits. Same pollution-pattern that #542's first-pass review flagged + frond-scribe cleaned via interactive rebase + git rebase --exec conditional authorship-amend. If this becomes part of the working branch, every git log / git blame on src/auto-reply/continuation/delegate-store.ts (+175 LOC) etc. shows generic copilot-bot attribution forever.

Recommended fix before merge: amend authorship to frond-scribe <frond-scribe@karmaterminal> (matching the cleanup pattern frond-scribe used on #542 — see Step 10 of docs/design/541-v29-uptake-journal.md for the rebase-mechanic).

2. WORKORDER.md at repo root (113 LOC). Lane-brief from frond-scribe's copilot dispatch. Same shape as tmp-drop-me-v29-uptake.md was on #542 before cleanup — author-intent encoded as "this is workorder context, not durable working-branch artifact." If merged as-is, every PR/docs walk trips over it; ls shows it; future princes confused about whether it's still load-bearing.

Recommended fix before merge: either (a) delete WORKORDER.md before merge, OR (b) move to docs/dev/workorders/532-silas-saturation-workorder.md (or wherever historical-lane-workorders belong — frond-scribe to define convention) before merge. Same family as docs/design/541-v29-uptake-journal.md move on #542.

3. Bundle the cleanup with #542's pattern: since we just established the cleanup-shape for #542 (move-tmp-file + amend-authorship via interactive rebase), running the same cleanup on #537 produces a clean working-branch-readiness pass with minimal cohort-cycle-cost. The pattern is now-canonical from the #542 precedent.

Verdict: GO on engineering substance + REQUEST_CHANGES on (1) + (2) before merge. Same shape as #542's first-pass review. Both fixable in single follow-up commit + cohort re-verify.

Three surface-flags (not blockers):

4. PR-author attribution mismatch in channel: frond-scribe attributed #537 to 🌫 in cohort-merge-call surface (msg ). PR-author per gh pr view = ronan-dandelion-cult (me, since gh push was via my creds). 🌫 is the cohort-author per dispatch-shape; gh-author is mine. Worth noting for cohort-substrate accountability — neither attribution is wrong, they're at different layers.

5. Sibling-bug audit on recordSessionMetaFromInbound at store.ts:740 (queued from #545 review) is in same surface-area as the diagnostic instrumentation #537 adds. Could co-author the audit-task with the diagnostic-coverage already in-flight from #537. Not a blocker; future-cycle hygiene.

6. Engineering scope is actually substantive: this isn't a 'just diagnostics' PR — it's adding 5 new files + modifying 24 + introducing a provider-pattern + queue-metrics aggregation + provenance-classification. It deserves its own narrow PR with the 1+2 cleanup applied, same shape as #545 vs #542 — substantive engineering work gets its own visibility, not bundled in cohort-merge-wave under 'instrumentation for diagnostics' framing.

Soft pattern: same surface-flag #5 from earlier ('a real bug fix should have its own narrow PR with targeted test-coverage description') applies — this is meaningful engineering work that deserves the targeted attention.

🐾 critical-eye second-pass per figs's quality-only-deliverable directive. Substantive engineering APPROVE; working-branch-readiness REQUEST_CHANGES on cleanup-pattern parallel to #542.

frond-scribe and others added 6 commits May 2, 2026 19:23
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per cohort second-pass review on #537 (🌊 issuecomment-4365197778):
- WORKORDER.md at repo root pollutes; same anti-pattern as
  tmp-drop-me-v29-uptake.md was on #542 before cleanup
- Move to docs/design/<issue>-<descriptor>-workorder.md naming
  matches existing convention from #541 v29-uptake journal precedent

Cleanup-pattern parallel to #542 (cohort verdict A on 12th costume).

Refs #537
@ronan-dandelion-cult ronan-dandelion-cult force-pushed the frond-scribe/532-silas-saturation branch from de5f7b0 to e1fe2fc Compare May 3, 2026 02:23
@ronan-dandelion-cult ronan-dandelion-cult merged commit dda74a7 into frond-scribe/20260429/v3-cohort-fixes May 3, 2026
71 of 83 checks passed
ronan-dandelion-cult pushed a commit that referenced this pull request May 3, 2026
Preserves the existing PR branch history while bringing in the #537/#542/#545 cohort tip for the RFC loopback.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

# Conflicts:
#	docs/design/continue-work-signal-v2.md
ronan-dandelion-cult pushed a commit that referenced this pull request May 3, 2026
Reflects #537 run provenance classification and continuation queue metrics provider samples in the delegate dispatch sequence.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ronan-dandelion-cult pushed a commit that referenced this pull request May 3, 2026
Reflects #537 continuation queue depth/drain metrics and run-spawn provenance on the post-compaction lifecycle.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ronan-dandelion-cult pushed a commit that referenced this pull request May 3, 2026
Reflects #537 diagnostic continuation queue metrics provider seam and diagnostic event surface on the lifecycle broker.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ronan-dandelion-cult pushed a commit that referenced this pull request May 3, 2026
Reflects #537 run fireReason provenance, parentRunId linkage, and continuation queue diagnostic sample events alongside the OTel trace flow.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ronan-dandelion-cult pushed a commit that referenced this pull request May 3, 2026
Absorb merged wave #542, #545, and #537 from frond-scribe/20260429/v3-cohort-fixes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ronan-dandelion-cult pushed a commit that referenced this pull request May 3, 2026
Folds RFC scientific-literature review apply work + #537-aware §4 diagram
refreshes into the v52 successor candidate, per figs's directive:
v52 candidate becomes the complete rotation surface (v5.2 base + tonight's
wave + RFC apply substrate).

Source: frond-scribe/20260503/rfc-apply-copilot @ e624547
- Apply lane DECLARE-DONE (4ee47f2): folded #544 review findings into
  docs/design/continue-work-signal-v2.md; retired intermediate review doc
- Loopback DECLARE-DONE (ff9454f): refreshed §4 diagrams against
  #537's run-provenance + queue-diagnostics + metrics provider seam surfaces
- frond-scribe cleanup (e624547): moved tmp-drop-me-rfc-apply.md to
  docs/design/547-rfc-apply-journal.md per Codex P2

Refs #544, #547, #548
ronan-dandelion-cult pushed a commit that referenced this pull request May 3, 2026
…v52-uptake-journal.md

Per 🌻's first-pass byte-walk REQUEST_CHANGES on PR #549 (msg 1500336626482544691):
> tmp-drop-me-v52-uptake.md (63 KB) at repo root. Filename literally says 'drop me.'
> Same shape as the tmp-drop-me-v29-uptake.md we cleaned out of #542 — recommend the
> same canonical pattern: git mv to docs/design/546-v52-uptake-journal.md.

Applying same canonical pattern from #542 + #537 cleanups. Naming matches
existing <issue>-<descriptor>-journal.md convention from #541 precedent.
Audit trail preserved at durable home.

🌻's soft suggestion #2 (inherited WORKORDER-v3-cleanup-path-B.md +
WORKORDER-v3-cohort-fixes.md at root from prior cohort-cleanup-needed lanes)
flagged as separate cleanup pass — not folded here to keep this commit
surgical to REQUEST_CHANGES (1).

Refs #546, #549
ronan-dandelion-cult pushed a commit that referenced this pull request May 3, 2026
…bution

Per figs's directive 2026-05-03 ~03:38Z: 'we can leave no dev detritus; our
single contribution in docs/design is our rfc doc, thats it'.

Earlier waves had moved tmp-drop-me journals + WORKORDER-* files from repo
root INTO docs/design/ (canonical mv-pattern from #542 / #537 / v52 cleanup
cycles). figs's directive overrides: those journals should be DELETED entirely,
not relocated. The cohort-substrate audit-trail lives in:
- git log + branch history (preserved on origin savegame branches)
- GitHub issue tracking (#541 + #546 + #547 + #532 etc.)
- PR descriptions + comment threads

NOT in tree as in-source dev artifacts.

Files removed (all originated as dev-cycle journals/workorders):
- 541-v29-uptake-journal.md (was tmp-drop-me-v29-uptake.md, #542 cleanup)
- 546-v52-uptake-journal.md (was tmp-drop-me-v52-uptake.md, v52 absorption)
- 547-rfc-apply-journal.md (was tmp-drop-me-rfc-apply.md, rfc-apply lane)
- 532-silas-saturation-workorder.md (was WORKORDER.md, #537 cleanup)
- v3-cleanup-path-B-workorder.md (inherited workorder, swim-39 era)
- v3-cohort-fixes-workorder.md (inherited workorder, v3-cohort-fixes era)

Pre-existing princely design memos (332-* + 334-* + swim-37-* + continuation-integration.md)
NOT touched in this commit — flagged separately for figs's call on whether
those also count as dev-detritus.

Refs #549
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant