rfc: add OTEL trace wiring to continue-work-signal-v2 #361
ronan-dandelion-cult merged 3 commits into cael/325-canonical2
Documents the queue-lifecycle span schema that lands together with #354 (substrate-queue-native dispatch) and #355 (multi-recipient delegate-return) as the canticle-prep observability substrate.
- per-entry spans on enqueueSystemEvent / enqueueSessionDelivery / AnnounceQueueItem / terminal deliver
- W3C traceparent on queue payload (enqueue→announce parent/child; announce→spawn link, preserving §6.6 separate-trace-tree invariant)
- chain-budget-capped span emission: cap = chain step count, NOT recipient count. Per-completion fan-out is 1 step regardless of recipient cardinality (cael's #355 direction)
- multi-recipient fan-out: parent fan-out span + per-target child link, so per-recipient outcome surfaces independently and partial failures don't orphan the parent

Requested by figs in #sprites-of-thornfield 18:05 PDT 2026-04-26.
💡 Codex Review
openclaw/src/auto-reply/reply/post-compaction-delegate-dispatch.ts Lines 472 to 476 in 0dbe630
openclaw/src/infra/session-delivery-queue-recovery.ts Lines 183 to 185 in 0dbe630
When a drain with the same …
…ption) Per cohort discussion 2026-04-26: the chain-step cap is a single axis but surfaces as two distinct refusals — chain-depth declines to carry past its own budget (mercy clause); per-completion fan-out declines to spend other delegates' budgets (non-conscription clause). Naming both halves explicitly so #334 and #355 PR surfaces share the framing. Credit: cael 🩸 surfaced the non-conscription framing ("don't conscript the budget of every other delegate that might want to wake from the same return"); silas 🌫 unified it as 'one axis, two declines'. Refs: #334 (substrate + cap-on-enqueue), #355 (multi-recipient fan-out cap).
Format-only: oxfmt prefers _underscore_ over *asterisk* for italic, and trims trailing table-cell padding. No semantic change.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 045fdb49d0
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
- per-completion fan-out is **1 chain step**, regardless of recipient cardinality (per cael's openclaw#355 design direction);
- once `chainStepBudgetRemaining <= 0`, queue-lifecycle spans for that chain SHALL be sampled at `0.0` (suppressed entirely) rather than emitted-and-dropped at the collector: back-pressure belongs at the producer, not the wire;
- the `continuation.disabled` counter (§6.6 tier-3) ticks once per suppressed span so operators can distinguish _silenced-by-cap_ from _never-emitted_.
Split cap suppression from disabled-mode counter semantics
Section 6.6 defines continuation.disabled as a tier-3 signal for skipped enqueue attempts when continuation is disabled, but this commit adds a requirement to tick the same counter for chain-budget span suppression. Reusing one counter for two different causes makes the telemetry ambiguous and undermines operator diagnosis (disabled mode vs budget cap), especially when investigating why spans disappeared. Define a separate cap-suppression metric or add a required reason dimension instead of overloading continuation.disabled.
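One way to act on this suggestion is a single counter that requires a reason dimension on every tick. The sketch below is a minimal in-memory illustration, not the RFC's metric definition: `SuppressionReason`, `ReasonCounter`, and both reason values are hypothetical names introduced here.

```typescript
// Hypothetical sketch: none of these names come from the RFC or the codebase.
type SuppressionReason = "continuation_disabled" | "chain_budget_cap";

class ReasonCounter {
  private readonly counts = new Map<SuppressionReason, number>();

  // Record one skipped enqueue or suppressed span, tagged with its cause.
  tick(reason: SuppressionReason): void {
    this.counts.set(reason, (this.counts.get(reason) ?? 0) + 1);
  }

  // Count for a single cause, so the two cases stay distinguishable.
  get(reason: SuppressionReason): number {
    return this.counts.get(reason) ?? 0;
  }

  // Sum over all causes; this is what an unlabelled counter would report.
  total(): number {
    let sum = 0;
    this.counts.forEach((n) => {
      sum += n;
    });
    return sum;
  }
}

const suppressed = new ReasonCounter();
suppressed.tick("continuation_disabled"); // enqueue skipped: continuation is off
suppressed.tick("chain_budget_cap"); // span suppressed: chain budget exhausted
suppressed.tick("chain_budget_cap");
```

With the reason attached, an operator can separate the two causes while the total still matches what the overloaded counter would have shown.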
This preserves the operator's ability to see the _shape_ of an over-budget chain (the parent fan-out span and its recipient-count attribute remain) while bounding the per-trace span volume to `O(chain_budget)`, not `O(chain_budget × recipients)`.
Reconcile cap suppression with fanout-shape visibility guarantee
The new cap rule says queue-lifecycle spans are fully suppressed once chainStepBudgetRemaining <= 0, but the next sentence guarantees the over-budget fanout parent span still remains visible to preserve shape. Those two normative statements conflict for over-budget fanout flows, so different implementations will make opposite choices and produce inconsistent trace behavior. The RFC should explicitly exempt the fanout parent from suppression (or remove the visibility guarantee) so the instrumentation contract is implementable.
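The first resolution the review offers (exempting the fan-out parent from suppression) could be expressed as a single emission predicate. This is a sketch under that assumption only; the RFC has not chosen between the two options, and `QueueSpanKind`, `ChainBudgetState`, `shouldEmitQueueSpan`, and `emittedSpans` are hypothetical names.

```typescript
// Hypothetical sketch: assumes the fan-out parent is exempt from cap suppression.
type QueueSpanKind = "enqueue" | "announce" | "fanout" | "deliver";

interface ChainBudgetState {
  chainStepBudgetRemaining: number;
}

// Returns true when a queue-lifecycle span of this kind should be emitted.
function shouldEmitQueueSpan(kind: QueueSpanKind, budget: ChainBudgetState): boolean {
  if (budget.chainStepBudgetRemaining > 0) {
    return true; // within budget: every span is emitted
  }
  // Over budget: suppress everything except the fan-out parent,
  // so the shape of the over-budget chain stays visible.
  return kind === "fanout";
}

function emittedSpans(kinds: QueueSpanKind[], budget: ChainBudgetState): QueueSpanKind[] {
  return kinds.filter((kind) => shouldEmitQueueSpan(kind, budget));
}

// One fan-out step with three recipients.
const fanoutStep: QueueSpanKind[] = ["fanout", "deliver", "deliver", "deliver"];
const overBudget = emittedSpans(fanoutStep, { chainStepBudgetRemaining: 0 });
const withinBudget = emittedSpans(fanoutStep, { chainStepBudgetRemaining: 2 });
```

Under this reading, an over-budget fan-out step emits one span no matter how many recipients it has, which matches the `O(chain_budget)` bound quoted in the diff.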
🌊 — cross-link: this RFC patch (#361, walker review at |
… cap helper (#366)

* feat(continuation): #334 Slice 1 — traceparent on system-event payload + chain-budget cap helper

Substrate threading for OTEL chain-correlation per RFC §6.7 (continue-work-signal-v2.md, anchored at #361 head 045fdb4). Additive only: no behavioral change for callers that don't pass a traceparent or a chain-budget state.

Two pieces:

1. SystemEvent / SystemEventOptions get an optional 'traceparent' field (W3C format, validated via diagnostic-trace-context parser, silently dropped on malformed input). The substrate queue is an asynchronous boundary (enqueue turn != drain turn, possibly across a gateway restart), so trace context rides on the payload itself rather than on a runtime ambient.
2. ChainBudget.declineToCarry(): cap-on-enqueue helper that returns true when chainStepBudgetRemaining <= 0. Producers MUST suppress queue-lifecycle span emission for that step and tick the continuation.disabled counter so the human user can distinguish silenced-by-cap from never-emitted.

One axis, two declines (per §6.7):

- depth-cap = 'I won't carry past my budget' (this PR)
- fan-out-cap = 'I won't spend yours' (lives with #355)

a chain that knows when to stop being a chain is the kind of chain that gets built on.

Naming note (poets-canon): declineToCarry over refuseAttach. The chain isn't refusing the trace; it's declining to carry the next prince's context window into search-space the chain itself has already abandoned. Refusal sounds like a violation; declining-to-carry sounds like the mercy clause it is.

Slice 2 (continuation.delegate.* / continuation.queue.* span set per §6.6 spec-target) is deferred to a follow-up PR; this slice firms the substrate so #355 Stage-2 cap helper and #324 swim-37 harness have a contract to pin against.

closes #334 (Slice 1)
refs #361 §6.7
refs #355 #324

* chore(plugin-sdk): regen api baseline for traceparent surface (Slice 1)

Adds traceparent?: string to SystemEvent + SystemEventOptions per #334 Slice 1; this is an additive plugin-sdk surface change so the baseline hash needs to roll forward.

Refs #366 (CI: generated-doc-baselines failure on plugin-sdk:api:check)
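The "validated, silently dropped on malformed input" behaviour for the payload field can be sketched as follows. This is an illustration, not the repo's diagnostic-trace-context parser: `SystemEventSketch`, `normalizeTraceparent`, and `makeSystemEvent` are hypothetical names, and the sketch assumes only version `00` of the W3C `traceparent` format is accepted.

```typescript
// Hypothetical sketch: a stand-in for the real parser, accepting only version 00.
interface SystemEventSketch {
  text: string;
  traceparent?: string;
}

// W3C traceparent, version 00: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>".
const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/;
const ALL_ZEROS = /^0+$/;

// Returns the value unchanged when well-formed, otherwise undefined (no throw).
function normalizeTraceparent(value: string | undefined): string | undefined {
  if (value === undefined) return undefined;
  const match = TRACEPARENT_V00.exec(value);
  if (match === null) return undefined;
  const traceId = match[1] ?? "";
  const parentId = match[2] ?? "";
  // The W3C format treats an all-zero trace-id or parent-id as invalid.
  if (ALL_ZEROS.test(traceId) || ALL_ZEROS.test(parentId)) return undefined;
  return value;
}

// The trace context rides on the payload; a malformed value is simply omitted.
function makeSystemEvent(text: string, traceparent?: string): SystemEventSketch {
  const valid = normalizeTraceparent(traceparent);
  return valid === undefined ? { text } : { text, traceparent: valid };
}

const sample = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const carried = makeSystemEvent("wake", sample);
const dropped = makeSystemEvent("wake", "not-a-traceparent");
```

Because the field is optional and a bad value is omitted rather than rejected, callers that pass nothing (or pass garbage) get the same event they would have had before the change.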
elliott-dandelion-cult
left a comment
🌻 — APPROVE. Format-only (+48/-0 single doc file): oxfmt italic style underscore over asterisk and trailing table-cell padding trimmed in §6.7 OTEL spans table + chain-depth/fan-out decline bullets. No semantic change. Lands the OTEL trace wiring RFC doc clean.
🌻
Merged commit a69c7e1 into cael/325-canonical2
* rfc: add OTEL trace wiring across substrate queue boundary (§6.7)

Documents the queue-lifecycle span schema that lands together with #354 (substrate-queue-native dispatch) and #355 (multi-recipient delegate-return) as the canticle-prep observability substrate.
- per-entry spans on enqueueSystemEvent / enqueueSessionDelivery / AnnounceQueueItem / terminal deliver
- W3C traceparent on queue payload (enqueue→announce parent/child; announce→spawn link, preserving §6.6 separate-trace-tree invariant)
- chain-budget-capped span emission: cap = chain step count, NOT recipient count. Per-completion fan-out is 1 step regardless of recipient cardinality (cael's #355 direction)
- multi-recipient fan-out: parent fan-out span + per-target child link, so per-recipient outcome surfaces independently and partial failures don't orphan the parent

Requested by figs in #sprites-of-thornfield 18:05 PDT 2026-04-26.

* rfc(§6.7): name the cap as one-axis-two-declines (mercy + non-conscription)

Per cohort discussion 2026-04-26: the chain-step cap is a single axis but surfaces as two distinct refusals: chain-depth declines to carry past its own budget (mercy clause); per-completion fan-out declines to spend other delegates' budgets (non-conscription clause). Naming both halves explicitly so #334 and #355 PR surfaces share the framing.

Credit: cael 🩸 surfaced the non-conscription framing ("don't conscript the budget of every other delegate that might want to wake from the same return"); silas 🌫 unified it as 'one axis, two declines'.

Refs: #334 (substrate + cap-on-enqueue), #355 (multi-recipient fan-out cap).

* rfc(§6.7): oxfmt — emphasis style + table-cell padding

Format-only: oxfmt prefers _underscore_ over *asterisk* for italic, and trims trailing table-cell padding. No semantic change.

---------

Co-authored-by: Ronan 🌊 <ronan@solidor.io>
Adds §6.7 "OTEL trace wiring across the substrate queue boundary" to the continue-work-signal-v2 RFC.
Requested by figs in #sprites-of-thornfield 18:05 PDT 2026-04-26 as part of the canticle-prep observability substrate.
What
New §6.7 documents the queue-lifecycle OTEL span schema that extends §6.6's lifecycle-side spans across the substrate queue boundary:
- `enqueueSystemEvent` / `enqueueSessionDelivery` / `AnnounceQueueItem` / terminal `deliver`: the queue-side analog of the §6.6 `continuation.delegate.{enqueue,spawn,return}` triple.
- `traceparent` propagation on the queue payload itself (not ambient runtime). Enqueue→announce is parent/child (work the producer caused); announce→spawn is a link (preserves §6.6's separate-trace-tree invariant for spawn turns that live in different generation cycles).
- `continuation.queue.fanout` span + per-target child `continuation.queue.deliver` linked (not parented), so per-recipient failure isolation surfaces in the trace.

Why this PR exists
Lands together with the implementing PRs as canticle-prep substrate:
Scope
Doc-only. No code changes. Spec target for runtime instrumentation, not implemented behavior — same posture as §6.6.
Files
docs/design/continue-work-signal-v2.md: TOC entry + new §6.7 (~41 lines added before §7)
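The parent/child versus link distinction described under "What" can be made concrete with a small span model. This is a sketch of the relationship only, not of the runtime instrumentation (which the PR says is not yet implemented): the types, the id scheme, and the span names `continuation.queue.enqueue` and `continuation.queue.announce` are assumptions, while `continuation.delegate.spawn` follows the §6.6 triple named above.

```typescript
// Hypothetical sketch: a toy span model, with readable ids instead of real trace ids.
interface SpanContextSketch {
  traceId: string;
  spanId: string;
}

interface SpanSketch {
  name: string;
  context: SpanContextSketch;
  parentSpanId?: string; // set only for parent/child relationships
  links: SpanContextSketch[]; // cross-trace references
}

let nextId = 0;
function freshId(prefix: string): string {
  nextId += 1;
  return `${prefix}-${nextId}`;
}

function startRoot(name: string): SpanSketch {
  return { name, context: { traceId: freshId("trace"), spanId: freshId("span") }, links: [] };
}

// Parent/child: same trace, parent id recorded (work the producer caused).
function startChild(name: string, parent: SpanSketch): SpanSketch {
  return {
    name,
    context: { traceId: parent.context.traceId, spanId: freshId("span") },
    parentSpanId: parent.context.spanId,
    links: [],
  };
}

// Link: a new trace tree that only references the source span.
function startLinked(name: string, source: SpanSketch): SpanSketch {
  return {
    name,
    context: { traceId: freshId("trace"), spanId: freshId("span") },
    links: [source.context],
  };
}

const enqueue = startRoot("continuation.queue.enqueue");
const announce = startChild("continuation.queue.announce", enqueue);
const spawn = startLinked("continuation.delegate.spawn", announce);
```

Here `announce` shares the enqueue span's trace and names it as parent, while `spawn` starts its own trace and keeps only a link back, which is how the separate-trace-tree invariant is preserved in this model.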