trace: Correlate channel message diagnostics into one trace#88821
Conversation
|
Codex review: needs maintainer review before merge. Reviewed June 1, 2026, 4:26 PM ET / 20:26 UTC. Summary PR surface: Source +410, Tests +1179, Generated 0. Total +1589 across 12 files. Reproducibility: yes. at source level: current docs promise trace inheritance, while current main does not create a channel-message trace scope and diagnostics-otel starts message.processed only at completion. I did not run a live Slack reproduction in this read-only review. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land the shared channel/harness/Otel trace fix after owner acceptance of the qa-channel OTLP proof and the new test-runtime helper surface; otherwise narrow the proof or export requirements before merge. Do we have a high-confidence way to reproduce the issue? Yes at source level: current docs promise trace inheritance, while current main does not create a channel-message trace scope and diagnostics-otel starts message.processed only at completion. I did not run a live Slack reproduction in this read-only review. Is this the best way to solve the issue? Yes, this appears to be the right shared-layer fix rather than a Slack-only workaround: it scopes the channel kernel, harness lifecycle, and OTel exporter together. The remaining question is owner acceptance of the proof threshold and public test helper export. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against fe97c6000c5c. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +410, Tests +1179, Generated 0. Total +1589 across 12 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
5c9b833 to
31d588e
Compare
8719ff6 to
2d0b308
Compare
|
@clawsweeper review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
To resolve #88811 |
7004b47 to
cf097ae
Compare
|
@clawsweeper review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
I don’t think the Clawsweeper item is a real P1 as stated. The normal/success waterfall path is covered and working:
Reran the focused trace-parenting tests: 2 files passed, 13 focused tests passed, 103 skipped |
ef11443 to
1ab07ca
Compare
1ab07ca to
04e9189
Compare
|
Land-ready proof after maintainer refactor. Summary:
Local proof on head
CI proof:
Known proof gap:
|
Refactor the agent harness surface after PR #88821 by moving compaction dispatch into its own module, splitting the harness type into explicit capability interfaces, and renaming the private agent-core class declaration to `CoreAgentHarness` while preserving the exported `AgentHarness` contract. Verification: - `node scripts/run-vitest.mjs src/agents/harness/selection.test.ts src/agents/command/cli-compaction.test.ts src/agents/embedded-agent-runner/compact.hooks.test.ts packages/agent-core/src/agent-loop.test.ts packages/agent-core/src/harness/messages.test.ts` - `pnpm build` - autoreview clean - `pnpm check:changed` passed on Testbox `tbx_01kt407hq8sv1csm287pdj3fmp` - PR CI merge state `CLEAN`
Refactor the agent harness surface after PR openclaw#88821 by moving compaction dispatch into its own module, splitting the harness type into explicit capability interfaces, and renaming the private agent-core class declaration to `CoreAgentHarness` while preserving the exported `AgentHarness` contract. Verification: - `node scripts/run-vitest.mjs src/agents/harness/selection.test.ts src/agents/command/cli-compaction.test.ts src/agents/embedded-agent-runner/compact.hooks.test.ts packages/agent-core/src/agent-loop.test.ts packages/agent-core/src/harness/messages.test.ts` - `pnpm build` - autoreview clean - `pnpm check:changed` passed on Testbox `tbx_01kt407hq8sv1csm287pdj3fmp` - PR CI merge state `CLEAN`
Refactor the agent harness surface after PR openclaw#88821 by moving compaction dispatch into its own module, splitting the harness type into explicit capability interfaces, and renaming the private agent-core class declaration to `CoreAgentHarness` while preserving the exported `AgentHarness` contract. Verification: - `node scripts/run-vitest.mjs src/agents/harness/selection.test.ts src/agents/command/cli-compaction.test.ts src/agents/embedded-agent-runner/compact.hooks.test.ts packages/agent-core/src/agent-loop.test.ts packages/agent-core/src/harness/messages.test.ts` - `pnpm build` - autoreview clean - `pnpm check:changed` passed on Testbox `tbx_01kt407hq8sv1csm287pdj3fmp` - PR CI merge state `CLEAN`
Refactor the agent harness surface after PR openclaw#88821 by moving compaction dispatch into its own module, splitting the harness type into explicit capability interfaces, and renaming the private agent-core class declaration to `CoreAgentHarness` while preserving the exported `AgentHarness` contract. Verification: - `node scripts/run-vitest.mjs src/agents/harness/selection.test.ts src/agents/command/cli-compaction.test.ts src/agents/embedded-agent-runner/compact.hooks.test.ts packages/agent-core/src/agent-loop.test.ts packages/agent-core/src/harness/messages.test.ts` - `pnpm build` - autoreview clean - `pnpm check:changed` passed on Testbox `tbx_01kt407hq8sv1csm287pdj3fmp` - PR CI merge state `CLEAN`
Fixes #88811.
Root Cause
Slack/channel message handling did not establish one diagnostic trace scope for the inbound message. The callback path can bypass the gateway request span, the embedded runner created a separate unrepresented diagnostic parent for model diagnostics, and diagnostics-otel only created
openclaw.message.processedat completion time. That left harness/model diagnostics without a live message parent in OTel, queued async diagnostic ordering could make child spans arrive before the harness parent, and same-lifecycle file log records had no direct regression coverage for the channel trace scope.Implementation
openclaw.message.processedOTel span at message dispatch start for internal/trusted diagnostics, then finalize it after queued child diagnostics drain.openclaw.model.usagecan still parent under the harness span.harness.run.startedsynchronously so it can anchoropenclaw.run,openclaw.model.usage, andopenclaw.model.callchildren.logMessageDispatchStarted/logMessageProcessedlifecycle helpers, duplicate/skipped dispatch diagnostics, and file log records emitted in the same channel lifecycle.Real behavior proof
qa-channel, qa-lab bus, a real gateway child process, diagnostics-otel enabled, a local OTLP/HTTP receiver, and the repomock-openaiprovider.pnpm qa:otel:smoke -- --collector local --output-dir .artifacts/qa-e2e/otel-trace-waterfall-pr88821The local OTLP summary captured exported spans named
openclaw.run,openclaw.harness.run,openclaw.context.assembled,openclaw.message.delivery, and the GenAI model-call spanchat gpt-5.5, with no raw diagnostic id/content attributes exported and no model-call error attributes.openclaw.message.processedlifecycle span, harness/model child relationships, delivery span parentage, and duplicate/skipped dispatch diagnostics remain inside the active inbound trace.Tests and validation
node scripts/run-vitest.mjs extensions/diagnostics-otel/src/service.test.ts src/channels/turn/kernel.test.ts src/agents/harness/v2.test.tsnode scripts/run-vitest.mjs extensions/diagnostics-otel/src/service.test.ts src/channels/turn/kernel.test.ts src/agents/harness/v2.test.ts src/auto-reply/reply/dispatch-from-config.test.ts -t "keeps duplicate skip diagnostics inside the active inbound trace"OPENCLAW_VITEST_MAX_WORKERS=2 OPENCLAW_VITEST_SHARD_NAME=core-runtime-infra-file-safety OPENCLAW_VITEST_NO_OUTPUT_TIMEOUT_MS=900000 OPENCLAW_TEST_PROJECTS_PARALLEL=2 node scripts/test-projects.mjs src/infra/boundary-file-read.test.ts src/infra/boundary-path.test.ts src/infra/hardlink-guards.test.ts src/infra/replace-file.test.ts src/infra/resolve-system-bin.test.ts src/infra/safe-package-install.test.ts src/infra/stable-node-path.test.ts src/infra/watch-node.test.tspnpm check:test-typesOPENCLAW_ADDITIONAL_BOUNDARY_SHARD=2/4,3/4,4/4 OPENCLAW_ADDITIONAL_BOUNDARY_CONCURRENCY=4 node scripts/run-additional-boundary-checks.mjspnpm lint --threads=8pnpm qa:otel:smoke -- --collector local --output-dir .artifacts/qa-e2e/otel-trace-waterfall-pr88821Remote validation on head
12f51001e4dd388e111a5d1d1d90d428e712bca8:check-test-typespassed.check-additional-boundaries-apassed.check-additional-boundaries-bcdpassed.checks-node-core-runtime-infra-file-safetypassed.checks-node-core-runtime-media-uipassed.Real behavior proofpassed.check-lint,check-prod-types,check-shrinkwrap,check-guards, and the node shards reported green at the latest check.node scripts/crabbox-wrapper.mjs run -- env OPENCLAW_CHECK_CHANGED_REMOTE_CHILD=1 OPENCLAW_CHANGED_LANES_RAW_SYNC=1 corepack pnpm check:changeddid not reach project checks because local Crabbox failed its binary sanity check:[crabbox] selected binary failed basic --version/--help sanity checks.Risk checklist
No.No.No.Current review state
The branch has been rebased onto
origin/main(9cb347e4c364a42ce63788710ff6059e6464f711) and the previous dirty merge state is resolved. The earlier file-safety timeout was reproduced locally, traced to awatch-nodetest fake already fixed upstream, and the rebased PR now passes the remote file-safety shard.Remaining reviewer decision: whether the local qa-channel OTLP smoke proof is sufficient for this diagnostics PR, or whether maintainers still require a live Slack + Tempo artifact before merge.
AI-assisted change: implemented with Codex.