Correlate Slack/channel message diagnostics into a single trace #88811

@bek91

Summary

OpenClaw's diagnostics/logging docs describe request-level trace correlation across gateway handling, diagnostic events, agent runs, model usage, and model calls. In a Slack-agent run on OpenClaw 2026.5.28, the relevant diagnostics arrived in Tempo as separate root traces instead of one correlated trace waterfall.

This makes it hard to answer the basic lifecycle question: "How long did one inbound Slack message take from receipt to reply, and where was the time spent?"

Expected behavior

docs/logging.md describes trace correlation as follows:

  • Gateway HTTP requests and WebSocket frames establish an internal request trace scope.
  • Logs and diagnostic events in that async scope inherit the request trace if no explicit context is provided.
  • Agent-run and model-call traces should be children of the active request trace.
  • Local logs, diagnostic snapshots, OTel spans, and provider trace headers should be joinable by traceId.
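The inheritance rule described in these bullets can be sketched with Node's `AsyncLocalStorage` from `node:async_hooks`. This is only an illustration of the documented behavior, not OpenClaw's actual implementation: `TraceContext`, `runInRequestScope`, `runInChildSpan`, and `emitEvent` are hypothetical names, and the counter-based ids stand in for real random ids.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical shape; the real OpenClaw trace context type may differ.
interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

const traceScope = new AsyncLocalStorage<TraceContext>();

// Counter-based hex ids keep the sketch reproducible; real code would use random ids.
let counter = 0;
function nextId(hexLength: number): string {
  counter += 1;
  return counter.toString(16).padStart(hexLength, "0");
}

// Establish a request-level scope, as the docs describe for gateway requests.
function runInRequestScope<T>(fn: () => T): T {
  const root: TraceContext = { traceId: nextId(32), spanId: nextId(16) };
  return traceScope.run(root, fn);
}

// Start a child of the active context; become a new root only if nothing is active.
function runInChildSpan<T>(fn: () => T): T {
  const parent = traceScope.getStore();
  const child: TraceContext = parent
    ? { traceId: parent.traceId, spanId: nextId(16), parentSpanId: parent.spanId }
    : { traceId: nextId(32), spanId: nextId(16) };
  return traceScope.run(child, fn);
}

// A log or diagnostic event inherits the active context unless one is passed explicitly.
function emitEvent(name: string, explicit?: TraceContext) {
  const ctx = explicit ?? traceScope.getStore();
  return {
    name,
    traceId: ctx?.traceId,
    spanId: ctx?.spanId,
    parentSpanId: ctx?.parentSpanId,
  };
}

const inherited = runInRequestScope(() => {
  const request = emitEvent("openclaw.message.processed");
  const child = runInChildSpan(() => emitEvent("openclaw.harness.run"));
  return { request, child };
});
const outsideScope = emitEvent("orphan");
```

Here `inherited.child` shares the request's `traceId` and points at the request span as its parent, while `outsideScope` carries no trace context because it was emitted outside any scope. The same store also follows `await` boundaries, which is what makes it suitable for request-scoped correlation.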

For one inbound Slack message, I would therefore expect a single trace containing spans/events such as:

  • openclaw.message.processed
  • openclaw.harness.run
  • openclaw.model.usage
  • openclaw.model.call

Observed behavior

Environment, sanitized:

  • OpenClaw version: 2026.5.28
  • Build/revision observed: e932160
  • Runtime shape: OpenClaw agent with Slack channel enabled and @openclaw/diagnostics-otel enabled
  • OTel exporter: OTLP/HTTP to Tempo
  • OTel settings: traces enabled, metrics enabled, logs disabled, sample rate 1.0

After sending one Slack message and receiving a reply, Tempo accepted new spans and the expected OpenClaw span names were present. However, the lifecycle spans appeared as separate root traces rather than one parented trace.

Observed spans from the same Slack interaction:

| Span name | Duration | Key attributes |
| --- | --- | --- |
| openclaw.message.processed | 5.009s | openclaw.channel=slack, openclaw.outcome=completed |
| openclaw.model.usage | 4.441s | openclaw.channel=slack, openclaw.provider=openai-codex, openclaw.model=gpt-5.5, token counts present |
| openclaw.harness.run | 4.212s | openclaw.harness.id=codex, openclaw.harness.plugin=codex, openclaw.outcome=completed |
| openclaw.model.call | 2.837s | openclaw.api=openai-codex-responses, openclaw.transport=stdio, openclaw.model=gpt-5.5 |
| openclaw.message.processed | 6ms | openclaw.channel=slack, openclaw.outcome=skipped, openclaw.reason=duplicate |

Additional records:

  • The completed openclaw.message.processed span began at approximately 2026-05-31T18:24:21.587-04:00 and lasted about 5.009s.
  • The gateway delivered the Slack reply at approximately 2026-05-31T18:24:26-04:00, matching the message-processing span duration.
  • openclaw.model.usage began within the message-processing window and lasted about 4.441s.
  • openclaw.harness.run began within the message-processing window and lasted about 4.212s.
  • openclaw.model.call began within the harness/model window and lasted about 2.837s.
  • Tempo search tags included the expected values for openclaw.channel=slack, openclaw.provider, openclaw.model, openclaw.harness.*, gen_ai.*, and the span names above.

The data is internally consistent as one Slack turn, but trace parentage/correlation is missing.
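One way to flag this pattern from exported data is to look for root spans in different traces whose time windows nest inside each other. The sketch below is a heuristic only. `ExportedSpan` and `containedRoots` are hypothetical names, the durations are the observed ones, and the start offsets are placeholders chosen for illustration because only the start of the completed message span was recorded.

```typescript
// Hypothetical flattened view of a span as returned by a trace search.
interface ExportedSpan {
  name: string;
  traceId: string;
  parentSpanId?: string;
  startMs: number; // offset from an arbitrary origin
  durationMs: number;
}

interface ContainedPair {
  outer: string;
  inner: string;
}

// Report root spans from *different* traces where one window fully contains the other.
// Such pairs are candidates for missing parent/child correlation.
function containedRoots(spans: ExportedSpan[]): ContainedPair[] {
  const roots = spans.filter((s) => s.parentSpanId === undefined);
  const pairs: ContainedPair[] = [];
  for (const outer of roots) {
    for (const inner of roots) {
      if (outer === inner || outer.traceId === inner.traceId) continue;
      const outerEnd = outer.startMs + outer.durationMs;
      const innerEnd = inner.startMs + inner.durationMs;
      if (inner.startMs >= outer.startMs && innerEnd <= outerEnd) {
        pairs.push({ outer: outer.name, inner: inner.name });
      }
    }
  }
  return pairs;
}

// Durations are from the observed spans; startMs values are illustrative placeholders.
const observed: ExportedSpan[] = [
  { name: "openclaw.message.processed", traceId: "trace-a", startMs: 0, durationMs: 5009 },
  { name: "openclaw.model.usage", traceId: "trace-b", startMs: 300, durationMs: 4441 },
  { name: "openclaw.harness.run", traceId: "trace-c", startMs: 400, durationMs: 4212 },
  { name: "openclaw.model.call", traceId: "trace-d", startMs: 1000, durationMs: 2837 },
];

const suspicious = containedRoots(observed);
const insideMessage = suspicious.filter((p) => p.outer === "openclaw.message.processed");
```

With these placeholder offsets, all three other roots fall inside the `openclaw.message.processed` window, which is the shape I would expect to collapse into one trace once parentage is preserved.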

Why this matters

Without trace correlation, operators can see isolated spans but cannot reliably inspect one message lifecycle as a single waterfall from receipt through dispatch, harness/model execution, and reply delivery.

Likely area to investigate

This may be specific to Slack/channel message ingestion rather than HTTP request handling. Slack socket-mode callbacks and other long-lived channel callbacks may not naturally run inside the same gateway HTTP/WebSocket request trace scope described in docs/logging.md.

Potential implementation shape:

  • Create or preserve a per-inbound-message trace context at message.received / dispatch start.
  • Run message dispatch, session turn creation, harness execution, model usage, model calls, and reply delivery inside that active diagnostic trace context.
  • Ensure emitted diagnostic events carry traceId, spanId, and parentSpanId consistently.
  • Verify the OTel diagnostics plugin preserves parentage when converting diagnostic events to spans.
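A rough sketch of the first three bullets, assuming the channel callback is the place where a per-message trace context gets created. Everything here is hypothetical (`withSpan`, `handleInboundMessage`, the `emitted` array, counter-based ids); only `AsyncLocalStorage` is a real Node API. `openclaw.model.usage` is left out because I do not know where it should sit in the tree.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

interface DiagnosticEvent extends TraceContext {
  name: string;
}

const activeTrace = new AsyncLocalStorage<TraceContext>();
const emitted: DiagnosticEvent[] = [];

// Counter-based ids for reproducibility; real code would generate random ids.
let seq = 0;
const hexId = (len: number): string => (++seq).toString(16).padStart(len, "0");

// Run one lifecycle step as a span: a child of the active context if there is one,
// otherwise a fresh root (the per-inbound-message trace).
async function withSpan<T>(name: string, step: () => Promise<T>): Promise<T> {
  const parent = activeTrace.getStore();
  const ctx: TraceContext = parent
    ? { traceId: parent.traceId, spanId: hexId(16), parentSpanId: parent.spanId }
    : { traceId: hexId(32), spanId: hexId(16) };
  try {
    return await activeTrace.run(ctx, step);
  } finally {
    // Emit on completion so the event carries traceId, spanId, and parentSpanId.
    emitted.push({ name, ...ctx });
  }
}

// Stand-in for a channel callback such as a socket-mode message handler.
async function handleInboundMessage(text: string): Promise<string> {
  return withSpan("openclaw.message.processed", () =>
    withSpan("openclaw.harness.run", () =>
      withSpan("openclaw.model.call", async () => `echo: ${text}`),
    ),
  );
}

const done = handleInboundMessage("hi");
```

Because the root is created inside the callback itself, the chain does not depend on the callback having been invoked from within a gateway request scope.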

Potential files/areas:

  • src/infra/diagnostic-events.ts
  • src/infra/diagnostic-trace-context.ts
  • gateway HTTP/WebSocket request scope setup
  • Slack/channel monitor or dispatch code
  • auto-reply/session turn creation path
  • embedded agent runner model diagnostics
  • diagnostics OTel exporter span conversion
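For the last item, the property worth checking is that conversion reuses the ids already on the diagnostic event and only mints new ones when the event has none. I have not confirmed how the exporter currently behaves, so this is a sketch of the desired property with hypothetical types (`DiagnosticEvent`, `SpanRecord`) rather than the plugin's real code or the OpenTelemetry API.

```typescript
// Hypothetical diagnostic event as produced upstream of the exporter.
interface DiagnosticEvent {
  name: string;
  traceId?: string;
  spanId?: string;
  parentSpanId?: string;
}

// Hypothetical exporter-side record, standing in for a real OTel span.
interface SpanRecord {
  name: string;
  traceId: string;
  spanId: string;
  parentSpanId: string | undefined;
}

// Counter-based id source for the sketch; a real exporter would use random ids.
function makeIdSource(): (hexLength: number) => string {
  let n = 0;
  return (hexLength) => (++n).toString(16).padStart(hexLength, "0");
}

// Preserve ids carried by the event; generate ids only for events without context.
function toSpanRecord(
  event: DiagnosticEvent,
  newId: (hexLength: number) => string,
): SpanRecord {
  return {
    name: event.name,
    traceId: event.traceId ?? newId(32),
    spanId: event.spanId ?? newId(16),
    parentSpanId: event.parentSpanId,
  };
}

// Names of spans that would show up as trace roots after conversion.
function rootNames(spans: SpanRecord[]): string[] {
  return spans.filter((s) => s.parentSpanId === undefined).map((s) => s.name);
}

const ids = makeIdSource();
const events: DiagnosticEvent[] = [
  { name: "openclaw.message.processed", traceId: "t1", spanId: "a" },
  { name: "openclaw.model.call", traceId: "t1", spanId: "b", parentSpanId: "a" },
];
const converted = events.map((e) => toSpanRecord(e, ids));
const contextless = toSpanRecord({ name: "no-context" }, ids);
```

If the converter instead started a fresh span per event without consulting these fields, every event would surface as its own root, which matches what Tempo showed.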

Success criteria

  • A single inbound Slack message produces one OTel trace containing the message lifecycle, harness, model usage, and model call spans.
  • openclaw.message.processed, openclaw.harness.run, openclaw.model.usage, and openclaw.model.call share the same traceId.
  • Child spans have meaningful parentSpanId relationships instead of appearing as separate roots.
  • The duplicate/skipped message event is either clearly parented to the same inbound-message trace or intentionally documented as a separate trace.
  • File logs emitted during the same async lifecycle include the same top-level traceId where diagnostic trace context is available.
  • A regression test covers the channel/Slack-style inbound message path and asserts trace correlation across lifecycle and model diagnostics.
  • Live verification with @openclaw/diagnostics-otel and Tempo shows one trace waterfall for one Slack reply lifecycle.
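The id-based criteria above could be asserted in a regression test along these lines. This is a sketch under assumptions: `SpanRecord` and `checkSingleTrace` are hypothetical helpers, and the test would need to be wired to whatever span capture the real test harness provides.

```typescript
interface SpanRecord {
  name: string;
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

// Return a list of violations; an empty list means the spans form one correlated trace.
function checkSingleTrace(spans: SpanRecord[], requiredNames: string[]): string[] {
  const problems: string[] = [];

  const traceIds = new Set(spans.map((s) => s.traceId));
  if (traceIds.size !== 1) {
    problems.push(`expected 1 traceId, found ${traceIds.size}`);
  }

  for (const name of requiredNames) {
    if (!spans.some((s) => s.name === name)) problems.push(`missing span ${name}`);
  }

  const roots = spans.filter((s) => s.parentSpanId === undefined);
  if (roots.length !== 1) {
    problems.push(`expected 1 root span, found ${roots.length}`);
  }

  // Every non-root span must point at a span that exists in the same capture.
  const spanIds = new Set(spans.map((s) => s.spanId));
  for (const s of spans) {
    if (s.parentSpanId !== undefined && !spanIds.has(s.parentSpanId)) {
      problems.push(`${s.name} has unknown parent ${s.parentSpanId}`);
    }
  }
  return problems;
}

const required = [
  "openclaw.message.processed",
  "openclaw.harness.run",
  "openclaw.model.usage",
  "openclaw.model.call",
];

// Shape that should pass: one trace, one root, resolvable parents.
// The parent choices below are illustrative, not a statement of the intended tree.
const correlated: SpanRecord[] = [
  { name: "openclaw.message.processed", traceId: "t", spanId: "1" },
  { name: "openclaw.harness.run", traceId: "t", spanId: "2", parentSpanId: "1" },
  { name: "openclaw.model.usage", traceId: "t", spanId: "3", parentSpanId: "1" },
  { name: "openclaw.model.call", traceId: "t", spanId: "4", parentSpanId: "2" },
];

// Shape matching the report: every span is its own root in its own trace.
const separateRoots: SpanRecord[] = required.map((name, i) => ({
  name,
  traceId: `t${i}`,
  spanId: `${i}`,
}));

const passing = checkSingleTrace(correlated, required);
const failing = checkSingleTrace(separateRoots, required);
```

Returning a list of problems rather than throwing on the first one makes a failing run show every broken criterion at once.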

Labels

  • P2: Normal backlog priority with limited blast radius.
  • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:other: This issue has meaningful maintainer-visible impact outside the owned taxonomy.
  • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
  • maintainer: Maintainer-authored PR
