Plugin Hooks: Missing trace context fields for distributed tracing
Problem
The current Plugin Hooks event/context data is insufficient for building accurate distributed traces, particularly in these scenarios:
Concurrent messages in group chats: Multiple users sending messages simultaneously - cannot distinguish which hook event belongs to which user
Span parent-child relationships: Cannot determine if LLM/Tool spans are inside agent.run
Cross-hook correlation: Some hooks lack key identifiers (e.g., agent_end has no runId)
This prevents observability plugins (like opik-openclaw) from building accurate trace trees based on plugin hooks.
Scenario 1: Concurrent Group Chat Messages (Critical)
Background
Feishu group chat (groupId: group-abc123)
User A and User B send messages at the same time
Problem
[t0 ] User A sends: "Check the weather"
[t0+5ms] User B sends: "Write some code"
OpenClaw processing:
[t0+20 ] before_agent_start event
sessionKey: "feishu:group:group-abc123" ← Same for both users
prompt: "Check the weather"
❌ No senderId
❌ No messageId
❌ No runId
[t0+25 ] before_agent_start event
sessionKey: "feishu:group:group-abc123" ← Same
prompt: "Write some code"
❌ No senderId
❌ No messageId
[t0+50 ] llm_input event (User A)
sessionKey: "feishu:group:group-abc123" ← Same
runId: "???" ← Critical! If same, cannot distinguish
[t0+55 ] llm_input event (User B)
sessionKey: "feishu:group:group-abc123" ← Same
runId: "???"
Impact
❌ Plugins cannot determine which before_agent_start corresponds to which llm_input
❌ Cannot build separate traces (User A and B's calls get mixed together)
❌ Observability completely fails in group chat scenarios
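The collision described above can be illustrated with a small sketch. The `HookEvent` shape and the `groupIntoTraces` helper are hypothetical and only model the fields discussed in this issue; they are not the actual OpenClaw plugin API. The sketch assumes each run carries a distinct `runId`, which is the behaviour this issue asks for on all hooks.

```typescript
// Hypothetical event shape: sessionKey is what hooks expose today,
// runId is the field this issue asks to have on every hook.
interface HookEvent {
  hook: "before_agent_start" | "llm_input";
  sessionKey: string;
  runId?: string;
  prompt?: string;
}

// Group events into traces using whatever key the plugin has available.
function groupIntoTraces(
  events: HookEvent[],
  keyOf: (e: HookEvent) => string
): Record<string, HookEvent[]> {
  const traces: Record<string, HookEvent[]> = {};
  for (const e of events) {
    const key = keyOf(e);
    const bucket = traces[key] || [];
    bucket.push(e);
    traces[key] = bucket;
  }
  return traces;
}

// Two users in the same group chat: same sessionKey, different runs.
const sessionKey = "feishu:group:group-abc123";
const events: HookEvent[] = [
  { hook: "before_agent_start", sessionKey, runId: "run-a", prompt: "Check the weather" },
  { hook: "before_agent_start", sessionKey, runId: "run-b", prompt: "Write some code" },
  { hook: "llm_input", sessionKey, runId: "run-a" },
  { hook: "llm_input", sessionKey, runId: "run-b" },
];

// Keyed by sessionKey, both users collapse into a single trace.
const bySession = groupIntoTraces(events, (e) => e.sessionKey);
// Keyed by runId (falling back to sessionKey), each message gets its own trace.
const byRun = groupIntoTraces(events, (e) => e.runId ?? e.sessionKey);
```

With only `sessionKey` available, all four events land in one bucket; with a per-run identifier they split into two traces of two events each.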
Cannot determine if LLM/Tool spans are inside agent.run
Current hook data
before_agent_start context: {
  sessionKey: "session-123",
  // ❌ No spanId (cannot pass to subsequent hooks)
}
llm_input context: {
  sessionKey: "session-123",
  // ❌ No parentSpanId (cannot determine parent)
  // ❌ No agentRunSpanId
}
Ideal trace structure
openclaw.request.handle (root)
├─ openclaw.message.routing
├─ openclaw.agent.run ← Should be a child span
│ ├─ openclaw.llm ← Should nest under agent.run
│ └─ openclaw.tool.Read ← Should nest
└─ openclaw.message.send
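As a rough sketch of why `parentSpanId` matters, the following shows how a plugin could turn a flat list of spans into the tree above. The `Span` type and `buildTraceTree` function are hypothetical names introduced for illustration; the span IDs are made up.

```typescript
// Hypothetical span record; spanId and parentSpanId are the fields proposed in this issue.
interface Span {
  spanId: string;
  parentSpanId?: string;
  name: string;
}

interface SpanNode extends Span {
  children: SpanNode[];
}

// Link each span to its parent. Spans with no parent, or whose parent is
// not in the list, are returned as roots.
function buildTraceTree(spans: Span[]): SpanNode[] {
  const nodes: SpanNode[] = spans.map((s): SpanNode => ({ ...s, children: [] }));
  const byId: Record<string, SpanNode> = {};
  for (const n of nodes) {
    byId[n.spanId] = n;
  }
  const roots: SpanNode[] = [];
  for (const n of nodes) {
    const parent = n.parentSpanId !== undefined ? byId[n.parentSpanId] : undefined;
    if (parent) {
      parent.children.push(n);
    } else {
      roots.push(n);
    }
  }
  return roots;
}

const spans: Span[] = [
  { spanId: "s1", name: "openclaw.request.handle" },
  { spanId: "s2", parentSpanId: "s1", name: "openclaw.message.routing" },
  { spanId: "s3", parentSpanId: "s1", name: "openclaw.agent.run" },
  { spanId: "s4", parentSpanId: "s3", name: "openclaw.llm" },
  { spanId: "s5", parentSpanId: "s3", name: "openclaw.tool.Read" },
  { spanId: "s6", parentSpanId: "s1", name: "openclaw.message.send" },
];

// With parent links: one root with nested children.
const nested = buildTraceTree(spans);
// Without parent links (today's hook data): every span becomes a root.
const flat = buildTraceTree(spans.map((s) => ({ spanId: s.spanId, name: s.name })));
```

Dropping the parent links reproduces the current situation: the same six spans come back as six unrelated roots.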
export type PluginHookMessageSendingEvent = {
  to: string;
  content: string;
  metadata?: Record<string, unknown>;
  // ✅ NEW: Correlation fields
  sessionKey?: string; // Link to session
  runId?: string; // Link to agent run
  messageId?: string; // Link to inbound message
};
Priority
P0 (Critical - Correctness Issues)
✅ Add runId to before_agent_start
✅ Add runId to agent_end
✅ Add messageId and senderId to all contexts
Impact:
Traces for concurrent group chat messages cannot be told apart
Agent boundaries cannot be precisely determined
P1 (Important - Usability Issues)
✅ Add parentSpanId to contexts
✅ Add sessionKey to message hooks
Impact:
Cannot build nested trace trees
message_sending/sent cannot link to sessions
P2 (Enhancement - Improved Features)
✅ Add callDepth
✅ Add currentSpanId
Impact:
Cannot automatically infer nesting levels
Plugins must maintain span context themselves
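To make the P2 impact concrete, here is a sketch of the bookkeeping a plugin would have to carry itself when hooks do not supply `currentSpanId`, `parentSpanId`, or `callDepth`. The `RunSpanStack` class and the generated span IDs are hypothetical, and the sketch assumes a `runId` is available on every hook, which is not the case today.

```typescript
// Hypothetical bookkeeping kept by the plugin, one stack of open spans per run.
interface OpenSpan {
  spanId: string;
  parentSpanId: string | undefined;
  callDepth: number;
  name: string;
}

class RunSpanStack {
  private stacks: Record<string, OpenSpan[]> = {};
  private nextId = 0;

  // Open a span under the innermost span that is still open for the same run.
  start(runId: string, name: string): OpenSpan {
    const stack = this.stacks[runId] || [];
    this.stacks[runId] = stack;
    const parent = stack.length > 0 ? stack[stack.length - 1] : undefined;
    this.nextId += 1;
    const span: OpenSpan = {
      spanId: `span-${this.nextId}`,
      parentSpanId: parent ? parent.spanId : undefined,
      callDepth: stack.length,
      name,
    };
    stack.push(span);
    return span;
  }

  // Close the innermost open span of the run, if there is one.
  end(runId: string): OpenSpan | undefined {
    const stack = this.stacks[runId];
    return stack ? stack.pop() : undefined;
  }
}

const spanStack = new RunSpanStack();
const agentRun = spanStack.start("run-a", "openclaw.agent.run");
const llm = spanStack.start("run-a", "openclaw.llm");
// A concurrent run in the same group chat gets its own independent stack.
const otherRun = spanStack.start("run-b", "openclaw.agent.run");
```

A stack like this only works when spans within a run nest strictly; calls that overlap within one run would need explicit parent IDs from the host, which is part of the motivation for exposing them in the hook context.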
Reproduction
Test Code
const sessionKey = "feishu:group:group-123";

// Simulate User A
emit('inbound_claim', { senderId: 'user-A', messageId: 'msg-A' }, { channelId: 'feishu' });
emit('before_agent_start', { prompt: 'Check weather' }, {
  sessionKey, // ← Same
  // ❌ No senderId
  // ❌ No messageId
});

// Simulate User B (concurrent)
emit('inbound_claim', { senderId: 'user-B', messageId: 'msg-B' }, { channelId: 'feishu' });
emit('before_agent_start', { prompt: 'Write code' }, {
  sessionKey, // ← Same
  // ❌ Cannot distinguish User A from User B
});
Expected: Two separate traces
Actual: Plugins cannot distinguish, must use correlation guessing (error-prone)
Benefits of Adding These Fields
With these fields, plugins can:
✅ Accurately distinguish concurrent group messages
before_agent_start context: {
  sessionKey: "group-123",
  messageId: "msg-A", // ✅ Links to specific message
  senderId: "user-A", // ✅ Distinguishes users
  runId: "run-abc", // ✅ Unique identifier
}
agent_end event: {
  success: true,
  runId: "run-abc", // ✅ Same runId as llm_input
}
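If `agent_end` carried the same `runId` as the earlier hooks, a plugin could close exactly the right run even when several runs interleave. The sketch below is illustrative only: `RunCorrelator`, its method names, and the timestamps are hypothetical and do not reflect the real hook signatures.

```typescript
// Hypothetical per-run record kept by an observability plugin.
interface RunRecord {
  runId: string;
  startedAt: number;
  endedAt: number | undefined;
  llmCalls: number;
}

class RunCorrelator {
  private runs: Record<string, RunRecord> = {};

  onBeforeAgentStart(runId: string, at: number): void {
    this.runs[runId] = { runId, startedAt: at, endedAt: undefined, llmCalls: 0 };
  }

  onLlmInput(runId: string): void {
    const run = this.runs[runId];
    if (run) {
      run.llmCalls += 1;
    }
  }

  // Close the run identified by runId and return its final record.
  onAgentEnd(runId: string, at: number): RunRecord | undefined {
    const run = this.runs[runId];
    if (!run) {
      return undefined; // agent_end for a run this plugin never saw start
    }
    run.endedAt = at;
    delete this.runs[runId];
    return run;
  }
}

// Two interleaved runs in the same session.
const correlator = new RunCorrelator();
correlator.onBeforeAgentStart("run-a", 20);
correlator.onBeforeAgentStart("run-b", 25);
correlator.onLlmInput("run-a");
correlator.onLlmInput("run-b");
correlator.onLlmInput("run-b");
const finishedB = correlator.onAgentEnd("run-b", 90);
const finishedA = correlator.onAgentEnd("run-a", 120);
```

Without `runId` on `agent_end`, the plugin has no reliable way to decide which of the two open runs just finished.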
Context
We're implementing an observability plugin (extending opik-openclaw) and discovered these limitations.
The diagnostics event system has some data but lacks tool call spans (see opik-openclaw issue #37), while plugin hooks provide fine-grained events but lack unique identifiers and nesting context.
Ideally: Plugin hooks should have both fine-grained events AND complete trace context, enabling third-party plugins to build complete, accurate distributed traces.
export async function runAgentTurnWithFallback(params: {...}) {
  const runId = params.opts?.runId ?? crypto.randomUUID();
  // ↑ New UUID per function call

  // Pass to agent runner
  const result = await runEmbeddedPiAgent({
    ...
    runId, // ✅ runId is available
  });
}
Call chain (per user message):
User message
→ dispatchReplyFromConfig()
→ getReplyFromConfig()
→ runPreparedReply()
→ runReplyAgent()
→ runAgentTurnWithFallback() ← Generates runId HERE
→ crypto.randomUUID()
Conclusion: ✅ runId is unique per message, even in group chat concurrent scenarios.
Evidence 2: runId is Available but Not Passed to before_agent_start
// before_agent_start hook trigger
hookRunner.runBeforeAgentStart(
  {
    prompt: params.prompt,
    messages: params.messages,
    // ❌ No runId in event (even though params.runId exists!)
  },
  params.hookCtx // ❌ No runId in context either
)
Scenario 2: Missing Span Parent-Child Relationships
Problem
Cannot determine if LLM/Tool spans are inside agent.run
Current hook data
Ideal trace structure
Actual achievable structure
Without parentSpanId, plugins can only create flat span lists, not hierarchical trace trees.
Scenario 3: Cross-Hook Correlation Failure
Problem
agent_end has no runId
Impact
Proposed Solution
1. Add common trace fields to all Hook Contexts
2. Add runId to all Hook Events (minimum requirement)
Hooks currently WITH runId:
Hooks currently WITHOUT runId:
Suggestion:
3. Add correlation fields to message hooks
Benefits of Adding These Fields
With these fields, plugins can:
✅ Accurately distinguish concurrent group messages
✅ Build accurate nested traces
✅ Precisely correlate across hooks
Related
🔍 Code Evidence: runId is Generated but Not Passed to All Hooks
Evidence 1: runId Generation (Per-Message)
File: src/auto-reply/reply/agent-runner-execution.ts:110
Call chain (per user message):
Conclusion: ✅ runId is unique per message, even in group chat concurrent scenarios.
Evidence 2: runId is Available but Not Passed to before_agent_start
File: src/agents/pi-embedded-runner/run/attempt.ts:1197
Compare with llm_input (line 2485):
At this point:
params.runId exists (generated at line 110 of agent-runner-execution.ts)
before_agent_start is triggered AFTER runId generation
Evidence 3: agent_end Also Missing runId
Similar situation: runId exists throughout agent execution but is not passed to the agent_end hook.
Impact on Group Chat Concurrent Messages
Scenario:
Current problem:
If runId were included:
🎯 Summary
The fix should be straightforward:
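Add runId: string to PluginHookBeforeAgentStartEvent
Add runId: string to PluginHookAgentEndEvent
Add runId?: string to message hook events
Pass params.runId when triggering these hooks
Chinese summary (translated)
Problem description
The event/context data currently provided by Plugin Hooks is not sufficient to build accurate distributed traces, particularly in the following scenarios:
Cross-hook correlation: some hooks lack key identifiers (e.g., agent_end has no runId)
As a result, observability plugins built on plugin hooks (such as opik-openclaw) cannot build accurate trace trees.
Scenario 1: Concurrent group chat messages (critical)
Background
Feishu group chat (groupId: group-abc123)
Problem
In a group chat the sessionKey is the same for all users (feishu:group:group-abc123), and the contexts of key hooks such as before_agent_start and llm_input have:
No senderId (users cannot be distinguished)
No messageId (events cannot be linked to a specific message)
No runId on before_agent_start
As a result, concurrent messages from two users cannot be told apart, and their traces get mixed together.
Scenario 2: Missing span parent-child relationships
Problem
Hook events carry no parentSpanId, so the nesting of calls cannot be determined.
The ideal trace structure is nested (agent.run contains the LLM/Tool spans), but currently only a flat structure is possible (all spans are siblings).
Scenario 3: Cross-hook correlation failure
agent_end has no runId, so it cannot be precisely correlated with llm_input and the boundaries of an agent run cannot be determined.
Proposed solution
Add to all Hook Contexts:
messageId - links to the specific message
senderId - distinguishes users (group chat scenario)
runId - a single run identifier (present on all hooks)
parentSpanId - explicit parent span identifier
currentSpanId - the current span (for use by subsequent hooks)
callDepth - call depth
Priority
P0 (Critical)
Add runId to before_agent_start and agent_end
Add messageId and senderId
P1 (Important)
Add parentSpanId
Add sessionKey
P2 (Improvement)
Add callDepth and currentSpanId
Context
We ran into these limitations while implementing an observability plugin (extending opik-openclaw).
The diagnostics event system has some of the data but lacks tool call spans (see opik-openclaw issue #37), while plugin hooks provide fine-grained events but lack unique identifiers and nesting information.
Ideally, plugin hooks would provide both fine-grained events and complete trace context, so that third-party plugins can build complete, accurate distributed traces.
Related links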