Skip to content

fix(exec): route notifyOnExit through exec-event and add heartbeat completion tracing#42394

Closed
ChauncyZhang wants to merge 1 commit into
openclaw:mainfrom
ChauncyZhang:fix/async-completion-ack
Closed

fix(exec): route notifyOnExit through exec-event and add heartbeat completion tracing#42394
ChauncyZhang wants to merge 1 commit into
openclaw:mainfrom
ChauncyZhang:fix/async-completion-ack

Conversation

@ChauncyZhang

Copy link
Copy Markdown

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem:
  • Async exec completion notification path was not using the canonical exec-event wake flow.
  • Completion event text/wake reason shape made exec-completion handling less consistent.
  • Why it matters:
  • Long-running task completion handoff could be unreliable, causing “started but no final completion notification” behavior in some flows.
  • What changed:
  • In notifyOnExit, route completion through emitExecSystemEvent(...) unified path.
  • Normalize completion event text to exec finished: ... so exec-completion filters can reliably detect it.
  • Normalize wake reason to canonical exec-event (instead of dynamic exec:<id>:exit) for consistent heartbeat reason handling.
  • Update tests in bash-tools.test.ts to match new completion text + wake reason.
  • What did NOT change (scope boundary):
  • No new tool capabilities.
  • No auth/token logic changes.
  • No API contract changes.
  • No new logging/tracing behavior included in this final cleaned PR.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #N/A
  • Related #N/A

User-visible / Behavior Changes

  • Improves reliability of async completion handoff for background exec tasks.
  • No UI changes.
  • Note: deployment config may need agents.defaults.heartbeat.target = "last" to relay async completion back to current chat.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:
  • N/A

Repro + Verification

Environment

  • OS: Ubuntu (WSL2)
  • Runtime/container: OpenClaw source runtime (Node.js)
  • Model/provider: openai/gpt-5.3-codex
  • Integration/channel (if any): Feishu DM
  • Relevant config (redacted):
  • agents.defaults.heartbeat.target = "last"

Steps

  1. Start gateway from source branch.
  2. Trigger background exec flow (e.g., clone repo via chat task).
  3. Verify completion is handed off and user receives completion message.

Expected

  • Start acknowledgement is sent.
  • Completion path is recognized as exec-event and completion message is relayed.

Actual

  • After fix: completion relay works in repeated real runs.

Evidence

Attach at least one:

  • Failing behavior before + passing after (real run logs)
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
  • Multiple real clone tasks from chat completed with final completion messages.
  • Completion wake path behaved consistently with canonical exec-event.
  • Edge cases checked:
  • notifyOnExit success path and test expectation alignment.
  • What you did not verify:
  • Non-Feishu channels end-to-end.
  • Multi-account routing permutations beyond current environment.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (Yes)
  • Migration needed? (Yes)
  • If yes, exact upgrade steps:
  1. Deploy this code change.
  2. Ensure config has agents.defaults.heartbeat.target = "last" if async completion should return to current chat.
  3. Restart gateway.

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
  • Revert this commit and restart gateway.
  • Files/config to restore:
  • src/agents/bash-tools.exec-runtime.ts
  • src/agents/bash-tools.test.ts
  • (optional runtime config) agents.defaults.heartbeat.target
  • Known bad symptoms reviewers should watch for:
  • Background exec finishes but no completion relay to user.
  • Completion path falls back to non-canonical reason and is skipped.

Risks and Mitigations

  • Risk:
  • Behavior relies on heartbeat delivery target config in deployment.
  • Mitigation:
  • Explicitly document and set agents.defaults.heartbeat.target = "last" in runtime config.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: XS labels Mar 10, 2026
@greptile-apps

greptile-apps Bot commented Mar 10, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR standardizes the async exec completion notification path by routing notifyOnExit through the existing emitExecSystemEvent unified helper, normalizing the event text prefix to exec finished: ... and the heartbeat wake reason to the canonical "exec-event" string.

Key observations:

  • Heartbeat reason change: The wake reason changes from the per-session dynamic string (previously exec:<session-id>:exit) to the static "exec-event". Any code outside these two files that pattern-matched on the old dynamic reason format would silently stop matching — worth a quick grep across heartbeat consumer code before deploying.
  • contextKey now included on exit events: The new call path passes contextKey carrying the session-scoped exit identifier to enqueueSystemEvent, whereas the old inline call did not. This is a net improvement (per-session event deduplication at the queue level) and is consistent with how other exec events are emitted, but it is a subtle behavioral delta not called out in the PR description.
  • Test coverage is correct: Both wake-reason tests correctly drop the now-unused sessionId variable and assert on the new static reason. The toMatchObject vs toEqual distinction between the scoped and unscoped cases is preserved. OUTPUT_EXEC_COMPLETED is updated to match the new event text prefix.

Confidence Score: 4/5

  • Safe to merge; changes are logically correct and tests are updated, with one undocumented behavioral delta worth verifying before deployment.
  • The refactoring correctly consolidates duplicate inline logic into the existing emitExecSystemEvent helper, the heartbeat reason normalization is intentional and well-tested, and the session.exitNotified guard prevents duplicate emissions. The score is 4 rather than 5 because the implicit addition of contextKey to the exit system event (changing deduplication behavior) is not documented in the PR description, and there is no in-PR evidence that downstream heartbeat consumers relying on the old dynamic reason format have been audited.
  • No files require special attention beyond the noted behavioral delta in src/agents/bash-tools.exec-runtime.ts.

Last reviewed commit: d375663

@holgergruenhagen

Copy link
Copy Markdown

We reproduced this bug live on Discord and confirmed that the minimal wake-reason change in #41479 resolves the local background-exec completion-notification failure for us.

So from our downstream validation:

I’m linking this here because it appears architecturally aligned with what we saw in production, but the exact live-validated fix in our environment was the narrower change from exec:<session-id>:exit to exec-event.

No objection to the broader cleanup here — just noting that the concrete live proof we have maps most directly to #41479.

Related exact minimal fix: #41479

@vincentkoc vincentkoc added close:duplicate Closed as duplicate dedupe:child Duplicate issue/PR child in dedupe cluster labels Apr 5, 2026
@vincentkoc

Copy link
Copy Markdown
Member

closing this in favor of #41479. both fix the notifyOnExit to exec-event wake mismatch, but #41479 is the tighter branch and the better canonical path. if the extra tracing in this branch is still wanted, that should probably come back as a separate follow-up.

@vincentkoc vincentkoc closed this Apr 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling close:duplicate Closed as duplicate dedupe:child Duplicate issue/PR child in dedupe cluster size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants