[Bug]: Native Hook Relay — Stale Relay After Gateway Restart/Drain #89325
Open
Labels
P2, bug, bug:behavior, clawsweeper:needs-live-repro, clawsweeper:needs-maintainer-review, clawsweeper:needs-product-decision, clawsweeper:needs-security-review, clawsweeper:no-new-fix-pr, impact:session-state, issue-rating: 🐚 platinum hermit
Bug type
Behavior bug (incorrect output/state without crash)
Beta release blocker
No
Summary
After any gateway restart or drain, the main Codex session loses all native shell tools (exec, read, etc.), which fail with "Native hook relay unavailable". Only another openclaw gateway restart recovers it, and the problem recurs. Sub-sessions are unaffected because they create fresh relay descriptors.
Steps to reproduce
NOT_ENOUGH_INFO
Expected behavior
NOT_ENOUGH_INFO
Actual behavior
NOT_ENOUGH_INFO
OpenClaw version
2026.5.26
Operating system
Windows 10 WSL2
Install method
docker
Model
openai/gpt-5.5 codex runtime
Provider / routing chain
openclaw > openai through oauth
Additional provider/model setup details
No response
Logs, screenshots, and evidence
Impact and severity
Affected: me
Severity: annoying and blocks workflows
Frequency: every time after a gateway restart. A local workaround seems to have resolved it, but it will break again on the next openclaw update.
Consequence: Agent has to spin up a fresh sub-agent for basic tasks.
Additional information
Verbatim output from agent:
Bug: Native Hook Relay — Stale Relay After Gateway Restart/Drain
OpenClaw 2026.5.26 · Node v24.15.0 · WSL2
Summary
After any gateway restart or drain, the main Codex session loses all native shell tools (exec, read, etc.), which fail with "Native hook relay unavailable". Only another openclaw gateway restart recovers it, and the problem recurs. Sub-sessions are unaffected because they create fresh relay descriptors.
Root cause: two related issues.
Issue A: Stale descriptor survival
openclaw-clean-stale-native-relays only quarantines descriptors whose PID is dead. Because PIDs can be reused, a descriptor left over from the old gateway can survive while pointing at an unrelated process that now holds the same PID. The check needs to verify that the live PID actually belongs to an OpenClaw gateway, not just to any running process.
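A minimal sketch of the stricter check described in Issue A, assuming Linux (WSL2/docker) so /proc is available. The descriptor shape ({ pid }), the helper names, and the "argv mentions openclaw and gateway" heuristic are all hypothetical; this is not the actual openclaw-clean-stale-native-relays code.

```javascript
const fs = require("node:fs");

// True if a process with this PID exists. Signal 0 delivers nothing;
// it only runs the kernel's existence/permission check.
function pidExists(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but is owned by another user.
    return err.code === "EPERM";
  }
}

// Reads the NUL-separated argv of a Linux process from /proc.
// Returns null if the process vanished or /proc is unavailable.
function readCmdline(pid) {
  try {
    return fs.readFileSync(`/proc/${pid}/cmdline`, "utf8").split("\0").filter(Boolean);
  } catch {
    return null;
  }
}

// Hypothetical heuristic: treat the process as a gateway only if its argv
// mentions both "openclaw" and "gateway". The real process signature may differ.
function looksLikeGateway(argv) {
  if (!argv) return false;
  const joined = argv.join(" ").toLowerCase();
  return joined.includes("openclaw") && joined.includes("gateway");
}

// A descriptor is stale if its PID is dead OR the PID was reused by a
// process that is not an OpenClaw gateway. `deps` is injectable for testing.
function isStaleDescriptor(descriptor, deps = { pidExists, readCmdline }) {
  if (!deps.pidExists(descriptor.pid)) return true;
  return !looksLikeGateway(deps.readCmdline(descriptor.pid));
}
```

A stricter variant could also record the gateway's start time in the descriptor and compare it with the starttime field of /proc/<pid>/stat, since a process that reused the PID would have a later start time.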
Issue B: Hard deny on relay unavailability
When the relay bridge is unavailable (stale, port mismatch, or restarting), the Codex approval path in hooks-cli-*.js / vision-tools-*.js throws "Native hook relay unavailable" and treats it as a hard deny instead of falling back to the normal OpenClaw policy. As a result, a transient relay liveness problem blocks every native tool call for the rest of the Codex run.
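A minimal sketch of the fallback proposed in Issue B. RelayUnavailableError, relay.approve, and policy.evaluate are hypothetical stand-ins, not real OpenClaw APIs; the point is that only the "relay unavailable" case falls back to policy, while other errors still propagate.

```javascript
// Hypothetical error type standing in for whatever the hook code throws
// when the relay bridge is stale, on the wrong port, or restarting.
class RelayUnavailableError extends Error {
  constructor(message = "Native hook relay unavailable") {
    super(message);
    this.name = "RelayUnavailableError";
  }
}

// Ask the relay first; if it is unreachable, defer to the normal policy
// instead of turning a transient outage into a hard deny.
async function approveToolCall(request, relay, policy, log = console.warn) {
  try {
    return await relay.approve(request);
  } catch (err) {
    // Unrelated failures (bugs, malformed requests) are not masked.
    if (!(err instanceof RelayUnavailableError)) throw err;
    log(`relay unavailable, falling back to policy: ${err.message}`);
    return policy.evaluate(request);
  }
}
```

Whether an unavailable relay should fall back to policy (rather than fail closed) is a security decision; the issue is tagged for security review, so treat this as one option, not a settled design.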
Suggested fixes:
Workaround applied locally (dist patches, will be overwritten by updates):
• hooks-cli-Cy_Rqd6n.js — retry via gateway instead of immediately returning unavailable
• vision-tools-sfmDmVa9.js and @openclaw/codex/dist/vision-tools-DqpLmF5H.js — fall back to normal policy on relay failure
• openclaw-clean-stale-native-relays: also quarantines descriptors whose PID now belongs to a different (non-gateway) process
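The first patch (retry via the gateway instead of immediately returning unavailable) amounts to bounded retry with exponential backoff. A minimal sketch, assuming a hypothetical async connect function that resolves to a relay handle; the attempt count and delays are illustrative defaults, not values taken from the patch.

```javascript
const { setTimeout: sleep } = require("node:timers/promises");

// Retry a relay connection with exponential backoff so a short gateway
// restart/drain window is ridden out. After the last attempt the final
// error is rethrown, so a genuinely dead relay still surfaces.
async function connectWithRetry(connect, { attempts = 5, baseDelayMs = 200, maxDelayMs = 2000 } = {}) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await connect(i);
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        // 200ms, 400ms, 800ms, ... capped at maxDelayMs.
        await sleep(Math.min(baseDelayMs * 2 ** i, maxDelayMs));
      }
    }
  }
  throw lastErr;
}
```

Keeping the retry bounded matters: an unbounded loop would turn "relay unavailable" into a hung tool call instead of a clear error.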
Post-patch: openclaw hooks check 7/7 ready.
Impact: Happened 4+ times in a single session (2026-06-02). Blocks all native tools; requires manual user intervention each time.