fix(native-hook-relay): prune stale bridge files on registration#87563
fix(native-hook-relay): prune stale bridge files on registration#87563Applied-AI-Solutions-hub wants to merge 5 commits into
Conversation
|
Codex review: passed. Reviewed May 28, 2026, 11:49 AM ET / 15:49 UTC. Summary PR surface: Source +48, Tests +111, Config 0. Total +159 across 3 files. Reproducibility: yes. The PR body provides live before/after terminal and log proof for stale dead-PID bridge files, and source inspection shows current main and v2026.5.27 register bridges without stale-file pruning; I did not rerun the live gateway scenario locally. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Land the guarded stale bridge-file prune with its tests once maintainers accept the shared-uid deletion semantics and bundled CI heap change, keeping broader fail-open relay policy for a separate discussion. Do we have a high-confidence way to reproduce the issue? Yes. The PR body provides live before/after terminal and log proof for stale dead-PID bridge files, and source inspection shows current main and v2026.5.27 register bridges without stale-file pruning; I did not rerun the live gateway scenario locally. Is this the best way to solve the issue? Yes, with maintainer risk acceptance. Pruning dead-PID or expired bridge records before writing the current bridge is a narrow fix, and the branch preserves live same-uid records with tests; the CI heap increase is separable but not a correctness defect. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 79e733cc34a3. Label changesLabel justifications:
Evidence reviewedPR surface: Source +48, Tests +111, Config 0. Total +159 across 3 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3c10b3d807
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (rec && typeof rec.pid === "number" && rec.pid !== process.pid) { | ||
| rmSync(full, { force: true }); |
There was a problem hiding this comment.
Preserve live bridge records from other processes
When two OpenClaw/Codex gateway or agent processes run under the same OS user, they share this uid-scoped bridge directory while their records are keyed by different relay ids. This unconditional pid !== process.pid prune lets whichever process registers next delete the other live process's bridge file, so the other session's hook CLI can no longer find its direct bridge and falls back to unavailable behavior until it re-registers; prune should verify the recorded PID is actually dead or the record is expired instead of treating every non-current PID as stale.
Useful? React with 👍 / 👎.
3c10b3d to
b2fbd28
Compare
|
Checked this PR against
Net: this fix is complementary to what's already in flight for 2026.5.27, not duplicative. Happy to retarget to — Adam |
|
ClawSweeper PR egg: ✨ hatched 🥚 common Clockwork Proofling. Rarity: 🥚 common. Trait: polishes edge cases. DetailsShare on X: post this hatch
About:
|
722d492 to
0d6d3a7
Compare
|
Quick triage of the remaining failing checks against the diff in this PR. All three failures are in files this PR does not modify, and per
|
Only the current gateway process should own records in the native hook relay bridge directory. After a crash, SIGUSR1 supervisor restart, or any non-graceful shutdown, stale bridge files (pointing to dead PIDs and unbound ports) can remain in /tmp/openclaw-native-hook-relays-<uid>/. When Codex clients enumerate the bridge directory and select a stale registration, every native tool call fails with 'Native hook relay unavailable' until the next clean restart. This is the recurring root cause behind openclaw#73723 and openclaw#87536. Fix: at the start of registerNativeHookRelayBridge, scan the bridge directory and remove any .json record whose 'pid' field does not match process.pid. This makes the relay system self-healing across crash/ restart cycles without requiring operators to manually clean up the directory or rely on ExecStartPre hooks. Refs: openclaw#87536, openclaw#73723 Signed-off-by: Adam Houk <adam@appliedai.solutions>
8a38d75 to
241d574
Compare
|
Thanks for the careful review, ClawSweeper — the foreign-live-record concern is valid and I've now addressed it. Updated behaviorThe prune predicate is no longer "PID ≠ mine → delete". It now requires the foreign PID to be dead (or the record to be expired) before we touch the file: if (!rec || typeof rec.pid !== "number" || rec.pid === process.pid) {
continue;
}
let pidAlive = false;
try {
process.kill(rec.pid, 0);
pidAlive = true;
} catch {
pidAlive = false;
}
const expired = typeof rec.expiresAtMs === "number" && now > rec.expiresAtMs;
if (pidAlive && !expired) {
// Live foreign record from another same-uid gateway/profile. Preserve it.
continue;
}
rmSync(full, { force: true });This means:
Why
|
|
@clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
@clawsweeper re-review |
|
🦞👀 Command router queued. I will update this comment with the next step. |
|
Wrapping up the contributor-side work on this PR. Posting one consolidated status so maintainers have a clean picture. What's now in this PR
CI stateGreen:
Red, and not contributor-fixable from this diff:
Per What I'd ask of a reviewer
The branch is Stepping back — this is now in maintainer territory. Cheers. — Adam |
|
@clawsweeper re-review |
|
🦞👀 Command router queued. I will update this comment with the next step. Re-review progress:
|
|
Adding one more real-world data point from a macOS production OpenClaw install, because the PR body says macOS was not tested. Environment:
Observed failure before containment:
Containment result:
Why this looks related to this PR:
Useful extra signal for maintainers:
I have not patched this production install with the PR branch yet; this comment is containment/repro evidence from the live macOS install after restoring service with a gateway restart. |
|
@clawsweeper visualize |
|
🦞👀 I queued a read-only visual pass. It will create or update one marker-backed visual brief comment and will not trigger close, merge, repair, label, or branch changes. Lens: |
|
Source: #87563 (comment) Visual briefLens: flow + state Advisory only: this brief is for maintainer judgment; maintainers remain the final judges. Proof map: Maintainer decision points: Legend: ✅ expected/proven, ❌ failing path, Maintainer rulingSelf-heals stale native hook bridge files after crashes/restarts while preserving live same-uid gateway records. Shared uid bridge-dir semantics, PID recycling false-preserve behavior, and bundled CI heap change need maintainer acceptance. Exact-head required checks green or explicitly judged unrelated; maintainer confirmation that dead/expired cleanup semantics are correct across supported gateway/profile layouts. Maintainer review focused on shared bridge-dir lifecycle, bundled workflow heap change, and exact-head check status. Should this targeted stale bridge-file cleanup land as-is, or should the CI heap change / broader relay policy questions be split from the runtime fix? |
|
@clawsweeper automerge |
|
🦞🔧 Source: I will update this PR branch, or open a safe credited replacement, if the repair worker finds a narrow fix. Automerge progress:
|
|
@clawsweeper automerge |
|
🦞👀 Command router queued. I will update this comment with the next step. Re-review progress:
|
|
ClawSweeper 🐠 reef update Thanks for the work on this. ClawSweeper did not have permission to update this branch directly, so it opened a narrow replacement PR instead. that's a branch access thing, not a knock on the contribution. Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
fish notes: model gpt-5.5, reasoning high; reviewed against baf3826. |
Summary
Fixes the root cause behind issues #87536 and #73723: every native tool call in a Codex-backed session can return
Native hook relay unavailablebecause Codex picks a stale bridge registration file from a dead prior gateway process.After a crash,
SIGUSR1supervisor-restart, or any non-graceful shutdown, the file in/tmp/openclaw-native-hook-relays-<uid>/<hash>.jsonis left pointing to a dead PID and an unbound port. The current gateway's bridge then competes with these stale records in the directory, and Codex clients can pick an unreachable bridge. Once a client locks onto a stale entry, every hook invocation for that session fails until the next clean restart or manual cleanup.Fix
In
registerNativeHookRelayBridge, before registering the current process bridge, scan the bridge directory and remove.jsonrecords whose owning PID is dead or whoseexpiresAtMshas passed. Records owned by other live same-uid OpenClaw processes, such as a parallelopenclaw-beacon-gateway.service, are preserved.This PR now includes focused regression coverage for the review-sensitive predicates:
The
check-guardsplugin SDK declaration smoke also gets a larger Node heap for that exact command. The CI failure was an OOM innode scripts/run-tsgo.mjs -p tsconfig.plugin-sdk.dts.json --declaration true, not a native-hook-relay runtime failure.Real behavior proof
Behavior or issue addressed: Every native tool call in a Codex-backed session returns
Native hook relay unavailablewhenever/tmp/openclaw-native-hook-relays-<uid>/contains stale.jsonrecords pointing to dead PIDs from prior gateway processes. Codex clients can select a stale bridge, the bridge HTTP server is gone, and every PreToolUse hook invocation is blocked. This blocksdate,time date, file reads, MCP memory queries, and other native tool calls until the operator manually cleans the directory and restarts the gateway. The fix must also avoid breaking same-uid multi-gateway setups by preserving bridge records from other live OpenClaw processes.Real environment tested: OpenClaw
2026.5.26running as a systemd user service (openclaw-gateway.service) on Ubuntu 22.04 in WSL2 on Windows 11, Node 24.15.0 bundled with the gateway runtime, Codex plugin enabled withtransport: stdio,mode: yolo,approvalPolicy: never, andsandbox: danger-full-access. The live setup also had a parallelopenclaw-beacon-gateway.servicerunning under the same uid (1000), sharing/tmp/openclaw-native-hook-relays-1000/.Exact steps or command run after this patch:
dist/native-hook-relay-AN6S_wz5.json my live install.openclaw-gateway.serviceviasystemctl --user restart openclaw-gateway.service, leavingopenclaw-beacon-gateway.servicerunning.date, randateagain, then rantime date, with immediate and delayed retries across multiple Codex subagent sessions./tmp/openclaw-native-hook-relays-1000/andss -ltnbefore and after restart to confirm stale-record cleanup and live-record preservation.Evidence after fix: Before the patch, the bridge directory had five dead records plus one live ARC record and one live Beacon record:
Before the patch, live native tool calls from ARC were blocked:
After the patch, the runtime log at
/tmp/openclaw/openclaw-2026-05-28.logshowed only dead-PID records pruned:After the patch, native tool calls succeeded across multiple turns and time intervals:
The bridge directory after the patch contained only the two live processes:
Observed result after fix: Before this fix, the failure was reproducible after a
SIGUSR1supervisor restart or non-graceful gateway exit left even one stale bridge record behind. After this fix, the same test cycle produced zeroNative hook relay unavailableresponses across native tool invocations, while the parallel Beacon gateway continued operating without interruption. The added in-repo regression tests also cover dead foreign PID pruning, expired foreign record pruning, and live unexpired foreign record preservation.Not tested: I did not test on macOS or native Linux; the live proof was Ubuntu 22.04 in WSL2. I did not test the PID-recycling edge case where a previously-OpenClaw PID has been recycled by a non-OpenClaw process before registration; the predicate intentionally falls toward preserving that record until the recycled PID dies or
expiresAtMspasses, which is safer than deleting a potentially live foreign record. I did not run the entire full localpnpm build && pnpm check && pnpm testmatrix, but CI covers that breadth and the focused local checks listed below passed.Local validation
node scripts/run-vitest.mjs src/agents/harness/native-hook-relay.test.ts- 138 tests passednode scripts/run-oxlint.mjs src/agents/harness/native-hook-relay.test.ts src/agents/harness/native-hook-relay.ts- passedgit diff --check- passedcorepack pnpm check:workflows- passedNODE_OPTIONS="--max-old-space-size=8192" pnpm build:plugin-sdk:strict-smoke- passed locally after enabling a local pnpm shimWhy this is safe
process.kill(pid, 0)is signal-free under POSIX; it is a permission/existence probe only.What this PR does NOT do
This PR is intentionally scoped to defensive stale bridge-file cleanup. The broader architectural question of whether
renderNativeHookRelayUnavailableResponseshould fail open instead of fail closed is a separate safety-contract decision and should land in its own PR with an explicit config knob and maintainer discussion of the default.I also have local-only patches that defer the dedup file delete by 200ms and suppress timer-fired
handle.unregisterfrom the Codex plugin's run-attempt module. They are useful as defense in depth but are not strictly necessary once this prune lands, so they are not included here.Refs
Closes the recurring symptom in Native hook relay bridge never spawns on 2026.5.26 — all native/local tool calls return "Native hook relay unavailable" #87536
Addresses the same underlying cause as Native hook relay remains unavailable after gateway restart and hooks report 5/5 ready #73723
Complements
271baa1b47/42e9504114, which handle generation-grace behavior ininvokeNativeHookRelay; this PR handles stale filesystem bridge records duringregisterNativeHookRelayBridgeIncorporates ClawSweeper review feedback on same-uid multi-gateway preservation and focused predicate tests
Adam Houk, Applied AI Solutions