Skip to content

fix(native-hook-relay): prune stale bridge files on registration#87563

Closed
Applied-AI-Solutions-hub wants to merge 5 commits into
openclaw:mainfrom
Applied-AI-Solutions-hub:fix/native-hook-relay-stale-prune
Closed

fix(native-hook-relay): prune stale bridge files on registration#87563
Applied-AI-Solutions-hub wants to merge 5 commits into
openclaw:mainfrom
Applied-AI-Solutions-hub:fix/native-hook-relay-stale-prune

Conversation

@Applied-AI-Solutions-hub

@Applied-AI-Solutions-hub Applied-AI-Solutions-hub commented May 28, 2026

Copy link
Copy Markdown
Contributor

AI-assisted PR - diagnostic instrumentation, repro analysis, and patch authoring were done in a co-working session with Claude (Anthropic) on top of Cowork, and with ARC (a local OpenClaw agent on my install) as a second pair of eyes. All real-behavior tests below were run by me against my own gateway. I understand what the code does and have reviewed every line of the diff manually.

Summary

Fixes the root cause behind issues #87536 and #73723: every native tool call in a Codex-backed session can return Native hook relay unavailable because Codex picks a stale bridge registration file from a dead prior gateway process.

After a crash, SIGUSR1 supervisor-restart, or any non-graceful shutdown, the file in /tmp/openclaw-native-hook-relays-<uid>/<hash>.json is left pointing to a dead PID and an unbound port. The current gateway's bridge then competes with these stale records in the directory, and Codex clients can pick an unreachable bridge. Once a client locks onto a stale entry, every hook invocation for that session fails until the next clean restart or manual cleanup.

Fix

In registerNativeHookRelayBridge, before registering the current process bridge, scan the bridge directory and remove .json records whose owning PID is dead or whose expiresAtMs has passed. Records owned by other live same-uid OpenClaw processes, such as a parallel openclaw-beacon-gateway.service, are preserved.

This PR now includes focused regression coverage for the review-sensitive predicates:

  • dead foreign PID record is pruned
  • expired foreign record is pruned even if the PID is alive
  • live unexpired foreign record is preserved

The check-guards plugin SDK declaration smoke also gets a larger Node heap for that exact command. The CI failure was an OOM in node scripts/run-tsgo.mjs -p tsconfig.plugin-sdk.dts.json --declaration true, not a native-hook-relay runtime failure.

Real behavior proof

Behavior or issue addressed: Every native tool call in a Codex-backed session returns Native hook relay unavailable whenever /tmp/openclaw-native-hook-relays-<uid>/ contains stale .json records pointing to dead PIDs from prior gateway processes. Codex clients can select a stale bridge, the bridge HTTP server is gone, and every PreToolUse hook invocation is blocked. This blocks date, time date, file reads, MCP memory queries, and other native tool calls until the operator manually cleans the directory and restarts the gateway. The fix must also avoid breaking same-uid multi-gateway setups by preserving bridge records from other live OpenClaw processes.

Real environment tested: OpenClaw 2026.5.26 running as a systemd user service (openclaw-gateway.service) on Ubuntu 22.04 in WSL2 on Windows 11, Node 24.15.0 bundled with the gateway runtime, Codex plugin enabled with transport: stdio, mode: yolo, approvalPolicy: never, and sandbox: danger-full-access. The live setup also had a parallel openclaw-beacon-gateway.service running under the same uid (1000), sharing /tmp/openclaw-native-hook-relays-1000/.

Exact steps or command run after this patch:

  1. Applied the equivalent of this PR's source diff to the bundled dist/native-hook-relay-AN6S_wz5.js on my live install.
  2. Restarted only openclaw-gateway.service via systemctl --user restart openclaw-gateway.service, leaving openclaw-beacon-gateway.service running.
  3. Confirmed the Beacon gateway still answered requests after the ARC gateway restart, proving its bridge record was preserved rather than pruned.
  4. From ARC over the Discord channel, ran date, ran date again, then ran time date, with immediate and delayed retries across multiple Codex subagent sessions.
  5. Inspected /tmp/openclaw-native-hook-relays-1000/ and ss -ltn before and after restart to confirm stale-record cleanup and live-record preservation.

Evidence after fix: Before the patch, the bridge directory had five dead records plus one live ARC record and one live Beacon record:

$ ls -la /tmp/openclaw-native-hook-relays-1000/
-rw-------  pid=11601  port=37109   # dead from prior session
-rw-------  pid=17733  port=46593   # dead
-rw-------  pid=90952  port=38385   # dead
-rw-------  pid=211925 port=44387   # dead
-rw-------  pid=240009 port=43303   # dead
-rw-------  pid=243162 port=46523   # live ARC gateway
-rw-------  pid=229591 port=18889   # live Beacon gateway, different profile, same uid

Before the patch, live native tool calls from ARC were blocked:

> run date
Command blocked by PreToolUse hook: Native hook relay unavailable.
Command: date

> run time date
Command blocked by PreToolUse hook: Native hook relay unavailable.
Command: time date

After the patch, the runtime log at /tmp/openclaw/openclaw-2026-05-28.log showed only dead-PID records pruned:

pruned stale native hook relay bridge file file=084c22487544d61d08bb88f20d738f6b.json stalePid=240009 currentPid=243162 reason=dead-pid
pruned stale native hook relay bridge file file=2413c88dfc4ad1e4e37edfe66958a932.json stalePid=211925 currentPid=243162 reason=dead-pid
pruned stale native hook relay bridge file file=90c38fd2c272f3d6c9d03a346dce35f3.json stalePid=90952 currentPid=243162 reason=dead-pid
pruned stale native hook relay bridge file file=654149472e7cd9034900a8d2dd43661e.json stalePid=17733 currentPid=243162 reason=dead-pid
pruned stale native hook relay bridge file file=5835f8d6faece4fb32fcec4784cd25d3.json stalePid=11601 currentPid=243162 reason=dead-pid
Beacon bridge record (pid 229591) was not pruned because process.kill(229591, 0) succeeded.

After the patch, native tool calls succeeded across multiple turns and time intervals:

> run date
Thu May 28 02:55:15 EDT 2026

> run date
Thu May 28 02:55:19 EDT 2026

> run date
Thu May 28 03:18:41 EDT 2026

> run time date
Thu May 28 03:21:04 EDT 2026
real    0m0.001s
user    0m0.000s
sys     0m0.001s

The bridge directory after the patch contained only the two live processes:

$ ls -la /tmp/openclaw-native-hook-relays-1000/
-rw-------  pid=243162 port=46523   # ARC gateway, current process that just registered
-rw-------  pid=229591 port=18889   # Beacon gateway, live foreign record preserved

Observed result after fix: Before this fix, the failure was reproducible after a SIGUSR1 supervisor restart or non-graceful gateway exit left even one stale bridge record behind. After this fix, the same test cycle produced zero Native hook relay unavailable responses across native tool invocations, while the parallel Beacon gateway continued operating without interruption. The added in-repo regression tests also cover dead foreign PID pruning, expired foreign record pruning, and live unexpired foreign record preservation.

Not tested: I did not test on macOS or native Linux; the live proof was Ubuntu 22.04 in WSL2. I did not test the PID-recycling edge case where a previously-OpenClaw PID has been recycled by a non-OpenClaw process before registration; the predicate intentionally falls toward preserving that record until the recycled PID dies or expiresAtMs passes, which is safer than deleting a potentially live foreign record. I did not run the entire full local pnpm build && pnpm check && pnpm test matrix, but CI covers that breadth and the focused local checks listed below passed.

Local validation

  • node scripts/run-vitest.mjs src/agents/harness/native-hook-relay.test.ts - 138 tests passed
  • node scripts/run-oxlint.mjs src/agents/harness/native-hook-relay.test.ts src/agents/harness/native-hook-relay.ts - passed
  • git diff --check - passed
  • corepack pnpm check:workflows - passed
  • NODE_OPTIONS="--max-old-space-size=8192" pnpm build:plugin-sdk:strict-smoke - passed locally after enabling a local pnpm shim

Why this is safe

  • No public API change. The function signature is unchanged; behavior is purely additive.
  • No effect on the happy path. When the bridge dir is already clean, the scan is a no-op.
  • Live foreign records are preserved. Same-uid multi-profile setups continue to work.
  • process.kill(pid, 0) is signal-free under POSIX; it is a permission/existence probe only.
  • Errors are swallowed. A racing read/delete or unparseable file gets skipped, and registration of the current bridge proceeds.
  • Idempotent. Running registration multiple times produces the same final state.

What this PR does NOT do

This PR is intentionally scoped to defensive stale bridge-file cleanup. The broader architectural question of whether renderNativeHookRelayUnavailableResponse should fail open instead of fail closed is a separate safety-contract decision and should land in its own PR with an explicit config knob and maintainer discussion of the default.

I also have local-only patches that defer the dedup file delete by 200ms and suppress timer-fired handle.unregister from the Codex plugin's run-attempt module. They are useful as defense in depth but are not strictly necessary once this prune lands, so they are not included here.

Refs

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: XS triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 28, 2026
@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Codex review: passed. Reviewed May 28, 2026, 11:49 AM ET / 15:49 UTC.

Summary
This PR prunes dead or expired native-hook relay bridge registry files during registration, adds regression tests for the prune/preserve predicates, and raises the check-guards plugin SDK strict-smoke Node heap.

PR surface: Source +48, Tests +111, Config 0. Total +159 across 3 files.

Reproducibility: yes. The PR body provides live before/after terminal and log proof for stale dead-PID bridge files, and source inspection shows current main and v2026.5.27 register bridges without stale-file pruning; I did not rerun the live gateway scenario locally.

Review metrics: 1 noteworthy metric.

  • CI heap override: 1 workflow command changed. The branch raises check-guards plugin SDK strict-smoke to an 8 GB Node heap, so maintainers should notice the automation capacity change before merge.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Keep landing gated on exact-head checks and maintainer acceptance of the shared-uid prune boundary plus CI heap change.

Risk before merge

  • [P1] The bridge directory is shared by same-uid OpenClaw gateway/profile processes, so maintainers should accept dead PID or expiry as the compatibility boundary for deleting foreign bridge records.
  • [P1] If the prune predicate misclassifies a live foreign record, native tool availability can regress for another gateway/profile in the same uid-scoped directory.
  • [P1] The PR also raises the check-guards plugin SDK strict-smoke Node heap to 8 GB, which changes CI automation resource behavior in the same branch as the runtime fix.

Maintainer options:

  1. Accept the guarded prune boundary (recommended)
    Maintain the current dead-PID-or-expired predicate, rely on the added regression coverage and supplied live proof, and let exact-head checks gate landing.
  2. Split the CI heap increase
    Move the check-guards NODE_OPTIONS change into a separate CI PR if maintainers want this branch to carry only runtime relay behavior.
  3. Pause for broader relay policy
    Pause this PR if deleting expired live foreign records or fail-closed relay behavior needs a wider product/runtime policy decision first.

Next step before merge

  • [P2] No repair lane is needed; the branch is bounded, has no blocking code finding, and the remaining action is landing risk acceptance rather than a code repair.

Security
Cleared: No concrete security or supply-chain regression was found; deletion is scoped to the private uid relay directory and the workflow change only raises heap for an existing check.

Review details

Best possible solution:

Land the guarded stale bridge-file prune with its tests once maintainers accept the shared-uid deletion semantics and bundled CI heap change, keeping broader fail-open relay policy for a separate discussion.

Do we have a high-confidence way to reproduce the issue?

Yes. The PR body provides live before/after terminal and log proof for stale dead-PID bridge files, and source inspection shows current main and v2026.5.27 register bridges without stale-file pruning; I did not rerun the live gateway scenario locally.

Is this the best way to solve the issue?

Yes, with maintainer risk acceptance. Pruning dead-PID or expired bridge records before writing the current bridge is a narrow fix, and the branch preserves live same-uid records with tests; the CI heap increase is separable but not a correctness defect.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 79e733cc34a3.

Label changes

Label justifications:

  • P1: The PR targets a real native-hook relay failure that can block agent shell, filesystem, and local tool execution for users now.
  • merge-risk: 🚨 compatibility: The patch changes when one same-uid OpenClaw process may delete another bridge registry file, which is compatibility-sensitive for multi-profile setups.
  • merge-risk: 🚨 availability: A wrong bridge-pruning predicate could make live native hook relay records disappear and cause native tool calls to fail unavailable.
  • merge-risk: 🚨 automation: The PR changes a CI check command's Node heap limit, affecting automation resource behavior outside the runtime bug fix.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 🚀 automerge armed: This PR is in ClawSweeper's automerge lane. Sufficient (terminal): The PR body supplies after-fix live WSL2 gateway terminal/log proof showing dead bridge records pruned, a live foreign record preserved, and native commands succeeding after the patch.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body supplies after-fix live WSL2 gateway terminal/log proof showing dead bridge records pruned, a live foreign record preserved, and native commands succeeding after the patch.
Evidence reviewed

PR surface:

Source +48, Tests +111, Config 0. Total +159 across 3 files.

View PR surface stats
Area Files Added Removed Net
Source 1 49 1 +48
Tests 1 112 1 +111
Docs 0 0 0 0
Config 1 1 1 0
Generated 0 0 0 0
Other 0 0 0 0
Total 3 162 3 +159

What I checked:

  • Repository policy read: Read the full root AGENTS.md and the scoped src/agents/AGENTS.md; their compatibility-sensitive runtime, proof, and test-surface guidance affected this review. (AGENTS.md:1, 79e733cc34a3)
  • Current main lacks stale bridge pruning: On current main, registerNativeHookRelayBridge unregisters the current relay and creates a new bridge without scanning the uid-scoped bridge directory for dead or expired foreign records. (src/agents/harness/native-hook-relay.ts:703, 79e733cc34a3)
  • Latest release also lacks stale bridge pruning: The v2026.5.27 implementation of registerNativeHookRelayBridge has the same no-prune registration path, so this PR is not duplicating a shipped fix. (src/agents/harness/native-hook-relay.ts:696, 27ae826f6525)
  • PR adds guarded cleanup: At the PR head, registration scans .json bridge files, probes foreign PIDs, preserves live unexpired records, and removes dead-PID or expired records before writing the current bridge. (src/agents/harness/native-hook-relay.ts:704, 613071ef179b)
  • Predicate regression coverage: The PR head adds focused tests for pruning a dead foreign PID, pruning an expired foreign record even when the PID is alive, and preserving a live unexpired foreign record. (src/agents/harness/native-hook-relay.test.ts:881, 613071ef179b)
  • Workflow surface changed: The PR also changes the check-guards plugin SDK strict-smoke command to run with an 8 GB Node heap, which is separate automation behavior bundled with the runtime fix. (.github/workflows/ci.yml:1175, 613071ef179b)

Likely related people:

  • pash-openai: git blame on the bridge registration path ties the core direct-bridge lifecycle to commit 3b5dab3, which added the keep-live-across-turns relay implementation and tests. (role: introduced behavior; confidence: high; commits: 3b5dab372ac2; files: src/agents/harness/native-hook-relay.ts, src/agents/harness/native-hook-relay.test.ts)
  • Thesaranshn8n: git blame shows the request/server setup inside registerNativeHookRelayBridge came from the recent shared native-hook relay registry work. (role: recent area contributor; confidence: high; commits: 6729dea36f75; files: src/agents/harness/native-hook-relay.ts, src/agents/harness/native-hook-relay.test.ts)
  • Peter Steinberger: git blame shows the bridge directory and registry write hardening around this path came from Peter's May native-hook-relay commits, and the latest release tag is also in this area of release ownership. (role: adjacent owner; confidence: medium; commits: 694ca50e9775, a641a27bd437; files: src/agents/harness/native-hook-relay.ts)
  • amknight: commit 42e9504 recently touched the same native-hook relay and Codex app-server resume path to preserve relay generation across restarts, which is adjacent to this stale bridge-file fix. (role: recent adjacent contributor; confidence: medium; commits: 42e9504114f3; files: src/agents/harness/native-hook-relay.ts, src/agents/harness/native-hook-relay.test.ts, extensions/codex/src/app-server/native-hook-relay.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c10b3d807

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/harness/native-hook-relay.ts Outdated
Comment on lines +716 to +717
if (rec && typeof rec.pid === "number" && rec.pid !== process.pid) {
rmSync(full, { force: true });

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve live bridge records from other processes

When two OpenClaw/Codex gateway or agent processes run under the same OS user, they share this uid-scoped bridge directory while their records are keyed by different relay ids. This unconditional pid !== process.pid prune lets whichever process registers next delete the other live process's bridge file, so the other session's hook CLI can no longer find its direct bridge and falls back to unavailable behavior until it re-registers; prune should verify the recorded PID is actually dead or the record is expired instead of treating every non-current PID as stale.

Useful? React with 👍 / 👎.

@Applied-AI-Solutions-hub Applied-AI-Solutions-hub force-pushed the fix/native-hook-relay-stale-prune branch from 3c10b3d to b2fbd28 Compare May 28, 2026 07:41
@Applied-AI-Solutions-hub

Copy link
Copy Markdown
Contributor Author

Checked this PR against release/2026.5.27 (the branch that produced 2026.5.27-beta.1) to make sure I'm not duplicating in-flight work:

  • registerNativeHookRelayBridge and unregisterNativeHookRelayBridge are byte-identical between main and release/2026.5.27. The stale-file prune I'm adding is not present on either branch.
  • The 2026.5.27 release branch does tighten related code: canAcceptNativeHookRelayGenerationMismatch was simplified to drop the per-generation tracking, and the agent-tools.before-tool-call.js import was renamed to pi-tools.before-tool-call.js. Neither touches the bridge dir / file lifecycle that this PR addresses.
  • I've rebased my branch onto the current upstream main so the PR is fresh.

Net: this fix is complementary to what's already in flight for 2026.5.27, not duplicative. Happy to retarget to release/2026.5.27 if maintainers prefer the fix lands in the upcoming release rather than waiting for the next minor.

— Adam

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 🛠️ actively grinding The PR author has acted after the latest ClawSweeper review and work remains. labels May 28, 2026
@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg: ✨ hatched 🥚 common Clockwork Proofling. Rarity: 🥚 common. Trait: polishes edge cases.

Details

Share on X: post this hatch
Copy: My PR egg hatched a 🥚 common Clockwork Proofling in ClawSweeper.
Hatchability:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

About:

  • Eggs appear after real-behavior proof passes. They are collectible flavor only.
  • Review momentum changes the shell state: follow-up work warms it, re-review makes it wobble, and a clean final review lets it hatch.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

@clawsweeper clawsweeper Bot added status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. and removed status: 🛠️ actively grinding The PR author has acted after the latest ClawSweeper review and work remains. labels May 28, 2026
@Applied-AI-Solutions-hub Applied-AI-Solutions-hub force-pushed the fix/native-hook-relay-stale-prune branch from 722d492 to 0d6d3a7 Compare May 28, 2026 08:28
@Applied-AI-Solutions-hub

Copy link
Copy Markdown
Contributor Author

Quick triage of the remaining failing checks against the diff in this PR. All three failures are in files this PR does not modify, and per CONTRIBUTING.md ("Do not submit test/CI-only PRs for known main failures") I'm flagging rather than chasing them. Happy to be corrected if maintainers see a connection I'm missing.

check-guards — Node OOM in TypeScript declaration build

The failure is in the pnpm build:plugin-sdk:dts step, specifically node scripts/run-tsgo.mjs -p tsconfig.plugin-sdk.dts.json --declaration true:

FATAL ERROR: Ineffective mark-compacts near heap limit
Allocation failed - JavaScript heap out of memory
[ELIFECYCLE] Command failed with exit code 134.

TypeScript hit the default ~2 GB Node heap during a full plugin-SDK .d.ts generation. The only diff in this PR is a 30-line additive change to one function in src/agents/harness/native-hook-relay.ts — it cannot meaningfully change TypeScript's heap profile. Looks like the tsgo run was already close to the OOM line on main.

checks-node-agentic-control-plane-runtime — pre-existing vitest module-mock teardown

FAIL  src/gateway/server.shared-token-hot-reload.test.ts
Error: [vitest] There was an error when mocking a module.
Caused by: EnvironmentTeardownError:
  Cannot load '/src/config/io.ts?_vitest_original' after the environment was torn down.

The failing file is src/gateway/server.shared-token-hot-reload.test.ts and the underlying environment-teardown trace runs through src/gateway/test-helpers.server.ts, src/config/config.ts, src/config/io.ts, src/agents/subagent-orphan-recovery.ts. This PR does not modify any of those.

checks-node-auto-reply-reply-agent-runner — pre-existing memory-flush assertion

FAIL  src/auto-reply/reply/agent-runner-memory.test.ts
  > runMemoryFlushIfNeeded
  > does not count bytes from a large latest usage record as post-usage tail pressure
AssertionError: expected [ Array(1) ] to deeply equal []

The failing path is src/auto-reply/reply/agent-runner-memory.test.ts and the production code it exercises is under src/auto-reply/**. This PR does not modify any of those.

What is passing

  • check-lint
  • check-prod-types, check-test-types, check-dependencies
  • All check-additional-* boundary checks
  • All Security High * and Critical Quality * jobs
  • Real behavior proof (3 consecutive successful runs at 09:00:27, 09:00:44, 09:01:00 against the latest body and head)
  • All checks-fast-* shards and bundled-protocol checks

If the OOM or the two test failures are not already tracked, I'm happy to file separate issues for them. If they are, I'll defer to whoever's triaging.

— Adam

Only the current gateway process should own records in the native hook
relay bridge directory. After a crash, SIGUSR1 supervisor restart, or any
non-graceful shutdown, stale bridge files (pointing to dead PIDs and
unbound ports) can remain in /tmp/openclaw-native-hook-relays-<uid>/.

When Codex clients enumerate the bridge directory and select a stale
registration, every native tool call fails with 'Native hook relay
unavailable' until the next clean restart. This is the recurring root
cause behind openclaw#73723 and openclaw#87536.

Fix: at the start of registerNativeHookRelayBridge, scan the bridge
directory and remove any .json record whose 'pid' field does not match
process.pid. This makes the relay system self-healing across crash/
restart cycles without requiring operators to manually clean up the
directory or rely on ExecStartPre hooks.

Refs: openclaw#87536, openclaw#73723

Signed-off-by: Adam Houk <adam@appliedai.solutions>
@Applied-AI-Solutions-hub Applied-AI-Solutions-hub force-pushed the fix/native-hook-relay-stale-prune branch from 8a38d75 to 241d574 Compare May 28, 2026 09:16
@Applied-AI-Solutions-hub

Copy link
Copy Markdown
Contributor Author

Thanks for the careful review, ClawSweeper — the foreign-live-record concern is valid and I've now addressed it.

Updated behavior

The prune predicate is no longer "PID ≠ mine → delete". It now requires the foreign PID to be dead (or the record to be expired) before we touch the file:

if (!rec || typeof rec.pid !== "number" || rec.pid === process.pid) {
  continue;
}
let pidAlive = false;
try {
  process.kill(rec.pid, 0);
  pidAlive = true;
} catch {
  pidAlive = false;
}
const expired = typeof rec.expiresAtMs === "number" && now > rec.expiresAtMs;
if (pidAlive && !expired) {
  // Live foreign record from another same-uid gateway/profile. Preserve it.
  continue;
}
rmSync(full, { force: true });

This means:

  • Live foreign records are preserved. A second OpenClaw gateway/profile under the same uid keeps its bridge record intact when another instance registers.
  • Dead-PID records are pruned. Records left behind by crashed/killed prior processes are removed.
  • Expired records are pruned. Records whose expiresAtMs has passed are removed.
  • The prune is now race-safe in a way the old predicate wasn't. process.kill(pid, 0) is a permission probe under POSIX; no signal is delivered, so it cannot interfere with the foreign process.

Why process.kill(pid, 0) is the right liveness check here

It is the standard POSIX idiom for "is this PID alive in my process namespace?" and is what every Node lifecycle tool (PM2, supervisord wrappers, etc.) uses. Caveats I am aware of and intentionally accepted:

  • PID recycling: a recycled PID belonging to a non-OpenClaw process would falsely register as "alive" and the record would be preserved. This is the safe direction (false-preserve over false-delete). Such a record will still get pruned the next time either (a) the recycled PID dies again, or (b) its expiresAtMs passes.
  • Different uid: records under /tmp/openclaw-native-hook-relays-<uid>/ are always owned by the same uid as the current process, so process.kill(pid, 0) will not get EPERM for a foreign record.

Updated diff size

+49 / −1 lines in src/agents/harness/native-hook-relay.ts (was +30 / −1; the +19 is the liveness/expiry logic and comments).

What ClawSweeper asked for that I have not added yet

The review recommends adding a regression test for the dead-stale-vs-live-foreign distinction. I have not added one because the existing native-hook-relay.test.ts does not have a fixture for spawning a foreign-PID record with a live external process, and writing one would expand the PR scope beyond a defensive fix. Happy to add it in a follow-up if maintainers prefer; in the meantime the corrected predicate is small enough to be reviewed by reading.

The full updated source is on the same branch (fix/native-hook-relay-stale-prune, head 241d574324). PR body has also been updated with the new preservation behavior in the Real behavior proof section.

— Adam

@Applied-AI-Solutions-hub

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@Applied-AI-Solutions-hub

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

@Applied-AI-Solutions-hub

Copy link
Copy Markdown
Contributor Author

Wrapping up the contributor-side work on this PR. Posting one consolidated status so maintainers have a clean picture.

What's now in this PR

  • Source diff: +49 / −1 lines, single file (src/agents/harness/native-hook-relay.ts).
  • Prune predicate now requires the foreign PID to be dead (or the record's expiresAtMs past) before removing the file. Same-uid multi-gateway records are preserved (verified live against my own parallel openclaw-beacon-gateway.service on the same uid).
  • Real behavior proof rewritten to match the six required field headings (Behavior or issue addressed, Real setup tested, Exact steps or command run after the patch, After-fix evidence, Observed result, Not tested). Evidence includes both the dead-record-pruned case and the live-foreign-record-preserved case.
  • ClawSweeper's "preserve live foreign records" finding addressed inline; reply with the full reasoning is upthread.
  • AI-assisted disclosure added at the top of the body.

CI state

Green:

  • check-lint, check-prod-types, check-test-types, check-dependencies
  • All check-additional-* boundary checks
  • All Security High * and Critical Quality * shards
  • Real behavior proof — latest completed runs are green (the intermittent "fail" rows in the PR check list are older intermediate merge SHAs from before the latest body update propagated; the latest workflow run conclusion is success)
  • All checks-fast-* shards, bundled-protocol, network-runtime-boundary

Red, and not contributor-fixable from this diff:

  • check-guardspnpm build:plugin-sdk:dts step hits the default ~2 GB Node heap during node scripts/run-tsgo.mjs -p tsconfig.plugin-sdk.dts.json --declaration true:

    FATAL ERROR: Ineffective mark-compacts near heap limit
    Allocation failed - JavaScript heap out of memory
    [ELIFECYCLE] Command failed with exit code 134.
    

    Reproduces on every push of this branch, including the small original 30-line diff. A 49-line additive change to one function in one file cannot meaningfully change TypeScript's heap profile during a full plugin-SDK declaration build. Looks like the runner is hitting the OOM line on the main build itself. Likely needs NODE_OPTIONS=--max-old-space-size=4096 on that step, or sharding the declaration build.

  • checks-node-agentic-control-plane-runtime — vitest module-mock teardown error in src/gateway/server.shared-token-hot-reload.test.ts:

    Error: [vitest] There was an error when mocking a module.
    Caused by: EnvironmentTeardownError:
      Cannot load '/src/config/io.ts?_vitest_original' after the environment was torn down.
    

    The failing path runs through src/gateway/test-helpers.server.ts, src/config/config.ts, src/config/io.ts, src/agents/subagent-orphan-recovery.ts. This PR does not modify any of those.

  • checks-node-auto-reply-reply-agent-runner — assertion failure in src/auto-reply/reply/agent-runner-memory.test.ts:1594:

    FAIL  > runMemoryFlushIfNeeded > does not count bytes from a large latest usage record as post-usage tail pressure
    AssertionError: expected [ Array(1) ] to deeply equal []
    

    Failing path is entirely under src/auto-reply/**. This PR does not modify any of those.

Per CONTRIBUTING.md ("Do not submit test/CI-only PRs for known main failures") I am flagging rather than chasing the three red checks. Happy to be corrected if a maintainer sees a connection I missed, and happy to file separate issues for any of them if they are not already tracked.

What I'd ask of a reviewer

  1. Confirm whether the same-uid multi-profile preservation matches the intended directory ownership model (the question ClawSweeper raised). My read is that since writeNativeHookRelayBridgeRecordForRegistration always writes pid: process.pid, the directory is implicitly per-process even though it's filesystem-scoped per uid; the prune is conservative against that.
  2. Direction on whether this should target release/2026.5.27 instead of main to land in 2026.5.27-beta.2.
  3. Whether a follow-up regression test (live-foreign-record preserved, dead-PID pruned) should be in this PR or a separate one.

The branch is Applied-AI-Solutions-hub:fix/native-hook-relay-stale-prune at HEAD 241d574324. Marked "Allow edits by maintainers" so a maintainer can push directly if they want to finish anything urgent without waiting on me.

Stepping back — this is now in maintainer territory. Cheers.

— Adam

@Applied-AI-Solutions-hub

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

Re-review progress:

@openclaw-barnacle openclaw-barnacle Bot added triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. proof: supplied External PR includes structured after-fix real behavior proof. and removed proof: sufficient ClawSweeper judged the real behavior proof convincing. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 28, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. merge-risk: 🚨 automation 🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation. labels May 28, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 28, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels May 28, 2026
@moeedahmed

Copy link
Copy Markdown
Contributor

Adding one more real-world data point from a macOS production OpenClaw install, because the PR body says macOS was not tested.

Environment:

  • OpenClaw 2026.5.26
  • macOS Darwin 25.3.0 arm64, Mac Mini M4
  • Codex app-server managed by the OpenClaw gateway under ~/.openclaw/npm/.../@openclaw/codex, approval policy set to never
  • Telegram/forum sessions using Codex-backed agent turns

Observed failure before containment:

  • Multiple existing topic sessions stopped producing user-visible replies or reported Native hook relay unavailable style restriction.
  • The failure was not topic-specific: in the Operator/System session, even harmless native tool calls were blocked before execution.
  • Examples included simple shell checks such as printf/rg/ls, and GitHub/app-tool fetches. The tool call was rejected before the command itself ran.
  • Affected sessions were not doing anything inherently mutating; one Ledger audit was mostly read-only analysis but still reported that native hook relay restrictions prevented completing its full pass.

Containment result:

  • A first-class OpenClaw gateway restart restored native tool execution immediately.
  • After restart, shell smoke succeeded, GitHub PR/comment fetches succeeded, the Ledger topic session was healthy with queue depth 0, and process inspection showed only the gateway-managed Codex app-server pair. No standalone /opt/homebrew/bin/codex app-server duplicate was present.

Why this looks related to this PR:

  • The local failure shape matches a stale native-hook relay bridge/registry state rather than a Ledger skill/config problem or approval-policy problem.
  • The current gateway env already had the intended approval-policy containment, so the older waitDecision/approval fallback issue was not the blocker.
  • Restart clearing the issue is consistent with stale bridge records or stale relay generation state being refreshed.

Useful extra signal for maintainers:

  • This failure can present as silent topic stalls in Telegram/forum sessions, not only as an explicit CLI error.
  • It can block safe read-only inspection, which makes debugging the failure from inside the affected session difficult.
  • A targeted stale bridge cleanup like this PR looks like the right durable direction; the separate fail-open-for-read-only idea would be defense-in-depth, but this stale-record cleanup addresses the observed root shape.

I have not patched this production install with the PR branch yet; this comment is containment/repro evidence from the live macOS install after restoring service with a gateway restart.

@Takhoffman

Copy link
Copy Markdown
Contributor

@clawsweeper visualize

@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

🦞👀
ClawSweeper visual brief is being prepared.

I queued a read-only visual pass. It will create or update one marker-backed visual brief comment and will not trigger close, merge, repair, label, or branch changes.

Lens: auto

@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Source: #87563 (comment)
Visual model: gpt-5.5, reasoning low.

Visual brief

Lens: flow + state

Advisory only: this brief is for maintainer judgment; maintainers remain the final judges.

Current failure shape

/tmp/openclaw-native-hook-relays-<uid>/
+----------------------+----------------------+------------------+
| bridge file          | recorded owner       | actual state      |
+----------------------+----------------------+------------------+
| old ARC records      | dead PIDs            | 🐛 stale          |
| current ARC record   | live current PID     | ✅ valid          |
| Beacon record        | live foreign PID     | ✅ must preserve  |
+----------------------+----------------------+------------------+

Codex client
    |
    v
enumerates bridge files
    |
    +--> picks live current bridge ---------------> ✅ native tools run
    |
    +--> picks stale dead-PID bridge -------------> ❌ Native hook relay unavailable
Proposed registration-time cleanup

registerNativeHookRelayBridge()
    |
    v
scan uid-scoped bridge dir
    |
    +-- current PID record -----------------------> ✅ keep
    |
    +-- foreign PID alive AND not expired --------> ✅ keep
    |
    +-- foreign PID dead -------------------------> 🐛 prune
    |
    +-- expired record ---------------------------> 🐛 prune
    |
    v
write current bridge file ------------------------> ✅ usable relay set
Before / after

Before patch:
+----------------------+--------------------------------------------+
| restart/crash leaves | stale .json files remain on disk           |
| next session         | client may select dead bridge              |
| operator impact      | native tools blocked before execution      |
| multi-gateway risk   | not addressed by stale cleanup alone       |
+----------------------+--------------------------------------------+

After patch:
+----------------------+--------------------------------------------+
| registration         | prunes dead/expired records                |
| live foreign bridge  | preserved when PID is alive and unexpired  |
| operator impact      | supplied proof shows native tools recover  |
| multi-gateway risk   | narrowed, but semantics need maintainer OK |
+----------------------+--------------------------------------------+

Proof map:

✅ WSL2/Ubuntu live proof:
   5 dead PID bridge files pruned
   current ARC bridge kept
   live Beacon same-uid bridge kept
   date/time date worked after cleanup

✅ Regression tests claimed in PR:
   dead foreign PID is pruned
   expired foreign record is pruned
   live unexpired foreign record is preserved

⚠️ macOS signal:
   separate commenter reports similar production failure shape
   restart restored service
   PR branch not tested there

⚠️ Remaining CI/process concern:
   workflow heap change is bundled with runtime fix
   some red checks are described as unrelated/pre-existing

Maintainer decision points:

🧑‍⚖️ Is uid-scoped cleanup acceptable?
    Keep live foreign records; prune dead or expired records.

🧑‍⚖️ Is PID-recycling false-preserve acceptable?
    Current behavior favors preserving suspicious records until expiry/death.

🧑‍⚖️ Should the CI heap increase ride with this PR?
    It fixes a check-guards OOM path but is separable from relay behavior.

🧑‍⚖️ Is this enough for the targeted bug?
    This addresses stale filesystem bridge records, not the broader fail-open
    policy or all native-hook-relay unavailable modes.

Legend: ✅ expected/proven, ❌ failing path, ⚠️ unresolved risk, 🐛 confirmed bug path, 🧑‍⚖️ maintainer judgment point.

Maintainer ruling

Self-heals stale native hook bridge files after crashes/restarts while preserving live same-uid gateway records.

Shared uid bridge-dir semantics, PID recycling false-preserve behavior, and bundled CI heap change need maintainer acceptance.

Exact-head required checks green or explicitly judged unrelated; maintainer confirmation that dead/expired cleanup semantics are correct across supported gateway/profile layouts.

Maintainer review focused on shared bridge-dir lifecycle, bundled workflow heap change, and exact-head check status.

Should this targeted stale bridge-file cleanup land as-is, or should the CI heap change / broader relay policy questions be split from the runtime fix?

@Takhoffman

Copy link
Copy Markdown
Contributor

@clawsweeper automerge

@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

🦞🔧
ClawSweeper picked up the repair feedback.

Source: clawsweeper[bot]
Feedback: trusted ClawSweeper review contains P-severity findings (sha=613071ef179bd015ec9071d5dde2edc1ad3d9424)
Action: repair worker queued. Run: https://github.com/openclaw/clawsweeper/actions/runs/26585714102
Model: gpt-5.5

I will update this PR branch, or open a safe credited replacement, if the repair worker finds a narrow fix.

Automerge progress:

  • 2026-05-28 15:50:04 UTC review requested repair 613071ef179b (trusted ClawSweeper review contains P-severity findings (sha=613071ef179bd015ec...)
  • 2026-05-28 15:43:42 UTC review queued 613071ef179b (queued)

@clawsweeper clawsweeper Bot added clawsweeper:automerge Maintainer opted this PR into bounded ClawSweeper-reviewed automerge status: 🚀 automerge armed This PR is in ClawSweeper's automerge lane. and removed status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 28, 2026
@Takhoffman

Copy link
Copy Markdown
Contributor

@clawsweeper automerge

@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

Re-review progress:

@clawsweeper

clawsweeper Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper 🐠 reef update

Thanks for the work on this. ClawSweeper did not have permission to update this branch directly, so it opened a narrow replacement PR instead. that's a branch access thing, not a knock on the contribution.

Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
Replacement PR: #87706
Why close: this run explicitly closes the superseded source PR after the credited replacement PR is open, so review continues in one place.
This source PR is being closed only under the explicit source-close setting for this ClawSweeper run.
The replacement PR carries the original credit trail forward.
Co-author credit kept:

fish notes: model gpt-5.5, reasoning high; reviewed against baf3826.

@clawsweeper clawsweeper Bot closed this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling clawsweeper:automerge Maintainer opted this PR into bounded ClawSweeper-reviewed automerge merge-risk: 🚨 automation 🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: S status: 🚀 automerge armed This PR is in ClawSweeper's automerge lane.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants