Skip to content

fix(codex): share native hook relay state#86340

Closed
Kaspre wants to merge 1 commit into
openclaw:mainfrom
Kaspre:fix/native-hook-relay-shared-state
Closed

fix(codex): share native hook relay state#86340
Kaspre wants to merge 1 commit into
openclaw:mainfrom
Kaspre:fix/native-hook-relay-shared-state

Conversation

@Kaspre

@Kaspre Kaspre commented May 25, 2026

Copy link
Copy Markdown
Contributor

Problem

Codex native hook relays can be registered in one module instance and invoked from another module instance in the same process. When those instances each keep their own in-memory relay maps, OpenClaw can install Codex hook commands but later fail to find the active relay for the same run. A separate race also lets a late hook command for an old run reuse the stable relay id after a replacement run has registered.

Fixes #73723 and replaces the abandoned approach in #73950.

Change and value

This PR moves native hook relay registry state onto a process-global symbol so duplicate module instances share the same relay, bridge, invocation, and permission-approval state. It also adds a per-registration generation token to newly generated hidden hook commands and validates that token through the direct bridge, gateway fallback, and Codex app-server approval bridge.

The result is that current hook commands still reach the active relay, while stale commands fail closed instead of invoking a replacement registration with the same stable relay id.

Who is affected, and who is not

  • Affected: Codex app-server/native hook relay users, especially runs that load duplicate agent-harness runtime instances or overlap old and replacement Codex runs for the same session relay id.
  • Not affected: ordinary plugin hooks, non-Codex native relay providers, and public user-facing CLI commands. hooks relay remains a hidden internal command.
  • Compatibility: buildNativeHookRelayCommand() and the hidden CLI still tolerate omitted generations so older already-written commands fail through the normal unavailable path rather than crashing at option parsing.

Why now

#73723 is still open, and #73950 has gone stale. Recent cleanup work fixed one pending-unregister race, but it did not make duplicate module instances share relay state or prevent late stale commands from crossing into replacement registrations.

Implementation

  • Store relay registrations, bridge records, invocation history, pending permission approvals, rate-limit windows, and allow-always approvals in a process-global Symbol.for("openclaw.nativeHookRelay.state") object.
  • Add an internal required generation to active registrations while keeping the exported handle field optional for SDK/source compatibility.
  • Include --generation in newly generated hidden hook commands.
  • Require matching generations for direct bridge requests, gateway nativeHook.invoke, and the Codex app-server approval bridge's in-process relay invocation.
  • Return/recognize stale bridge registrations as a fail-closed unavailable relay result instead of falling back to a newer gateway registration.
  • Keep replacement pending permission approvals from being deleted by stale approval finalizers.

Verification

  • Local lightweight checks:
    • git diff --check origin/main...HEAD
    • ./node_modules/.bin/oxfmt --check --threads=1 extensions/codex/src/app-server/approval-bridge.ts extensions/codex/src/app-server/approval-bridge.test.ts extensions/codex/src/app-server/native-hook-relay.test.ts src/agents/harness/native-hook-relay.ts src/agents/harness/native-hook-relay.test.ts src/cli/hooks-cli.ts src/cli/native-hook-relay-cli.ts src/cli/native-hook-relay-cli.test.ts src/gateway/server-methods/native-hook-relay.ts src/gateway/server-methods/native-hook-relay.test.ts
  • Review gates:
    • Codex review after rebase against origin/main: no actionable findings.
    • Read-only adversarial lanes during development found stale-generation and app-server approval-bridge gaps; both were fixed before final validation.
  • Final Crabbox/GCP validation:
    • Lease cbx_9c0044dd81ae, slug brisk-hermit, GCP us-east1-b, Spot e2-standard-8, no local sync.
    • Fresh GitHub clone of openclaw/openclaw, fetched Kaspre/openclaw branch fix/native-hook-relay-shared-state, checked out exact head 2a37ac2fc5945c7a3020dfa92c47625ea51dbd67.
    • Live proof command output:
[proof] head=2a37ac2fc5945c7a3020dfa92c47625ea51dbd67
[proof] duplicate-module-invoke=0
[proof] stale-generation-error=native hook relay bridge stale registration
[proof] replacement-generation-invoke=0
  • Focused changed-path tests passed: src/agents/harness/native-hook-relay.test.ts, src/cli/native-hook-relay-cli.test.ts, src/gateway/server-methods/native-hook-relay.test.ts, extensions/codex/src/app-server/native-hook-relay.test.ts, extensions/codex/src/app-server/approval-bridge.test.ts (118 tests across 4 Vitest shards).
  • pnpm check:changed passed on the same exact head.

Real behavior proof

Behavior or issue addressed: Codex native hook relay state remains available across duplicate module instances, and a stale relay command from an old stable relay-id registration cannot invoke the replacement registration.

Real environment tested: Crabbox GCP Spot VM (us-east1-b, e2-standard-8) from a fresh GitHub clone, checking out pushed fork head 2a37ac2fc5945c7a3020dfa92c47625ea51dbd67.

Exact steps or command run after this patch: On the Crabbox VM, cloned https://github.com/openclaw/openclaw.git, fetched https://github.com/Kaspre/openclaw.git fix/native-hook-relay-shared-state, checked out FETCH_HEAD, installed Node 24.15.0 and pnpm@11.0.8, then ran a node --import tsx --input-type=module -e ... script that imported src/agents/harness/native-hook-relay.ts twice, registered a relay in one module instance, invoked it from the duplicate module instance, replaced the same stable relay id, attempted a stale-generation bridge invocation, and then invoked the replacement generation. The same VM then ran the focused Vitest files and pnpm check:changed.

Evidence after fix: Terminal capture from lease cbx_9c0044dd81ae:

[proof] head=2a37ac2fc5945c7a3020dfa92c47625ea51dbd67
[proof] duplicate-module-invoke=0
[proof] stale-generation-error=native hook relay bridge stale registration
[proof] replacement-generation-invoke=0

Observed result after fix: The duplicate module successfully invoked the current relay (exitCode=0), the stale generation was rejected with native hook relay bridge stale registration, the replacement generation invoked successfully (exitCode=0), focused changed-path tests passed, and pnpm check:changed exited 0.

What was not tested: No changed-path behavior gap remains. The full unrelated repository test suite is left to GitHub CI; this PR ran live relay proof, focused changed-path tests, and check:changed on the exact pushed head.

AI Assistance

I used Codex to prepare the patch and verification.

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime cli CLI command changes agents Agent runtime and tooling extensions: codex size: L proof: supplied External PR includes structured after-fix real behavior proof. labels May 25, 2026
@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

Thanks for the context here. I swept through the related work, and this is now duplicate or superseded.

Keep open for merge review. The branch addresses a real Codex native hook relay bug with source-aligned shared state, stale-generation protection, focused coverage, and sufficient exact-head Crabbox proof; I found no blocking correctness or security finding. The remaining merge question is maintainer acceptance of the hidden relay compatibility behavior plus normal required CI.

Canonical path: Close this PR as superseded by #73950.

So I’m closing this here and keeping the remaining discussion on #73950.

Review details

Best possible solution:

Close this PR as superseded by #73950.

Do we have a high-confidence way to reproduce the issue?

Yes at source level: current main still keeps relay state module-local and throws the reported relay-not-found error when the active id is missing from that module instance. I did not rerun the live current-main Telegram/Slack scenario, but the linked issue and comments provide repeated live logs.

Is this the best way to solve the issue?

Yes, the proposed direction is the narrow maintainable fix for the duplicate-module root cause: share relay-owned state once and require per-registration generation for stale stable ids. The only caveat is the deliberate fail-closed handling for generationless hidden commands across upgrades.

Security review:

Security review cleared: The security-sensitive native hook relay and approval path was reviewed; the diff tightens stale relay invocation and introduces no dependency, workflow, secret, package, or supply-chain change.

AGENTS.md: found and applied where relevant.

What I checked:

  • linked superseding PR: fix(codex): share native hook relay registry #73950 (fix(codex): share native hook relay registry) is still open as the canonical replacement.
  • cluster evidence: the durable review links that PR in the work cluster or recommended risk path.
  • no human follow-up: live comments and timeline hydrated by apply contain no non-automation activity after the ClawSweeper review.

Likely related people:

  • pashpashpash: Authored the merged Codex native hook bridge foundation and follow-up MCP/native hook hardening on the central relay, CLI, gateway, and Codex adapter paths. (role: introduced native relay behavior; confidence: high; commits: 7a958d920c88, 34fb96622eb6, bf7d156bb099; files: src/agents/harness/native-hook-relay.ts, src/cli/native-hook-relay-cli.ts, src/gateway/server-methods/native-hook-relay.ts)
  • pash-openai: Authored the merged cross-process/direct bridge and stable relay-id work that this PR builds on for replacement and stale-registration behavior. (role: recent relay reliability contributor; confidence: medium; commits: 3b5dab372ac2; files: src/agents/harness/native-hook-relay.ts, src/cli/native-hook-relay-cli.ts, extensions/codex/src/app-server/run-attempt.ts)
  • shakkernerd: Authored the allow-always native approval reuse work whose cache is now included in the PR's process-global relay state. (role: recent relay approval contributor; confidence: high; commits: f011d6bc0a75; files: src/agents/harness/native-hook-relay.ts, src/agents/harness/native-hook-relay.test.ts)
  • Kaspre: Authored the merged deferred-unregister race fix and prior Codex native approval/side-question work on the same relay and app-server surfaces, in addition to this PR branch. (role: recent adjacent fixer and current PR author with prior merged history; confidence: high; commits: 96959ec3d78e, 69a0c925b890; files: src/agents/harness/native-hook-relay.ts, extensions/codex/src/app-server/run-attempt.ts, extensions/codex/src/app-server/approval-bridge.ts)
  • vincentkoc: Authored the before-agent-finalize hook work that extended the same native relay event path and Codex hook config tests. (role: adjacent hook feature contributor; confidence: medium; commits: f3accc753cf2; files: src/agents/harness/native-hook-relay.ts, extensions/codex/src/app-server/native-hook-relay.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against a5d5604198c3.

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. labels May 25, 2026
@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper applied the proposed close for this PR.

@clawsweeper clawsweeper Bot closed this May 25, 2026

Kaspre commented May 25, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper reopen

For maintainer context after the automated duplicate/superseded closure: I don’t think #73950 fully supersedes this branch in its current state.

#73950 is still the original shared-registry direction, but as of this check it is stale/conflicting and its own ClawSweeper review says it needs changes before merge: rebase, include the permissionAllowAlwaysApprovals cache in shared state, and satisfy real-behavior-proof checks.

This PR’s branch was built from current main and addressed those gaps plus the later stale stable-relay-id race:

  • shares all relay-owned in-memory state, including permissionAllowAlwaysApprovals
  • adds per-registration generation validation through hidden CLI relay commands, direct bridge, gateway nativeHook.invoke, and Codex app-server approval handling
  • rejects stale bridge/gateway invocations instead of letting an old stable relay id invoke a replacement registration
  • includes focused coverage across relay, CLI, gateway, and Codex app-server paths
  • passed exact-head Crabbox/GCP proof and pnpm check:changed at 2a37ac2fc5945c7a3020dfa92c47625ea51dbd67

If maintainers prefer #73950 as the canonical PR, the actionable path is probably to carry this branch’s additional generation/stale-id and app-server/gateway/CLI coverage into #73950. Otherwise, #86340 is the current complete branch I’d recommend reviewing for the fix.

@RomneyDa

RomneyDa commented May 26, 2026

Copy link
Copy Markdown
Member

Status update: I’m carrying the remaining work from this branch into #73950 as the canonical PR, preserving #73950’s original commit and adding your relay-state/stale-generation fixes as a new commit with you on it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling cli CLI command changes extensions: codex gateway Gateway runtime merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: L status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Native hook relay remains unavailable after gateway restart and hooks report 5/5 ready

2 participants