[codex] Add generic OpenClaw code mode#80600

Merged
steipete merged 5 commits into main from codex/code-mode-runtime on May 15, 2026

Conversation

@steipete

@steipete steipete commented May 11, 2026

Contributor

Summary

Adds experimental OpenClaw code mode: an opt-in agent tool mode where the provider sees only exec and wait, while enabled OpenClaw/plugin/MCP/client tools remain available through a run-scoped hidden catalog bridge.

This is OpenClaw-owned and generic. It is separate from Codex Code mode, does not import Codex internals, and keeps nested tool calls inside the existing OpenClaw approval, hook, session, transcript, and accounting paths.

What Changed

  • Adds the code-mode runtime in src/agents/code-mode.ts plus a QuickJS-WASI worker entry in src/agents/code-mode.worker.ts.
  • Supports JavaScript and TypeScript exec cells, with lazy TypeScript transpilation and bounded timeout, output, snapshot, pending-call, and catalog-search limits.
  • Injects a small guest API: tools.search(...), tools.call(...), text(...), json(...), and yield_control(...).
  • Hides the normal enabled tool catalog behind code mode when tools.codeMode.enabled === true, leaving only exec and wait provider-visible.
  • Integrates code mode with embedded-runner replay/transcript persistence and OpenAI Responses tool-surface enforcement.
  • Adds config schema/help/labels for tools.codeMode and documents the feature at docs/reference/code-mode.md.
  • Registers the worker build output and dependency metadata needed for packaged installs.

Runtime Contract

Code mode activates only when explicitly enabled and when the run installs the code-mode controls. Plain config objects such as tools.codeMode: { timeoutMs: 5000 } do not implicitly enable it.
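
As a minimal sketch of the config half of this rule (the run must also install the code-mode controls, which is not modeled here), the check could look like the following. The `ToolsConfig` and `CodeModeConfig` shapes and the `isCodeModeEnabled` helper are hypothetical names for illustration; only `enabled` and `timeoutMs` appear in this PR's text.

```typescript
// Hypothetical config shapes; only `enabled` and `timeoutMs` are named in the PR.
interface CodeModeConfig {
  enabled?: boolean;
  timeoutMs?: number;
}

interface ToolsConfig {
  codeMode?: CodeModeConfig;
}

// Code mode counts as enabled only when `enabled` is literally `true`.
// The mere presence of a tools.codeMode object is not enough.
function isCodeModeEnabled(tools: ToolsConfig | undefined): boolean {
  return tools?.codeMode?.enabled === true;
}

console.log(isCodeModeEnabled({ codeMode: { timeoutMs: 5000 } })); // false
console.log(isCodeModeEnabled({ codeMode: { enabled: true, timeoutMs: 5000 } })); // true
```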

When active, the model can search and call hidden enabled tools from bounded guest code, but those calls still route through OpenClaw's existing tool execution path. Raw no-tool runs, disableTools, and empty allowlists do not trigger code-mode provider-surface enforcement.
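
The surface rule described above can be sketched as a pure function. This is an illustration under assumptions, not the code in src/agents/code-mode.ts: `SurfaceInput` and `providerVisibleTools` are hypothetical names, and the real enforcement lives in the embedded runner and the OpenAI Responses wrappers.

```typescript
// Hypothetical input shape for illustration only.
interface SurfaceInput {
  codeModeEnabled: boolean;
  disableTools: boolean;
  enabledToolNames: string[];
}

const CODE_MODE_CONTROLS = ["exec", "wait"] as const;

// Returns the tool names the provider is allowed to see.
function providerVisibleTools(input: SurfaceInput): string[] {
  // No-tool runs and empty allowlists stay empty; code mode does not add controls.
  if (input.disableTools || input.enabledToolNames.length === 0) {
    return [];
  }
  // Without code mode, the normal catalog is exposed unchanged.
  if (!input.codeModeEnabled) {
    return [...input.enabledToolNames];
  }
  // With code mode, the catalog is hidden behind the two controls.
  return [...CODE_MODE_CONTROLS];
}

console.log(
  providerVisibleTools({
    codeModeEnabled: true,
    disableTools: false,
    enabledToolNames: ["session_status"],
  }),
); // [ 'exec', 'wait' ]
```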

Suspended runs are bounded before persistence, and completed outputs are bounded before returning to the model. The worker URL resolver handles built package layouts so dist/agents/code-mode.js resolves the sibling dist/agents/code-mode.worker.js entry.
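
One way such a sibling lookup can work is to resolve the worker entry relative to the URL of the parent module. The sketch below is an assumption about the approach, not the resolver in this PR: `resolveWorkerUrl` is a hypothetical helper, and the `.ts` branch for unbuilt source layouts is also an assumption.

```typescript
// Hypothetical helper: resolves the worker entry that sits next to the
// module identified by `moduleUrl` (for example, the value of import.meta.url).
function resolveWorkerUrl(moduleUrl: string): URL {
  // Assumption: source layouts use .ts, built layouts use .js.
  const ext = moduleUrl.endsWith(".ts") ? ".ts" : ".js";
  // A relative URL replaces the last path segment of the base URL.
  return new URL(`./code-mode.worker${ext}`, moduleUrl);
}

console.log(resolveWorkerUrl("file:///app/dist/agents/code-mode.js").href);
// file:///app/dist/agents/code-mode.worker.js
```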

Verification

  • pnpm deadcode:dependencies
  • pnpm lint --threads=8
  • pnpm check:test-types
  • pnpm test src/agents/code-mode.test.ts src/agents/pi-embedded-runner/openai-stream-wrappers.test.ts src/agents/pi-embedded-runner/run/attempt.test.ts
  • pnpm build
  • pnpm check:docs
  • git diff --check origin/main...HEAD

Real behavior proof

Behavior addressed: Opt-in OpenClaw code mode exposes only exec and wait to the OpenAI Responses model while nested OpenClaw tools remain available behind the hidden catalog bridge.

Real environment tested: Local OpenClaw embedded Pi agent on macOS from rebased PR head bd4da03e68dfeb4e8a899a4d059e9f2769184290, isolated OPENCLAW_STATE_DIR, OpenAI Responses provider, live openai/gpt-5.5, real OPENAI_API_KEY from the shell.

Exact steps or command run after this patch: Ran pnpm openclaw agent --local --agent main --session-id code-mode-live-ship-<timestamp> --model openai/gpt-5.5 --thinking low --json --timeout 240 with tools.allow=["session_status"], tools.codeMode.enabled=true, OPENCLAW_AGENT_HARNESS_FALLBACK=none, OPENCLAW_DEBUG_CODE_MODE=1, and OPENCLAW_DEBUG_MODEL_PAYLOAD=tools.

Evidence after fix: Copied live terminal output from the run:

live_state=/tmp/openclaw-code-mode-live-ship.GrMeEM
assistant=CODEMODE_LIVE_SHIP_OK exec_used=true wait_seen=true session_status_seen=true
transcripts=4
live_code_mode_ship_e2e=passed

Observed result after fix: Provider payload debug showed tools=count=2 names=exec,wait; the saved transcripts contained provider-visible exec and wait plus nested session_status; the assistant returned exactly CODEMODE_LIVE_SHIP_OK.

What was not tested: No additional live providers beyond OpenAI Responses/gpt-5.5 in this final ship rerun; local focused tests and check:changed cover the non-live runtime hardening paths.

@openclaw-barnacle openclaw-barnacle Bot added the docs, agents, size: XL, and maintainer labels May 11, 2026
@steipete steipete marked this pull request as ready for review May 11, 2026 08:41

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd8ec3e1ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/code-mode.ts Outdated
}

function rejectsModuleAccess(code: string): boolean {
return /(^|[^\w$])import\s*(?:\(|[\s{*]|\w)|(^|[^\w$])require\s*\(/u.test(code);

[P1] Narrow module-access regex to avoid blocking valid code

The new module-access guard rejects many safe snippets because the import branch matches any identifier that starts with import (for example const important = 1 or even string content containing important). In rejectsModuleAccess, import\s*(?:\(|[\s{*]|\w) allows \w immediately after import, so regular code is incorrectly treated as forbidden module access and exec returns code mode module access is disabled. This makes code mode fail on common user programs and should be restricted to actual import syntax (or replaced with a parser-based check).
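
A minimal sketch of one possible narrowing, assuming the guard stays regex-based: require that `import` ends at an identifier boundary, so names like `important` no longer match. This is not the fix that landed in the PR, and it still matches `import` inside string or template text unless literals are masked first.

```typescript
// Sketch: `import` must not be followed by another identifier character.
// The `require(` branch is unchanged from the original pattern.
const MODULE_ACCESS = /(^|[^\w$])import(?![\w$])|(^|[^\w$])require\s*\(/u;

function rejectsModuleAccess(code: string): boolean {
  return MODULE_ACCESS.test(code);
}

console.log(rejectsModuleAccess("const important = 1")); // false
console.log(rejectsModuleAccess('import fs from "fs"')); // true
console.log(rejectsModuleAccess('await import("fs")')); // true
console.log(rejectsModuleAccess('require("fs")')); // true
```

A parser-based check, as the comment suggests, would avoid the remaining false positives at the cost of a heavier dependency.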

@clawsweeper

clawsweeper Bot commented May 11, 2026

Contributor

Codex review: needs real behavior proof before merge.

Summary
The PR adds an opt-in generic OpenClaw code mode with exec/wait controls, a QuickJS-WASI worker, hidden tool-catalog bridging, config/schema/docs/tests, build entries, and a new quickjs-wasi dependency.

Reproducibility: yes, for the review findings: source inspection shows maxOutputBytes is not available inside the worker before returning values/output, and the module regex still matches unmasked template literal text. I did not run the PR branch because this review is read-only.

Real behavior proof
Needs real behavior proof before merge: Needs contributor action: the PR body lists tests/checks only and does not show a redacted after-fix real code-mode run for the latest head; terminal output, logs, screenshots, recordings, or linked artifacts are acceptable after private details are redacted. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
Manual review is required because this protected-label PR adds a security-sensitive code-execution/config surface and lacks contributor real behavior proof; the source findings are actionable but not enough for automation to merge or close.

Security
Needs attention: the new code-execution worker can return oversized guest data to the parent before the configured output limit is enforced.

Review findings

  • [P1] Enforce output caps inside the worker — src/agents/code-mode.worker.ts:385-388
  • [P2] Mask template literals before scanning imports — src/agents/code-mode.ts:369
Review details

Best possible solution:

Keep the feature opt-in, but enforce all resource limits inside the worker before postMessage, fix source scanning false positives, add real-run proof, and require maintainer/security sign-off before merge.

Do we have a high-confidence way to reproduce the issue?

Yes for the review findings: source inspection shows maxOutputBytes is not available inside the worker before returning values/output, and the module regex still matches unmasked template literal text. I did not run the PR branch because this review is read-only.

Is this the best way to solve the issue?

No, not as-is. The direction is plausible, but the maintainable path is to move serialized output enforcement into the worker, replace or improve the source scanner, then validate a real code-mode run before merge.

Full review comments:

  • [P1] Enforce output caps inside the worker — src/agents/code-mode.worker.ts:385-388
    The worker returns completed value and output to the parent before maxOutputBytes is checked, and its config does not include that limit. A hostile cell can therefore send data far above the configured output cap across worker_threads first, defeating the resource boundary the feature is trying to provide.
    Confidence: 0.88
  • [P2] Mask template literals before scanning imports — src/agents/code-mode.ts:369
    The module-access scanner masks comments and single/double-quoted strings, but leaves template literal text intact. Safe code such as const msg = `import docs later`; still matches the import regex and fails before QuickJS runs.
    Confidence: 0.86
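
The first finding can be illustrated with a sketch of a worker-side check that measures the serialized result before anything is posted to the parent. `boundSerializedOutput`, the `BoundedResult` shape, and the `output_limit_exceeded` code are hypothetical; only the `maxOutputBytes` limit is named in the review.

```typescript
const encoder = new TextEncoder();

// Hypothetical result shape for illustration.
type BoundedResult =
  | { ok: true; json: string; bytes: number }
  | { ok: false; code: "output_limit_exceeded"; bytes: number; limit: number };

// Serializes inside the worker and refuses to hand back oversized payloads,
// so the parent never receives more than `maxOutputBytes` of guest data.
function boundSerializedOutput(value: unknown, maxOutputBytes: number): BoundedResult {
  // JSON.stringify returns undefined for values such as `undefined`.
  const json = JSON.stringify(value) ?? "null";
  // Measure UTF-8 bytes, not UTF-16 code units.
  const bytes = encoder.encode(json).length;
  if (bytes > maxOutputBytes) {
    return { ok: false, code: "output_limit_exceeded", bytes, limit: maxOutputBytes };
  }
  return { ok: true, json, bytes };
}

console.log(boundSerializedOutput("x".repeat(100), 10).ok); // false
console.log(boundSerializedOutput({ a: 1 }, 10).ok); // true
```

The serialization itself still costs memory inside the worker, so this complements the worker's memory limit and does not replace it.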

Overall correctness: patch is incorrect
Overall confidence: 0.88

Security concerns:

  • [medium] Output cap is enforced after worker transfer — src/agents/code-mode.worker.ts:385
    Because maxOutputBytes is not part of the worker config and completed/waiting results are posted back before parent-side checks, model-authored code can push much larger serialized data through the worker boundary than operators configured.
    Confidence: 0.88

Acceptance criteria:

  • node scripts/run-vitest.mjs src/agents/code-mode.test.ts
  • node scripts/run-vitest.mjs src/agents/pi-embedded-runner/openai-stream-wrappers.test.ts src/agents/pi-embedded-runner/run/attempt.test.ts
  • pnpm build
  • pnpm check:docs
  • pnpm deadcode:dependencies

What I checked:

  • protected PR state: The provided live PR context shows the PR is open at head d8d8f75 with labels including maintainer, dependencies-changed, agents, commands, docs, and size: XL. (d8d8f7553785)
  • current main does not implement generic code mode: Current main has Tool Search and code-mode diagnostics references, but no tools.codeMode config schema, no src/agents/code-mode.ts, and no generic exec/wait QuickJS implementation matching this PR. (3f80f889fab7)
  • new feature surface in PR head: The PR head adds code-mode control names, resource defaults, active run state, config resolution, and worker integration for the new exec/wait surface. (src/agents/code-mode.ts:30, d8d8f7553785)
  • worker output limit gap: The worker-side CodeModeConfig lacks maxOutputBytes and runExec returns completed value/output directly, so oversized guest output crosses worker_threads before the parent enforces the limit. (src/agents/code-mode.worker.ts:385, d8d8f7553785)
  • parent-side limit is after worker response: The parent calls runCodeModeWorker first and only then enforces maxOutputBytes through enforceResultLimit/enforceOutputLimit. (src/agents/code-mode.ts:655, d8d8f7553785)
  • module guard false positive: maskCodeLiteralsAndComments handles comments and single/double-quoted strings but not template literal text; the import regex therefore still rejects safe template strings containing text like import docs later. (src/agents/code-mode.ts:369, d8d8f7553785)

Likely related people:

  • steipete: Introduced the current Tool Search runtime on main and has recent model-transport/debugging and dependency work in the same runtime surface; this PR also extends that surface. (role: feature owner and recent area contributor; confidence: high; commits: 93acb3815958, b9185703bc6a, 15cf49222f92; files: src/agents/tool-search.ts, src/agents/model-transport-debug.ts, src/agents/pi-embedded-runner/openai-stream-wrappers.ts)
  • jalehman: Recent main-branch work stabilized code-mode follow-up tool display and replay, touching the same embedded-runner and transcript/display boundaries this PR depends on. (role: recent adjacent owner; confidence: medium; commits: 4bfd7416f0f9; files: src/agents/pi-embedded-runner/run/attempt.ts, src/agents/tool-display-common.ts, src/agents/session-file-repair.test.ts)
  • vincentkoc: Recent commits changed embedded-runner attempt behavior around bootstrap/context handling, an adjacent hot path for code-mode activation. (role: recent embedded runner contributor; confidence: medium; commits: a47132734ba2, af3d9333aa8a; files: src/agents/pi-embedded-runner/run/attempt.ts)
  • scoootscooob: Recent main-branch work projected Tool Search target calls in transcripts, which is relevant to hidden catalog calls and replay semantics. (role: Tool Search transcript contributor; confidence: medium; commits: 769b11732bea; files: src/agents/tool-search.ts)

Remaining risk / open question:

  • No redacted after-fix real-run proof is present for the latest head, so tests alone do not prove worker packaging and provider-surface behavior in a real agent run.
  • The PR adds a new model-authored code execution surface and dependency, so maintainer/security acceptance is needed even after narrow code findings are fixed.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 3f80f889fab7.

@akrimm702 akrimm702 left a comment

Contributor

I found one issue that looks worth addressing before merging.

Comment thread src/agents/code-mode.worker.ts Outdated
const output = takeOutput(vm);
const resultHandle = getResultHandle(vm);
try {
if (

Contributor

[P2] Fail instead of suspending with no bridge requests

When Code Mode user code awaits a promise that is not backed by a bridge call, for example await new Promise(() => {}), this branch returns waiting with an empty pendingRequests list. The wait tool then restores the snapshot, reaches the same pending-promise/no-request state again, and returns another empty waiting result. At that point the model has no actionable pending tool call and can only remain stuck until the snapshot expires.

Please only suspend when there is at least one host bridge request to settle. Otherwise, fail cleanly with a timeout or unresolved-pending-promise error.
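
A sketch of the requested decision, under assumed names: `BridgeRequest`, `CellOutcome`, `settlePendingCell`, and the `unresolved_promise` code are hypothetical and only illustrate the branch.

```typescript
// Hypothetical shapes for illustration.
interface BridgeRequest {
  id: string;
  tool: string;
}

type CellOutcome =
  | { status: "waiting"; pendingRequests: BridgeRequest[] }
  | { status: "failed"; code: "unresolved_promise"; error: string };

// Called when the guest program is blocked on a pending promise.
function settlePendingCell(pendingRequests: BridgeRequest[]): CellOutcome {
  if (pendingRequests.length === 0) {
    // Nothing on the host side can ever settle this promise, so suspending
    // would leave the model with no actionable work.
    return {
      status: "failed",
      code: "unresolved_promise",
      error: "guest code awaited a promise that no host bridge request can settle",
    };
  }
  return { status: "waiting", pendingRequests };
}

console.log(settlePendingCell([]).status); // failed
console.log(settlePendingCell([{ id: "1", tool: "session_status" }]).status); // waiting
```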

@steipete steipete force-pushed the codex/code-mode-runtime branch 4 times, most recently from 1bb1d29 to f61a088 Compare May 11, 2026 09:42

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f61a088a9d

Comment thread src/agents/code-mode.ts Outdated
Comment on lines +673 to +674
const ready = await waitForPending(state.pending, state.config.timeoutMs);
if (!ready) {

[P1] Guard code-mode waits against concurrent resume races

runWait reads shared state from activeRuns, then awaits waitForPending(...) before removing the run from the map; two parallel wait calls for the same runId can both pass this point and both resume the same snapshot. In that scenario, the guest program is resumed twice and nested tool calls can execute twice, which can duplicate side effects on external tools. Mark the run as in-progress (or remove it) before the await, with rollback if still waiting.
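
The suggested remove-before-await pattern might look like the sketch below. `RunState`, `claimRun`, `restoreRun`, and the string results are hypothetical simplifications, not the PR's actual types; the point is that the claim happens synchronously, before the first `await`.

```typescript
// Hypothetical run state for illustration.
interface RunState {
  runId: string;
  snapshot: Uint8Array;
}

const activeRuns = new Map<string, RunState>();

// Synchronously takes ownership of a run. A second caller sees undefined.
function claimRun(runId: string): RunState | undefined {
  const state = activeRuns.get(runId);
  if (state !== undefined) {
    activeRuns.delete(runId);
  }
  return state;
}

// Rollback: puts the run back when it is still waiting.
function restoreRun(state: RunState): void {
  activeRuns.set(state.runId, state);
}

async function runWait(
  runId: string,
  waitForPending: (state: RunState) => Promise<boolean>,
): Promise<"resumed" | "waiting" | "not_found"> {
  const state = claimRun(runId); // claim before any await
  if (state === undefined) {
    return "not_found";
  }
  let ready = false;
  try {
    ready = await waitForPending(state);
  } finally {
    if (!ready) {
      restoreRun(state); // also covers a thrown error
    }
  }
  return ready ? "resumed" : "waiting";
}
```

Because JavaScript runs each synchronous section to completion, two parallel `runWait` calls cannot both get the same state from `claimRun`.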

@yaanfpv

yaanfpv commented May 11, 2026

Contributor

Tested this against the built worker (dist/agents/code-mode.worker.js) on macOS Node 26. Three things on top of clawsweeper's review that might be worth a look.

Clawsweeper P1 and P2 both reproduce live. const important = 1 is rejected with "code mode module access is disabled" in 3ms (the regex \w branch matches import followed by any word char, so any identifier starting with import blows up before reaching QuickJS). And await new Promise(() => {}) returns status: "waiting" with pendingRequests: [] and a 1.3MB snapshot from the built worker. Each subsequent wait call sets a fresh expiresAt: now + snapshotTtlSeconds * 1000 in snapshotState (src/agents/code-mode.ts:530), so the snapshot effectively never times out as long as the model keeps polling.

Timeout error message is empty. Triggered the QuickJS interrupt handler with while (true) {} against the built worker and got back:

{
  "status": "failed",
  "error": "    at <anonymous> (openclaw-code-mode:user.js:1:1)\n    at <eval> (openclaw-code-mode:user.js:3:1)\n",
  "code": "internal_error",
  "output": []
}

No "timeout exceeded" string, just an empty stack frame. Operators triaging a code-mode failure will not have an obvious signal that the cause was the timeout vs a generic JS error. Probably worth raising a typed error in the interruptHandler path in src/agents/code-mode.worker.ts:236.
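
One way to surface the cause, sketched under assumptions: record that the deadline fired inside the interrupt callback, then map the resulting failure to a typed code. `createDeadline`, `describeFailure`, and the `timeout` code are hypothetical names; the actual QuickJS interrupt hook wiring is not shown.

```typescript
interface Deadline {
  // Intended to be called from the engine's interrupt handler.
  shouldInterrupt(): boolean;
  readonly fired: boolean;
}

function createDeadline(timeoutMs: number, now: () => number = Date.now): Deadline {
  const expiresAt = now() + timeoutMs;
  let fired = false;
  return {
    shouldInterrupt(): boolean {
      if (now() >= expiresAt) {
        fired = true;
      }
      return fired;
    },
    get fired(): boolean {
      return fired;
    },
  };
}

interface Failure {
  status: "failed";
  code: "timeout" | "internal_error";
  error: string;
}

// Prefers the explicit timeout signal over whatever the engine threw.
function describeFailure(err: unknown, deadline: Deadline, timeoutMs: number): Failure {
  if (deadline.fired) {
    return {
      status: "failed",
      code: "timeout",
      error: `code mode timeout exceeded after ${timeoutMs}ms`,
    };
  }
  const message = err instanceof Error ? err.message : String(err);
  return { status: "failed", code: "internal_error", error: message };
}
```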

activeRuns has no max-cap. src/agents/code-mode.ts:115 is a plain Map<string, CodeModeRunState>() with eviction only via TTL or completion (removeExpiredRuns and activeRuns.delete after resume). Combined with P2 above, a misbehaving model could enter the empty-wait loop, refresh TTL each call, and grow the map until snapshot bytes pressure the gateway. snapshotTtlSeconds defaults to 900, so this is bounded in time but unbounded in concurrent count. A simple max-entries guard inside removeExpiredRuns would close it.
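
A sketch of such a guard, assuming entries carry an `expiresAt` timestamp: `pruneRuns`, `RunEntry`, and `maxEntries` are hypothetical names, and evicting the oldest insertion first is one possible policy among several.

```typescript
// Hypothetical minimal entry shape.
interface RunEntry {
  expiresAt: number;
}

function pruneRuns<T extends RunEntry>(
  runs: Map<string, T>,
  maxEntries: number,
  now: number,
): void {
  // First pass: drop expired runs, as the TTL-based eviction already does.
  for (const [id, run] of runs) {
    if (run.expiresAt <= now) {
      runs.delete(id);
    }
  }
  // Second pass: enforce the cap. A Map iterates in insertion order,
  // so the first key is the oldest remaining entry.
  while (runs.size > maxEntries) {
    const oldest = runs.keys().next().value;
    if (oldest === undefined) {
      break;
    }
    runs.delete(oldest);
  }
}

const runs = new Map<string, RunEntry>([
  ["a", { expiresAt: 5 }],
  ["b", { expiresAt: 100 }],
  ["c", { expiresAt: 100 }],
]);
pruneRuns(runs, 1, 10);
console.log([...runs.keys()]); // [ 'c' ]
```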

Heads up on CI: the checks-node-core failure traces to scripts/lint-suppressions.test.ts rejecting the new // oxlint-disable-next-line unicorn/require-post-message-target-origin on src/agents/code-mode.worker.ts:478. The allowlist in that test needs the entry added.

Scope: focus of my read was parent-process state management, ctx scoping, and worker control flow against the built artifact. Did not exercise memory limits, the snapshot/output byte caps, the TypeScript transpile path, or the full pi-coding-agent integration with a live model.

@pashpashpash pashpashpash left a comment

Contributor

looks fine to me as it keeps a clean separation from codex harness mode (codex harness has its own code mode that we use).

as long as this pr keeps doing that, and generic code mode stuff doesn't leak into codex harness mode - i'm happy

akrimm702 commented May 11, 2026

Contributor

Re-reviewed after the approval. I do not think this is merge-ready yet.

The approval looks scoped to keeping generic OpenClaw code mode separate from Codex harness mode. That separation is important, but it does not clear the current runtime/code-execution blockers.

Latest merge-base recheck:

  • Current main is 8aa286476db2ae79b85155fc695f93b231e70274.
  • PR head is still f61a088a9d81c28ea6bae2e3590fab32db3f78d5.
  • The branch is 408 commits behind / 4 ahead of current main.
  • git merge-tree --write-tree main pr80600 exits non-zero with a real content conflict in docs/.generated/config-baseline.sha256.

Current blockers I still see:

  • The PR is not cleanly mergeable into current main without resolving at least the generated config-baseline conflict above.
  • Required checks are still failing on head f61a088a9d81: Real behavior proof, checks-node-core, and checks-node-core-support-boundary.
  • The existing unresolved, non-outdated review threads are still valid:
    • const important = 1; return important; still fails before QuickJS with code mode module access is disabled. because the import...\w regex branch matches identifiers beginning with import.
    • await new Promise(() => {}) still returns status: "waiting" with pendingToolCalls: []; a subsequent wait returns another empty waiting state, leaving no actionable host work for the model.
    • runWait still reads activeRuns, then awaits waitForPending(...), and only deletes the run afterward, so concurrent waits can still resume the same snapshot twice and duplicate post-await side effects.

Local verification while re-reviewing:

  • corepack pnpm test src/agents/code-mode.test.ts passes.
  • corepack pnpm test src/agents/code-mode.test.ts test/scripts/lint-suppressions.test.ts fails because the new production suppression is not in the explicit allowlist:
    src/agents/code-mode.worker.ts|unicorn/require-post-message-target-origin|1
  • A direct local smoke reproduced the two runtime behaviors above: important is rejected as module access, and the empty-promise case returns empty waiting twice while leaving an active run.

Recommendation: keep the approval as signal for the Codex-harness separation, but do not merge until the unresolved runtime safety threads, failing CI/proof gate, and current-main merge conflict are addressed.

@github-actions

Contributor

Dependency Changes Detected

This PR changes dependency-related files. Maintainers should confirm these changes are intentional.

Changed files:

  • package.json
  • pnpm-lock.yaml

Maintainer follow-up:

  • Review whether the dependency changes are intentional.
  • Inspect resolved package deltas when lockfile or workspace dependency policy changes are present.
  • Run pnpm deps:changes:report -- --base-ref origin/main --markdown /tmp/dependency-changes.md --json /tmp/dependency-changes.json locally for detailed release-style evidence.

@openclaw-barnacle openclaw-barnacle Bot added the commands Command implementations label May 14, 2026
@socket-security

socket-security Bot commented May 14, 2026

No dependency changes detected. Learn more about Socket for GitHub.

👍 No dependency changes detected in pull request

@steipete steipete force-pushed the codex/code-mode-runtime branch from d8d8f75 to bd4da03 Compare May 15, 2026 03:09
@openclaw-barnacle openclaw-barnacle Bot removed the commands Command implementations label May 15, 2026
@steipete

steipete commented May 15, 2026

Contributor Author

Verification for current head bd4da03e68dfeb4e8a899a4d059e9f2769184290:

  • Local focused tests: pnpm test src/agents/code-mode.test.ts src/agents/pi-embedded-runner/openai-stream-wrappers.test.ts src/agents/pi-embedded-runner/run/attempt.test.ts src/commands/doctor-legacy-config.migrations.test.ts test/scripts/lint-suppressions.test.ts -- --reporter=verbose
  • Local docs/config: pnpm config:docs:check, pnpm check:docs
  • Local changed gate after final rebase: env -u OPENCLAW_TESTBOX pnpm check:changed
  • Live model E2E after final rebase: pnpm openclaw agent --local --agent main --session-id code-mode-live-ship-<timestamp> --model openai/gpt-5.5 --thinking low --json --timeout 240 with isolated OPENCLAW_STATE_DIR, tools.allow=["session_status"], tools.codeMode.enabled=true, OPENCLAW_AGENT_HARNESS_FALLBACK=none, OPENCLAW_DEBUG_CODE_MODE=1, and OPENCLAW_DEBUG_MODEL_PAYLOAD=tools; observed CODEMODE_LIVE_SHIP_OK, provider-visible exec,wait, and nested session_status in transcripts.
  • GitHub: CI run 25898082897 success; CodeQL run 25898082866 success; CodeQL Critical Quality run 25898082911 success; OpenGrep run 25898082848 success; Real behavior proof run 25898090448 success.

@steipete steipete merged commit 0db0979 into main May 15, 2026
117 of 122 checks passed
@steipete steipete deleted the codex/code-mode-runtime branch May 15, 2026 03:16

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bd4da03e68

Comment thread src/agents/code-mode.ts
}
continue;
}
if (char === "'" || char === '"') {

[P2] Mask template literals before module-access scan

maskCodeLiteralsAndComments only strips ' and " strings, so backtick template literals are scanned as code by rejectsModuleAccess. That makes safe snippets like const msg = `import docs later`; fail with code mode module access is disabled even though they never import a module. Fresh evidence: this exact snippet still matches the current regex path because the template literal text is left unmasked. Please mask template-literal contents (while still scanning ${...} expressions) before running the module-access regex.
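
A simplified sketch of such a masking pass follows. It blanks template text with spaces, keeps `${...}` expressions intact, and handles nested templates. `maskTemplateText` is a hypothetical helper, not the PR's maskCodeLiteralsAndComments, and it assumes comments and quoted strings are masked by a separate pass, so backticks or braces inside those are not handled here.

```typescript
type Frame = { kind: "template" } | { kind: "expr"; depth: number };

function maskTemplateText(code: string): string {
  const out: string[] = [];
  const stack: Frame[] = [];
  let i = 0;
  while (i < code.length) {
    const char = code.charAt(i);
    const top: Frame | undefined = stack[stack.length - 1];
    if (top !== undefined && top.kind === "template") {
      if (char === "\\") {
        // Blank out the escape and the character it escapes.
        out.push(i + 1 < code.length ? "  " : " ");
        i += 2;
        continue;
      }
      if (char === "`") {
        stack.pop(); // end of this template literal
        out.push(char);
      } else if (char === "$" && code.charAt(i + 1) === "{") {
        stack.push({ kind: "expr", depth: 0 }); // back to real code
        out.push("${");
        i += 2;
        continue;
      } else {
        out.push(char === "\n" ? "\n" : " "); // keep line structure
      }
      i += 1;
      continue;
    }
    // Code context: top level or inside a ${...} expression.
    if (char === "`") {
      stack.push({ kind: "template" });
    } else if (top !== undefined && top.kind === "expr") {
      if (char === "{") {
        top.depth += 1;
      } else if (char === "}") {
        if (top.depth === 0) {
          stack.pop(); // closes the ${...} expression
        } else {
          top.depth -= 1;
        }
      }
    }
    out.push(char);
    i += 1;
  }
  return out.join("");
}

console.log(maskTemplateText("const msg = `import docs later`;").includes("import")); // false
console.log(maskTemplateText("`a ${require('x')} b`").includes("require(")); // true
```

The output keeps the same length as the input, so offsets reported by a later scan still line up with the original source.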

@steipete

Contributor Author

Landed via rebase merge onto main.

  • Source head: bd4da03e68dfeb4e8a899a4d059e9f2769184290
  • Land commit: 0db09793656809bc1c17c11946ded9da99b29acd
  • Gate: local focused tests/docs/check:changed, live OpenAI gpt-5.5 code-mode E2E, CI 25898082897, CodeQL 25898082866, CodeQL Critical Quality 25898082911, OpenGrep 25898082848, Real behavior proof 25898090448.

Labels

  • agents: Agent runtime and tooling
  • dependencies-changed: PR changes dependency-related files
  • docs: Improvements or additions to documentation
  • maintainer: Maintainer-authored PR
  • size: XL

4 participants