feat(security): runtime gateway token injection#2485
Conversation
…atibility Generate the gateway auth token at container startup instead of Docker build time. This gives each container a unique token (not shared across images, not extractable from Docker layer history) while keeping openclaw tui and all OpenClaw tools working. Root mode: inject_gateway_token() writes the token into openclaw.json before chattr +i locks the config. export_gateway_token() also exports it to OPENCLAW_GATEWAY_TOKEN env var, rc files, and a /tmp sourced file. Non-root mode: export_gateway_token() generates a fresh token and delivers it via env var + rc files + /tmp file. Cannot write openclaw.json (root:root 444) so OpenClaw reads the env var instead. Both modes ensure openclaw tui works in connect sessions because OpenClaw 2026.4.9 checks OPENCLAW_GATEWAY_TOKEN env var first. Supersedes reverted commit 51aa6af which broke openclaw tui by removing the token from openclaw.json without providing an alternative path for interactive sessions. Fixes #2480
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughImages stop baking OpenClaw gateway auth tokens. At container startup a per-instance token is generated: root mode injects it into Changes
Sequence DiagramsequenceDiagram
participant Build as Docker Build
participant Container as Container Startup
participant FS as Filesystem (.openclaw)
participant Env as Environment
participant Interactive as Interactive Session
participant Gateway as OpenClaw Gateway
Build->>FS: Write empty `gateway.auth.token` into openclaw.json
Note over Container: Container starts
Container->>Container: _read_gateway_token()
alt Root mode
Container->>Container: inject_gateway_token() (generate token)
Container->>FS: Atomically write gateway.auth.token into openclaw.json
Container->>FS: Recompute .config-hash
Container->>FS: Lock config (chattr +i)
else Non-root mode
Container->>Env: Use or generate OPENCLAW_GATEWAY_TOKEN
end
Container->>FS: Write /tmp/nemoclaw-gateway-token.env snapshot
Interactive->>FS: Source /tmp/nemoclaw-gateway-token.env
Interactive->>Gateway: Authenticate using injected token
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
🚀 Docs preview ready! |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/lib/onboard.ts (1)
7126-7130:⚠️ Potential issue | 🟡 MinorPoint the manual recovery hint at the token file.
This text sends users to
/tmp/gateway.log, but the runtime token is persisted to/tmp/nemoclaw-gateway-token.env. If automatic retrieval fails, the current guidance points at the wrong artifact.✏️ Proposed fix
- ` Token: see /tmp/gateway.log inside the sandbox, or re-run onboard.`, + ` Token: see /tmp/nemoclaw-gateway-token.env inside the sandbox, or re-run onboard.`, ... - ` append `#token`=<token> to the URL, or see /tmp/gateway.log inside the sandbox.`, + ` append `#token`=<token> to the URL, or inspect /tmp/nemoclaw-gateway-token.env inside the sandbox.`,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 7126 - 7130, The manual recovery hint currently points users to /tmp/gateway.log but the runtime token is actually written to /tmp/nemoclaw-gateway-token.env; update the two console.log messages in src/lib/onboard.ts (the token hint template strings near the token output block in onboard.ts) to reference /tmp/nemoclaw-gateway-token.env (and adjust the surrounding wording if needed) so both occurrences correctly instruct users to read that file when automatic retrieval fails.scripts/nemoclaw-start.sh (1)
716-731:⚠️ Potential issue | 🟠 MajorRemove the stale
/tmptoken file when no token is available.This branch unsets
OPENCLAW_GATEWAY_TOKENand strips rc snippets, but it leaves_GATEWAY_TOKEN_ENV_FILEbehind. Sinceproxy-env.shsources that file on every connect session, a failed generation or cleared token can still resurrect the previous credential and break TUI auth with a stale value.🧹 Minimal fix
if [ -z "$token" ]; then unset OPENCLAW_GATEWAY_TOKEN + rm -f "$_GATEWAY_TOKEN_ENV_FILE" 2>/dev/null || true for rc_file in "${_SANDBOX_HOME}/.bashrc" "${_SANDBOX_HOME}/.profile"; do if [ -f "$rc_file" ] && grep -qF "$marker_begin" "$rc_file" 2>/dev/null; then local tmp🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` around lines 716 - 731, When handling the branch where token is empty (the block that unsets OPENCLAW_GATEWAY_TOKEN and strips rc snippets), also remove the stale token env file by deleting _GATEWAY_TOKEN_ENV_FILE if it exists so proxy-env.sh cannot resurrect a stale credential; add a safe rm -f "${_GATEWAY_TOKEN_ENV_FILE:-}" (or test -n and unlink) after unsetting OPENCLAW_GATEWAY_TOKEN and before returning, referencing the existing variables OPENCLAW_GATEWAY_TOKEN and _GATEWAY_TOKEN_ENV_FILE and keeping the rest of the rc cleanup logic unchanged.
🧹 Nitpick comments (1)
docs/security/best-practices.md (1)
233-247: Split this section to one sentence per line and drop the routine bold emphasis.The new paragraph and default-row text pack multiple sentences onto single source lines, and
**not**reads like emphasis the style guide explicitly asks us to avoid in routine prose. LLM pattern detected.As per coding guidelines, "One sentence per line in source (makes diffs readable)." and "Unnecessary bold on routine instructions ... should be flagged."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/security/best-practices.md` around lines 233 - 247, Split the highlighted paragraph and the "Default" table cell so each sentence is on its own source line, and remove routine bold emphasis (e.g., change "The gateway auth token is **not** baked..." to "The gateway auth token is not baked..." and drop other unnecessary **...** formatting); update the "Runtime token generation" and "Default" sentences similarly so every sentence in that section is on a separate line and no routine instructions use bold.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 676-691: The current inline Python code in inject_gateway_token()
opens the target config (config_file/openclaw.json) with open(..., "w") which
truncates it and can leave a partial file if interrupted; instead, write the
updated JSON to a temporary file in the same directory (e.g., same basename +
suffix), flush and fsync the file descriptor and directory, then atomically
replace the original using os.replace(config_file, temp_path) to avoid
truncation windows, and only after the replace recompute the hash into
hash_file; ensure you reference the variables config_file, token, cfg, and
hash_file from the diff and use os.fsync/os.replace to make the swap durable
before computing the sha256sum.
In `@src/lib/onboard.ts`:
- Around line 6846-6852: A read/parse failure of openclaw.json inside the block
that checks result.status === 0 currently propagates to the outer catch and
prevents the subsequent env-file fallback (/tmp/nemoclaw-gateway-token.env) from
being attempted; to fix it, wrap the JSON file read/parse in a local try/catch
around the findOpenclawJsonPath/jsonPath handling (the code that sets cfg and
token), swallow/log the error locally and do not rethrow so control can continue
to the env-file fallback, leaving the rest of the function (including the
tmpDir/jsonPath logic and the later env-file check) unchanged.
---
Outside diff comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 716-731: When handling the branch where token is empty (the block
that unsets OPENCLAW_GATEWAY_TOKEN and strips rc snippets), also remove the
stale token env file by deleting _GATEWAY_TOKEN_ENV_FILE if it exists so
proxy-env.sh cannot resurrect a stale credential; add a safe rm -f
"${_GATEWAY_TOKEN_ENV_FILE:-}" (or test -n and unlink) after unsetting
OPENCLAW_GATEWAY_TOKEN and before returning, referencing the existing variables
OPENCLAW_GATEWAY_TOKEN and _GATEWAY_TOKEN_ENV_FILE and keeping the rest of the
rc cleanup logic unchanged.
In `@src/lib/onboard.ts`:
- Around line 7126-7130: The manual recovery hint currently points users to
/tmp/gateway.log but the runtime token is actually written to
/tmp/nemoclaw-gateway-token.env; update the two console.log messages in
src/lib/onboard.ts (the token hint template strings near the token output block
in onboard.ts) to reference /tmp/nemoclaw-gateway-token.env (and adjust the
surrounding wording if needed) so both occurrences correctly instruct users to
read that file when automatic retrieval fails.
---
Nitpick comments:
In `@docs/security/best-practices.md`:
- Around line 233-247: Split the highlighted paragraph and the "Default" table
cell so each sentence is on its own source line, and remove routine bold
emphasis (e.g., change "The gateway auth token is **not** baked..." to "The
gateway auth token is not baked..." and drop other unnecessary **...**
formatting); update the "Runtime token generation" and "Default" sentences
similarly so every sentence in that section is on a separate line and no routine
instructions use bold.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 59425d92-5290-4ebd-9859-0d3fe91f0133
📒 Files selected for processing (6)
.agents/skills/nemoclaw-user-configure-security/references/best-practices.mdDockerfiledocs/security/best-practices.mdscripts/nemoclaw-start.shsrc/lib/onboard.tstest/nemoclaw-start.test.ts
Adopt main's extracted generate-openclaw-config.py for config materialization while preserving the runtime-token design from this branch: - Config script writes an empty gateway auth token; entrypoint inject_gateway_token() / export_gateway_token() generate and install the real token at container startup. - Drop main's build-time secrets.token_hex(32) injection layer in favor of HEAD's clear-token RUN, which scrubs anything openclaw doctor / plugins install may auto-generate. - Update test/nemoclaw-start.test.ts: assert empty-token invariant against generate-openclaw-config.py (where the literal now lives) and check the script-call form in the Dockerfile. - Drop NEMOCLAW_BUILD_ID prefix from RUN layers (script is deterministic given env vars; clear-token is deterministic); retain the ARG so callers can still bust the cache when needed.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
7233-7237:⚠️ Potential issue | 🟡 MinorFallback token hint points to the wrong file path.
When token retrieval fails, this message points users to
/tmp/gateway.log, but token export is written to/tmp/nemoclaw-gateway-token.env. That can misdirect recovery.✏️ Proposed wording update
- console.log( - ` Token: see /tmp/gateway.log inside the sandbox, or re-run onboard.`, - ); - console.log( - ` append `#token`=<token> to the URL, or see /tmp/gateway.log inside the sandbox.`, - ); + console.log( + " Token: run `cat /tmp/nemoclaw-gateway-token.env` inside the sandbox.", + ); + console.log(" append `#token`=<token> to the URL.");🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 7233 - 7237, The fallback token hint messages incorrectly point to "/tmp/gateway.log"; update the two console.log lines that print the token recovery hint in onboard.ts so they reference the correct file "/tmp/nemoclaw-gateway-token.env" (replace `/tmp/gateway.log` in the string literals used in the backticked messages and the second console.log that says "append `#token`=<token> to the URL, or see /tmp/gateway.log inside the sandbox.") to avoid misdirection to the wrong log file.
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
6953-6959:⚠️ Potential issue | 🟠 MajorKeep env-file fallback reachable when
openclaw.jsonparse/read fails.A read/parse error here is caught by the outer
catch, which returns early and skips/tmp/nemoclaw-gateway-token.envfallback.🩹 Proposed fix
if (result.status === 0) { const jsonPath = findOpenclawJsonPath(tmpDir); if (jsonPath) { - const cfg = JSON.parse(fs.readFileSync(jsonPath, "utf-8")); - const token = cfg && cfg.gateway && cfg.gateway.auth && cfg.gateway.auth.token; - if (typeof token === "string" && token.length > 0) return token; + try { + const cfg = JSON.parse(fs.readFileSync(jsonPath, "utf-8")); + const token = cfg && cfg.gateway && cfg.gateway.auth && cfg.gateway.auth.token; + if (typeof token === "string" && token.length > 0) return token; + } catch { + // fall through to the env-file fallback below + } } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 6953 - 6959, The current block that reads and parses openclaw.json (using findOpenclawJsonPath(tmpDir), fs.readFileSync and JSON.parse) can throw and cause the outer catch to return early, skipping the /tmp/nemoclaw-gateway-token.env fallback; change the logic so reading/parsing openclaw.json is wrapped in its own try/catch (inside the result.status === 0 branch) and on any read/parse error simply ignore the JSON path and continue (do not rethrow or return), then fall back to reading the env file (/tmp/nemoclaw-gateway-token.env) if no valid token found in cfg; ensure the function still returns immediately only when a valid string token is found.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
6901-6914: Consolidate duplicate recursive search implementations.
findFileRecursive(...)andfindOpenclawJsonPath(...)now duplicate the same traversal logic. Keep one recursive implementation and wrap specialized lookups around it.♻️ Proposed refactor
function findOpenclawJsonPath(dir: string): string | null { - if (!fs.existsSync(dir)) return null; - const entries = fs.readdirSync(dir, { withFileTypes: true }); - for (const e of entries) { - const p = path.join(dir, e.name); - if (e.isDirectory()) { - const found: string | null = findOpenclawJsonPath(p); - if (found) return found; - } else if (e.name === "openclaw.json") { - return p; - } - } - return null; + return findFileRecursive(dir, "openclaw.json"); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 6901 - 6914, There are two duplicate directory-traversal implementations (findFileRecursive and findOpenclawJsonPath); remove the duplicate by keeping a single recursive utility (e.g., findFileRecursive) and make findOpenclawJsonPath simply call that utility with the appropriate target name(s) or predicates; update any callers to use the shared function and delete the redundant traversal code to avoid maintenance drift (refer to functions findFileRecursive and findOpenclawJsonPath to locate and consolidate the logic).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 7233-7237: The fallback token hint messages incorrectly point to
"/tmp/gateway.log"; update the two console.log lines that print the token
recovery hint in onboard.ts so they reference the correct file
"/tmp/nemoclaw-gateway-token.env" (replace `/tmp/gateway.log` in the string
literals used in the backticked messages and the second console.log that says
"append `#token`=<token> to the URL, or see /tmp/gateway.log inside the
sandbox.") to avoid misdirection to the wrong log file.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 6953-6959: The current block that reads and parses openclaw.json
(using findOpenclawJsonPath(tmpDir), fs.readFileSync and JSON.parse) can throw
and cause the outer catch to return early, skipping the
/tmp/nemoclaw-gateway-token.env fallback; change the logic so reading/parsing
openclaw.json is wrapped in its own try/catch (inside the result.status === 0
branch) and on any read/parse error simply ignore the JSON path and continue (do
not rethrow or return), then fall back to reading the env file
(/tmp/nemoclaw-gateway-token.env) if no valid token found in cfg; ensure the
function still returns immediately only when a valid string token is found.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 6901-6914: There are two duplicate directory-traversal
implementations (findFileRecursive and findOpenclawJsonPath); remove the
duplicate by keeping a single recursive utility (e.g., findFileRecursive) and
make findOpenclawJsonPath simply call that utility with the appropriate target
name(s) or predicates; update any callers to use the shared function and delete
the redundant traversal code to avoid maintenance drift (refer to functions
findFileRecursive and findOpenclawJsonPath to locate and consolidate the logic).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 76d0da82-54f4-4081-adbe-a64e3f57193a
📒 Files selected for processing (4)
Dockerfilescripts/nemoclaw-start.shsrc/lib/onboard.tstest/nemoclaw-start.test.ts
Drives `openclaw tui` as a proof-of-life canary across the build-baked-token → runtime-injected-token transition (issue #2480): - Builds a sandbox from the pre-PR commit (token baked via secrets.token_hex at build time), runs TUI smoke test, captures the build-baked token. - Confirms the pre-upgrade token is stable across container restarts (build-baked behavior baseline). - Rebuilds with current source via `nemoclaw rebuild --yes`, runs TUI smoke test, asserts the post-upgrade token differs from the pre-upgrade token. - Cross-checks the host-side `nemoclaw gateway-token` command matches the in-sandbox token. - Restarts the sandbox and asserts the token rotates again (proves the runtime injection mechanism, not just the layer cache busting). The TUI smoke test mirrors OpenClaw 2026.4.9's resolution order (env var → gateway.auth.token in config) so a "Missing gateway auth token" regression fails the test fast and points at the correct subsystem. Wired into nightly-e2e.yaml with fetch-depth: 0 so the worktree add of the pre-ref commit succeeds on the runner.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-gateway-token-upgrade.sh`:
- Around line 95-123: The test's sandbox call is masking failures (the "|| true"
after the openshell sandbox exec) and the token validation is too lax; remove
the "|| true" (or capture and assert the exec's exit status) so failures cause
the test to fail, then strengthen the token check on the variable token to
assert it matches the expected format (e.g., a 64-character hex string like
/^[0-9a-f]{64}$/) instead of only length >=16; update the tui_smoke() flow that
sets resolved/token and the conditional that currently allows stderr to be
treated as a token to explicitly fail when the sandbox exec fails or when token
does not match the hex regex.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0a48e7e4-4234-48f3-af7b-50e89930b57f
📒 Files selected for processing (2)
.github/workflows/nightly-e2e.yamltest/e2e/test-gateway-token-upgrade.sh
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
6359-6363:⚠️ Potential issue | 🟡 MinorFallback guidance appears to reference the wrong recovery artifact.
The message now points users to
/tmp/gateway.log, but the runtime fallback source written by startup is/tmp/nemoclaw-gateway-token.env. This can send users to a dead end when token auto-fetch fails.🧭 Suggested wording update
- ` Token: see /tmp/gateway.log inside the sandbox, or re-run onboard.`, + ` Token: see /tmp/nemoclaw-gateway-token.env inside the sandbox, or re-run onboard.`, ); console.log( - ` append `#token`=<token> to the URL, or see /tmp/gateway.log inside the sandbox.`, + ` append `#token`=<token> to the URL (from OPENCLAW_GATEWAY_TOKEN in that env file).`, );🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 6359 - 6363, Update the fallback guidance strings that currently reference `/tmp/gateway.log` to point to the actual token file `/tmp/nemoclaw-gateway-token.env`; locate the two console.log calls in src/lib/onboard.ts that print the token recovery hints (the lines containing the `/tmp/gateway.log` literal) and replace the path in both messages so they read something like "see /tmp/nemoclaw-gateway-token.env inside the sandbox" while keeping the existing guidance about appending `#token=<token>` to the URL.
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
6088-6094:⚠️ Potential issue | 🟠 MajorJSON parse errors still prevent the env-file fallback path.
At Line 6091, a read/parse failure throws to the outer
catch(Line 6112), so step 2 (/tmp/nemoclaw-gateway-token.env) is never attempted.🩹 Suggested fix
if (result.status === 0) { const jsonPath = findOpenclawJsonPath(tmpDir); if (jsonPath) { - const cfg = JSON.parse(fs.readFileSync(jsonPath, "utf-8")); - const token = cfg && cfg.gateway && cfg.gateway.auth && cfg.gateway.auth.token; - if (typeof token === "string" && token.length > 0) return token; + try { + const cfg = JSON.parse(fs.readFileSync(jsonPath, "utf-8")); + const token = cfg && cfg.gateway && cfg.gateway.auth && cfg.gateway.auth.token; + if (typeof token === "string" && token.length > 0) return token; + } catch { + // Continue to env-file fallback. + } } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 6088 - 6094, The JSON read/parse in the openclaw JSON block can throw and short-circuit the env-file fallback; wrap the file read+JSON.parse for the path returned by findOpenclawJsonPath(tmpDir) in a local try/catch so parse/read errors are swallowed (or logged) and control falls through to the subsequent env-file fallback (do not return or rethrow on parse failure). Specifically, locate the block that calls findOpenclawJsonPath(tmpDir), JSON.parse(fs.readFileSync(...)) and extracts gateway.auth.token, and change it so only a successfully parsed string token causes an immediate return, while any exception simply skips to the next step.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 6359-6363: Update the fallback guidance strings that currently
reference `/tmp/gateway.log` to point to the actual token file
`/tmp/nemoclaw-gateway-token.env`; locate the two console.log calls in
src/lib/onboard.ts that print the token recovery hints (the lines containing the
`/tmp/gateway.log` literal) and replace the path in both messages so they read
something like "see /tmp/nemoclaw-gateway-token.env inside the sandbox" while
keeping the existing guidance about appending `#token=<token>` to the URL.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 6088-6094: The JSON read/parse in the openclaw JSON block can
throw and short-circuit the env-file fallback; wrap the file read+JSON.parse for
the path returned by findOpenclawJsonPath(tmpDir) in a local try/catch so
parse/read errors are swallowed (or logged) and control falls through to the
subsequent env-file fallback (do not return or rethrow on parse failure).
Specifically, locate the block that calls findOpenclawJsonPath(tmpDir),
JSON.parse(fs.readFileSync(...)) and extracts gateway.auth.token, and change it
so only a successfully parsed string token causes an immediate return, while any
exception simply skips to the next step.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 38a2f7ba-7e06-49da-a159-f0b6b456a94f
📒 Files selected for processing (1)
src/lib/onboard.ts
- inject_gateway_token writes openclaw.json via tempfile + fsync +
os.replace so an interrupted root-mode startup cannot leave a
truncated config plus stale hash that would brick the next boot.
- export_gateway_token's empty-token branch now removes the /tmp
token env file so proxy-env.sh cannot resurrect a stale credential
on the next connect session.
- fetchGatewayAuthTokenFromSandbox catches openclaw.json read/parse
failures locally so the /tmp/nemoclaw-gateway-token.env fallback
still runs when the JSON path is unreadable or malformed.
- Manual recovery hint in onboard.ts points at
/tmp/nemoclaw-gateway-token.env (the actual token file) instead of
/tmp/gateway.log.
- tui_smoke captures the openshell exec exit status separately
(no `|| true` masking) and asserts the resolved token matches the
exact secrets.token_hex(32) format (^[0-9a-fA-F]{64}$) so a stderr
fragment cannot satisfy a length-only check.
- best-practices.md (and the skills mirror): split the new paragraph
to one sentence per line and drop the routine bold emphasis on
`**not**` per the docs style guide.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
scripts/nemoclaw-start.sh (1)
677-708:⚠️ Potential issue | 🟠 MajorMake the config rename durable before updating
.config-hash.The temp-file +
os.replace()closes the truncation window, but the containing directory is neverfsync'd. A crash after the replace but before that directory entry is durable can leave the oldopenclaw.jsonalongside a new.config-hash, which will tripverify_config_integrity()on the next boot.💡 Minimal durability fix
python3 - "$config_file" "$token" <<'PYINJECT' import json, os, sys, tempfile config_file, token = sys.argv[1], sys.argv[2] +dir_name = os.path.dirname(config_file) or "." +dir_fd = os.open(dir_name, os.O_RDONLY) with open(config_file) as f: cfg = json.load(f) cfg.setdefault("gateway", {}).setdefault("auth", {})["token"] = token -dir_name = os.path.dirname(config_file) or "." fd, tmp_path = tempfile.mkstemp(dir=dir_name, prefix=".openclaw.json.") try: with os.fdopen(fd, "w") as f: json.dump(cfg, f, indent=2) f.flush() os.fsync(f.fileno()) os.chmod(tmp_path, 0o444) os.replace(tmp_path, config_file) + os.fsync(dir_fd) except Exception: try: os.unlink(tmp_path) except FileNotFoundError: pass raise +finally: + os.close(dir_fd) PYINJECT🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` around lines 677 - 708, The atomic replace leaves the directory entry not yet durable; after calling os.replace(tmp_path, config_file) in the Python block, open the containing directory (use os.open(dir_name, os.O_RDONLY) or os.O_DIRECTORY) and call os.fsync(dir_fd) then close the fd so the rename is flushed to disk before the script updates the .config-hash; this ensures the directory entry for openclaw.json is durable prior to running the sha256sum step. Reference: the Python temp-file flow using tempfile.mkstemp, os.replace(tmp_path, config_file), and the subsequent sha256sum update; add the dir fsync right after os.replace and before leaving the try block.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 677-708: The atomic replace leaves the directory entry not yet
durable; after calling os.replace(tmp_path, config_file) in the Python block,
open the containing directory (use os.open(dir_name, os.O_RDONLY) or
os.O_DIRECTORY) and call os.fsync(dir_fd) then close the fd so the rename is
flushed to disk before the script updates the .config-hash; this ensures the
directory entry for openclaw.json is durable prior to running the sha256sum
step. Reference: the Python temp-file flow using tempfile.mkstemp,
os.replace(tmp_path, config_file), and the subsequent sha256sum update; add the
dir fsync right after os.replace and before leaving the try block.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 67db3189-ab34-4571-8e6e-9c2b9fcea01c
📒 Files selected for processing (5)
.agents/skills/nemoclaw-user-configure-security/references/best-practices.mddocs/security/best-practices.mdscripts/nemoclaw-start.shsrc/lib/onboard.tstest/e2e/test-gateway-token-upgrade.sh
🚧 Files skipped from review as they are similar to previous changes (4)
- docs/security/best-practices.md
- .agents/skills/nemoclaw-user-configure-security/references/best-practices.md
- src/lib/onboard.ts
- test/e2e/test-gateway-token-upgrade.sh
The first nightly run (run #25009819637) failed at Phase 3 with:
status: InvalidArgument, message: "command argument 2 contains
newline or carriage return characters"
`openshell sandbox exec`'s gRPC layer rejects any argument that
contains a newline or carriage return, so the bash heredoc that
used to feed an inline python script into the sandbox was always
going to fail in CI.
- Replace the heredoc-fed in-sandbox python with two single-line
calls: a `printf %s "${OPENCLAW_GATEWAY_TOKEN}"` env-var probe
and a host-side parse of the openclaw.json that we pull via
`openshell sandbox download`. Both are single-line commands the
exec layer accepts.
- Single-line the `openclaw tui` invocation as well (no heredoc).
- Switch `tui_smoke` to return its result via the
TUI_RESULT_TOKEN global. `pass()` writes to stdout, so capturing
the function with `$(...)` mixed the [PASS] log line into the
"token" string and would have made every cross-phase token
comparison fail even on a passing run.
…ntext The optimised sandbox build context staging in `stageOptimizedSandboxBuildContext` only copied `nemoclaw-start.sh` and `lib/sandbox-init.sh` into the staged scripts directory, but the Dockerfile (line 195) also COPYs `scripts/generate-openclaw-config.py`. The script was added in ca46f47 ("feat: extract inline Python config to scripts/generate-openclaw-config.py") which updated the Dockerfile but not the staging function, so `nemoclaw onboard` fails at the COPY step. The legacy staging path (`stageLegacySandboxBuildContext`) recursively copies the entire `scripts/` directory and is therefore unaffected. Also extend the optimised-staging test to assert both `generate-openclaw-config.py` and `lib/sandbox-init.sh` are present in the staged context, so future Dockerfile additions that miss the staging function will be caught immediately. Fixes #2503 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Phase 6 of the previous run failed with "token did not rotate across restart" 14ms after issuing `openshell sandbox restart` — the command returns before the pod has actually transitioned, so the wait loop saw "still Ready" immediately and read the pre-restart token back. - Extract fetch_resolved_token() so token resolution can be reused by both the assertive tui_smoke path and a non-fatal poll. - Add wait_for_token_change() that polls the resolved token until it differs from a baseline (deadline 90s). Token-rotation is the ground truth we care about — observing it is unambiguous proof the entrypoint re-ran. - Phase 6 now waits for rotation via wait_for_token_change, then runs a final tui_smoke against the rotated container. - Phase 3's redundant "stable across restart" sub-check is dropped: build-baked tokens look stable even when `sandbox restart` is a no-op, so the assertion couldn't actually catch what it claimed to catch.
There is no `openshell sandbox restart` subcommand — the openshell
CLI exposes create / get / list / delete / exec / connect /
upload / download / ssh-config and that is it. The previous
`openshell sandbox restart "${SANDBOX_NAME}" >/dev/null 2>&1 || true`
was silently failing every run, so the wait_for_token_change loop
was polling a still-alive container and timing out at 240s.
Switch Phase 6 to `nemoclaw <name> rebuild --yes`. The image layer
is already cached from Phase 4, so this re-runs container creation
only — close enough to "container restart" for the purpose of the
test. A fresh entrypoint run is exactly what triggers
inject_gateway_token() to write a new token, which is what we
want to verify.
Bumped the wait deadline to 240s to tolerate a cached-image
rebuild plus Ready transition without being so generous that a
real "entrypoint never ran" regression hides.
Pulls in: - a323104 fix(sandbox): include generate-openclaw-config.py in optimized build context (#2565) — same fix as the cherry-pick on this branch (ddb9e15), collapses cleanly. - d392ec0 fix(onboard): clarify preflight messages reference local NIM (#2575) # Conflicts: # test/sandbox-build-context.test.ts
CodeRabbit (review 4182929863) flagged that the atomic-replace flow in inject_gateway_token() fsyncs the file content but never fsyncs the containing directory. A crash between os.replace() and the subsequent .config-hash sha256sum write could leave the directory entry not yet durable on disk — meaning a stale openclaw.json content next to a fresh hash, which would trip verify_config_integrity() on the next boot. Open the parent dir before the rename, fsync it after os.replace, close the fd in finally. This closes the durability window without adding a meaningful boot-path cost (one extra fsync, only on root- mode startup).
|
Cross-linking from #2571 / #3166 — when this lands, please add one more test to the non-root smoke job that #3166 is introducing: Test 4: This is the regression guard for #2482 and is deliberately being held back from #3166 because it can only pass once the runtime token injection in this PR lands. The hook point is Full context and acceptance checkbox tracked on #2571. |
<!-- markdownlint-disable MD041 --> ## Summary Adds the `test-non-root-sandbox-smoke` test from #2571 — a PR-gate job that runs the production image under `-security-opt no-new-privileges` to catch #2472 and #2482 regressions, without OpenShell, NVIDIA_API_KEY, or live inference. ## Related Issue Part of #2571 ## Changes - New `test/e2e-non-root-smoke.sh` (host-side bash, no `openshell`/`nemoclaw` CLI required): - **Test 1** — entrypoint setup chain completes cleanly under `--security-opt no-new-privileges` (regression guard for # 2472; passes a `true` command via the entrypoint's `NEMOCLAW_CMD` exec path so the gateway-launch branch is bypassed and we don't need the OpenShell-managed runtime). - **Test 2** — kernel confirms `NoNewPrivs=1` inside the container (defends the test itself against silent typos in the docker flag). - New job `test-non-root-sandbox-smoke` in `.github/workflows/pr-self-hosted.yaml` — `linux-amd64-cpu4`, `timeout-minutes: 5`, `needs: build-sandbox-images`, reuses the existing `isolation-image` artifact. - Expected results: ``` my-machine@ab1-cdf40-30:~/NemoClaw$ # Run script bash test/e2e-non-root-smoke.sh TEST: 1. Entrypoint setup chain completes under --security-opt no-new-privileges PASS: entrypoint exited 0 under no-new-privileges (#2472 setup chain healthy) TEST: 2. Kernel confirms NoNewPrivs=1 inside container (defends against silent flag typos) PASS: kernel confirms NoNewPrivs=1 ======================================== Results: 2 passed, 0 failed ======================================== ``` - Upcoming plans: - **Test 3** — `openclaw tui` does not error with "Missing gateway auth token" inside a login shell under the same constraint (regression guard for # 2482) after PR #2485 is merged ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification <!-- Check each item you ran and confirmed. Leave unchecked items you skipped. Doc-only changes do not require npm test unless you ran it. --> - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [ ] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Hung Le <hple@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Summary
openclaw tuiworks in both root and non-root sandbox modes (fixes the regression from the reverted feat(security): externalize gateway auth token from openclaw.json #2378)What we learned from the first attempt (#2378 / 51aa6af)
The original externalization (51aa6af) removed the token from
openclaw.jsonentirely, generated it at container startup, and passed it only to the gateway process viaOPENCLAW_GATEWAY_TOKENon the launch line. This brokeopenclaw tuiin three ways:openclaw tuicouldn't read the token from config. The token was cleared fromopenclaw.jsonat build time, so OpenClaw's native config resolution (gateway.auth.token) returned empty.openclaw tuicouldn't inherit the env var.OPENCLAW_GATEWAY_TOKENwas scoped to the gateway process launch line (OPENCLAW_GATEWAY_TOKEN="$token" nohup ... &), not exported to the shell env. Interactive sessions started viaopenshell sandbox connectnever saw it — they source.bashrc/.profileandproxy-env.sh, none of which contained the token.Non-root mode crashed the entrypoint. The
install_configure_guardfunction (added in a separate commit) wrote to.bashrc/.profilewithout|| trueguards. Under Landlock, the DAC check ([ -w file ]) passed but the actual write was blocked, crashing the entrypoint underset -ebefore the gateway ever started. This caused a 5-day outage on non-root sandboxes (Brev Launchable, Spark).Issue #2480 confirmed the symptom:
openclaw tuiinside the sandbox failed with "Missing gateway auth token" because none of the resolution paths had a token.How this PR handles things differently
Root mode (full privilege separation)
inject_gateway_token()generates a fresh token at startup and writes it intoopenclaw.jsonbeforechattr +ilocks the config. This meansopenclaw tuiand all OpenClaw tools can read it natively from the config — no env var dependency required.export_gateway_token()also exportsOPENCLAW_GATEWAY_TOKENto the process env, persists it to.bashrc/.profilemarker blocks, and writes/tmp/nemoclaw-gateway-token.env(a sourced file). This is belt-and-suspenders — the config path is primary, the env var is backup.Non-root mode (no privilege separation)
openclaw.json(root:root 444, and we're running as sandbox user).inject_gateway_token()returns early with a log message.export_gateway_token()detects the empty token inopenclaw.json, generates a fresh one, and delivers it via:export OPENCLAW_GATEWAY_TOKEN="$token"in the current process (gateway inherits it).bashrc/.profile(best-effort, Landlock may block — all writes use|| true)/tmp/nemoclaw-gateway-token.envviaemit_sandbox_sourced_file(root:root 444, tamper-proof)proxy-env.shconditionally sources the/tmptoken file. Even if rc-file writes are Landlock-blocked, connect sessions pick up the token because.bashrcsourcesproxy-env.shwhich sources the token file. This is the key missing piece from the first attempt.OpenClaw 2026.4.9 compatibility
OpenClaw resolves the gateway token in this order:
OPENCLAW_GATEWAY_TOKENenv var →--tokenCLI flag →gateway.auth.tokenin config → secret provider. Both modes now deliver the token via at least one of these paths:gateway.auth.token)OPENCLAW_GATEWAY_TOKEN)/tmpfileSecurity improvements over build-time token
docker historycannot extract it/tmp/nemoclaw-gateway-token.envis created viaemit_sandbox_sourced_file(root:root 444 in root mode) and validated byvalidate_tmp_permissionsFiles changed
Dockerfilesecretsimport, added post-plugin token clearing RUNscripts/nemoclaw-start.shinject_gateway_token(), extendedexport_gateway_token()with non-root fallback +/tmpfile + Landlock guards, added proxy-env.sh token source linesrc/lib/onboard.ts/tmp/nemoclaw-gateway-token.envfallback for host-side token retrieval (non-root)test/nemoclaw-start.test.ts/tmpdeliverydocs/security/best-practices.md.agents/skills/.../best-practices.mdSupersedes reverted commit 51aa6af. Fixes #2480.
Test plan
npm run typecheck:clipassesopenclaw tuiworks inside sandbox after rebuild (root mode)openclaw tuiworks inside sandbox after rebuild (non-root mode)docker historydoes not contain the tokenSummary by CodeRabbit
Security
Documentation
Tests