fix(sandbox): resolve Slack Socket Mode invalid_auth startup crash (#2085)#2151
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds conditional injection of Slack bot/app tokens into sandbox creation env and a startup routine that, on sandbox boot, validates and patches Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant Onboard as "onboard (src/lib/onboard.ts)"
participant SandboxEntrypoint as "sandbox entrypoint (scripts/nemoclaw-start.sh)"
participant FS as "/sandbox/.openclaw/openclaw.json"
participant ConfigHash as "config-hash (sha256)"
participant Gateway as "gateway process"
User->>Onboard: run sandbox create (includes --env SLACK_BOT_TOKEN / SLACK_APP_TOKEN when available)
Onboard->>SandboxEntrypoint: start sandbox
SandboxEntrypoint->>SandboxEntrypoint: apply_cors_override()
SandboxEntrypoint->>SandboxEntrypoint: apply_slack_token_override()
SandboxEntrypoint->>FS: read openclaw.json
SandboxEntrypoint->>FS: replace openshell:resolve:env:... with real SLACK tokens
SandboxEntrypoint->>ConfigHash: recompute/write config-hash
SandboxEntrypoint->>SandboxEntrypoint: unset SLACK_BOT_TOKEN SLACK_APP_TOKEN (root)
SandboxEntrypoint->>Gateway: start gateway with patched config
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
…2085) SLACK_BOT_TOKEN and SLACK_APP_TOKEN were stored as openshell provider credentials but never forwarded as --env to the sandbox create command. The baked openclaw.json uses openshell:resolve:env: placeholders for both tokens; without the real values in the container env those placeholders stay unresolved and Bolt Socket Mode crashes on boot with invalid_auth. Mirrors the existing BRAVE_API_KEY pattern: real token values are injected as positional --env args (visible in the command string) while remaining absent from the openshell subprocess spawn env, which is governed by the separate blockedSandboxEnvNames allowlist. Fixes NVIDIA#2085 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
aed6536 to
816efb7
Compare
…artup Add apply_slack_token_override to nemoclaw-start.sh (after apply_cors_override, before chattr +i) using the same root-only, symlink-guarded, hash-recomputed pattern as apply_model_override. The function substitutes openshell:resolve:env: placeholders in openclaw.json directly, so Bolt's in-process appToken validation (xapp- prefix check) passes before any network connection is opened — without waiting for the L7 proxy. Both SLACK_BOT_TOKEN and SLACK_APP_TOKEN are unset from the process env after patching in the root path so tokens are not visible to the gateway or sandbox user processes. Token format (xoxb-/xapp-) is validated before patching. Signed-off-by: Dongni Yang <dongniy@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Line 935: The script currently calls apply_slack_token_override in a non-root
branch where that function is a no-op, allowing unresolved Slack placeholders to
slip through; update the startup logic so that before entering the non-root
fallback you detect Slack configuration (e.g., presence of SLACK_BOT_TOKEN or
unresolved openshell:resolve:env:* placeholders) and if the process is non-root
(id -u != 0) fail fast with a clear error and exit, or alternatively move the
placeholder resolution earlier so apply_slack_token_override runs while running
as root; reference apply_slack_token_override and the SLACK_BOT_TOKEN
placeholder checks when implementing this change.
- Around line 1074-1076: The Slack tokens are being unset too late (after
sandbox processes are already spawned); move the unset of SLACK_BOT_TOKEN and
SLACK_APP_TOKEN to immediately after apply_slack_token_override() so no
subsequent gosu sandbox child inherits them, or alternatively ensure every gosu
sandbox invocation clears them (e.g., prefix gosu calls with an explicit
environment scrub for SLACK_BOT_TOKEN/SLACK_APP_TOKEN or use env -u for those
vars before calling gosu); update the script locations around
apply_slack_token_override(), the gosu sandbox bash -c invocations and the exec
gosu "${NEMOCLAW_CMD[@]}" usages to apply the scrub consistently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 557aef15-1a63-4504-b978-1e769943e09f
📒 Files selected for processing (2)
scripts/nemoclaw-start.shtest/nemoclaw-start.test.ts
✅ Files skipped from review due to trivial changes (1)
- test/nemoclaw-start.test.ts
…A#2085) - Move `unset SLACK_BOT_TOKEN SLACK_APP_TOKEN` to before the first `gosu sandbox` child in the root path, so sandbox-side processes cannot inherit real token values - Add fail-fast in the non-root path: exit 1 with a clear error when SLACK_BOT_TOKEN is set but placeholder resolution needs root - Apply shfmt and Prettier formatting (case-statement indentation, line-continuation style) to pass CI pre-push hook Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Signed-off-by: Dongni Yang <dongniy@nvidia.com>
…under set -u NEMOCLAW_CONTEXT_WINDOW, NEMOCLAW_MAX_TOKENS, and NEMOCLAW_REASONING are baked into the image ENV and are always non-empty, so apply_model_override's guard always fires. The previous `local model_override="$NEMOCLAW_MODEL_OVERRIDE"` would abort the entrypoint under set -euo pipefail whenever an operator did not pass NEMOCLAW_MODEL_OVERRIDE — causing apply_slack_token_override to never run. Signed-off-by: Dongni Yang <dongniy@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3656-3665: The interactive Slack onboarding never captures
SLACK_APP_TOKEN because setupMessagingChannels() only prompts/saves ch.envKey
(which becomes SLACK_BOT_TOKEN) and the existing env forwarding block only adds
SLACK_APP_TOKEN when it already exists in tokensByEnvKey; modify onboarding to
also prompt/save the app token key (SLACK_APP_TOKEN) during
setupMessagingChannels() and ensure tokensByEnvKey is populated with that key so
the envArgs push (using formatEnvAssignment and envArgs) will include
SLACK_APP_TOKEN even for fresh installs; update setupMessagingChannels(), the
ch.envKey handling, and any token-save logic to persist both SLACK_BOT_TOKEN and
SLACK_APP_TOKEN.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 9ed77f40-0a2d-4414-adc8-8d63386a657b
📒 Files selected for processing (3)
scripts/nemoclaw-start.shsrc/lib/onboard.tstest/nemoclaw-start.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- test/nemoclaw-start.test.ts
- scripts/nemoclaw-start.sh
…etup (NVIDIA#2085) setupMessagingChannels() only prompted for ch.envKey (SLACK_BOT_TOKEN) but never for ch.appTokenEnvKey (SLACK_APP_TOKEN). A fresh interactive Slack onboard left the app token unconfigured, so the --env injection added in the previous commit had nothing to forward. Follows the same pattern as serverIdEnvKey handling (Discord guild ID) and uses the appTokenHelp / appTokenLabel already defined in KNOWN_CHANNELS.slack. Skips the channel if the app token is omitted since Socket Mode requires both tokens. Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dongni Yang <dongniy@nvidia.com>
|
Hey @Dongni-Yang — nice work on this one. The three-part fix is well-structured, the security properties are solid, and the tests are thorough. A few thoughts: One thing to addressPartial resolution when Could you add a warning in if [ -z "${SLACK_APP_TOKEN:-}" ]; then
printf '[channels] Warning: SLACK_BOT_TOKEN is set but SLACK_APP_TOKEN is missing — Socket Mode requires both tokens\n' >&2
fiThat way non-interactive deployments get a clear hint before Bolt fails. A couple of nits (feel free to address while you're in there)
Thanks again for the thorough fix and test coverage! |
Add warning in apply_slack_token_override when the bot token is present but the app token is absent so non-interactive deployments get a clear signal before Bolt crashes with invalid_auth. Resolves merge conflict in test/nemoclaw-start.test.ts with upstream/main: keep NEMOCLAW_REASONING assertion and adopt upstream's shfmt-aware regex for the return-0 guard check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Thanks for the thorough review! Required change addressed — added the SLACK_APP_TOKEN missing warning to apply_slack_token_override (commit 7c19ffa). When SLACK_BOT_TOKEN is present but SLACK_APP_TOKEN is absent, the function now emits: [channels] Warning: SLACK_BOT_TOKEN is set but SLACK_APP_TOKEN is missing — Socket Mode requires both tokens Nits — will address both in a follow-up PR today: json.dump reformatting → switch to regex-based substitution to preserve original openclaw.json formatting |
## Summary Two optional nits deferred from PR #2151 review: - **Regex substitution** — replace `json.load` + `json.dump(cfg, f, indent=2)` in `apply_slack_token_override` with `re.sub` targeting only the two `openshell:resolve:env:SLACK_*` placeholder values in-place, preserving original `openclaw.json` formatting - **Fold non-root fail-fast into function** — when non-root and `SLACK_BOT_TOKEN` is set, `apply_slack_token_override` now returns 1 directly (instead of returning 0 and relying on a separate post-call guard at the call site); `set -euo pipefail` propagates the `return 1` to a script exit with the same code and message as before ## Test plan - [ ] `npx vitest run test/nemoclaw-start.test.ts` — 85/85 pass - [ ] Two tests updated: "only applies override in root mode" (now also asserts `return 1` on non-root+token path) and "fails fast when SLACK_BOT_TOKEN is set in non-root mode" (now checks function body, asserts no separate call-site guard) - [ ] `shfmt -i 2 -ci -bn -l scripts/nemoclaw-start.sh` — no output (clean) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Enhanced Slack token configuration validation to immediately fail with specific error messages when attempting to use bot tokens in non-root mode. * **Tests** * Updated tests to verify the new fail-fast security behavior for Slack token overrides. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Dongni Yang <dongniy@nvidia.com> --------- Signed-off-by: Dongni Yang <dongniy@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Slack Socket Mode crashed on boot with
invalid_authdue to three wiring gaps:onboard.ts—openshell sandbox createnever forwardedSLACK_BOT_TOKEN/SLACK_APP_TOKENas--envpositional args, so the container env received nothing (or stale placeholder strings) instead of real token values.nemoclaw-start.sh— even with tokens in the container env, the bakedopenshell:resolve:env:*placeholders inopenclaw.jsonwere never resolved before OpenClaw read the file. Bolt validatesappTokenstarts withxapp-in-process before any network call, so the L7 proxy never gets a chance to intercept — Bolt rejects the placeholder string immediately withinvalid_auth.setupMessagingChannels()— interactive onboarding only prompted forSLACK_BOT_TOKEN(ch.envKey) and never forSLACK_APP_TOKEN(ch.appTokenEnvKey), so fresh interactive setups left the app token unconfigured, meaning gap 1's--envinjection had nothing to forward.This PR fixes all three gaps:
onboard.tsinjects both tokens as--env KEY=VALUEpositional args at sandbox-create time (same pattern asBRAVE_API_KEY).setupMessagingChannels()now prompts forSLACK_APP_TOKENafterSLACK_BOT_TOKEN, using the existingappTokenHelp/appTokenLabelfields inKNOWN_CHANNELS.slack. Skips the channel if the app token is omitted (Socket Mode requires both).nemoclaw-start.shaddsapply_slack_token_override, which runs as root beforechattr +ilocksopenclaw.json. It substitutes theopenshell:resolve:env:*placeholders with real token values read from the container env, then unsets the env vars before the firstgosu sandboxchild so tokens never reach sandbox or gateway processes. In the non-root fallback path, startup now fails fast with a clear error whenSLACK_BOT_TOKENis set (placeholder resolution requires root; Bolt would crash anyway).Together these changes make the full resolution path deterministic and independent of OpenShell#894.
This PR also fixes a pre-existing bug in
apply_model_override:NEMOCLAW_CONTEXT_WINDOW,NEMOCLAW_MAX_TOKENS, andNEMOCLAW_REASONINGare baked into the imageENVand are always non-empty, so the function's guard always fired — but the code then read$NEMOCLAW_MODEL_OVERRIDEwithout a:-fallback, aborting the entrypoint underset -euo pipefailwhenever only a context-window or reasoning override was set. Fixed by using${NEMOCLAW_MODEL_OVERRIDE:-}.Related Issue
Fixes #2085
How the fixes work together
nemoclaw onboardinteractivelysetupMessagingChannels()prompts forSLACK_BOT_TOKENthenSLACK_APP_TOKEN; both saved to credentialsopenshell sandbox createis calledonboard.tspasses--env SLACK_BOT_TOKEN=xoxb-…and--env SLACK_APP_TOKEN=xapp-…as positional args; tokens land in the container envapply_slack_token_overridevalidates prefixes, patchesopenclaw.jsonplaceholders in-place, recomputes sha256 hashgosu sandboxchildunset SLACK_BOT_TOKEN SLACK_APP_TOKENremoves tokens from process envopenclaw.jsonxoxb-/xapp-values, in-process validation passes, Socket Mode connectsSecurity properties
openshell sandbox createpositional args (never in the subprocess spawn env, governed bybuildSubprocessEnv()'s allowlist).apply_slack_token_overrideis root-only, refuses symlinks, validates token prefixes (xoxb-/xapp-), and recomputes the integrity hash.gosu sandboxchild (not just before gateway start), so no sandbox-side process can inherit them from the env.SLACK_BOT_TOKENis set, rather than silently starting with unresolved placeholders.Changes
src/lib/onboard.ts: after theBRAVE_API_KEYblock, pushSLACK_BOT_TOKENand (when present)SLACK_APP_TOKENas--envpositional args inopenshell sandbox create. InsetupMessagingChannels(), add prompt block forch.appTokenEnvKey(parallel toserverIdEnvKeyhandling for Discord), so fresh interactive Slack onboarding captures both tokens.test/onboard.test.ts: addSLACK_APP_TOKENto child-script env; assert both tokens appear in the create command string; assert neither leaks into the spawn env.scripts/nemoclaw-start.sh: addapply_slack_token_override(root-only, symlink-guarded, prefix-validated, hash-recomputed); call it in both paths; moveunset SLACK_BOT_TOKEN SLACK_APP_TOKENto before firstgosu sandboxchild; add non-root fail-fast whenSLACK_BOT_TOKENis set; apply shfmt/Prettier formatting. Also fix pre-existing unbound variable inapply_model_override: use${NEMOCLAW_MODEL_OVERRIDE:-}soset -euo pipefaildoes not abort the entrypoint whenNEMOCLAW_CONTEXT_WINDOW/NEMOCLAW_REASONINGis set butNEMOCLAW_MODEL_OVERRIDEis not.test/nemoclaw-start.test.ts: adddescribeblock (11 tests) covering function definition, call order in both paths, no-op/root-only/symlink guards, prefix validation, missing-app-token warning, hash recompute, placeholder resolution, pre-gosu env scrub, and non-root fail-fast.Type of Change
Verification
npx prek run --all-filespassesnpm testpasses (onboard: 129/129; nemoclaw-start: 85/85; plugin: 354/354)make docsbuilds without warnings (doc changes only)AI Disclosure
Signed-off-by: Dongni Yang dongniy@nvidia.com