fix(security): use providers for messaging credential injection#1081
Conversation
Create OpenShell providers for Discord, Slack, and Telegram tokens during onboard instead of passing them as raw environment variables into the sandbox. The L7 proxy rewrites Authorization headers (Bearer/Bot) with real secrets at egress, so sandbox processes never see the actual token values. - Discord and Slack tokens flow through provider/placeholder system - Telegram provider is created for credential storage but host-side bridge still reads from host env (Telegram uses URL-path auth that the proxy can't rewrite yet) - Raw env var injection removed for Discord and Slack
|
Important Review skippedThis PR was authored by the user configured for CodeRabbit reviews. CodeRabbit does not review PRs authored by this user. It's recommended to use a dedicated user account to post CodeRabbit review feedback. ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 Walkthrough📝 Walkthrough🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
bin/lib/onboard.js (1)
1816-1819: Consider adding a defensive test for credential deletion.Per the test at
test/credential-exposure.test.js:76-84, the current tests verify that credentials aren't pushed toenvArgs, but don't assert that thedeletestatements exist. This means if the deletion logic were accidentally removed, the test would still pass.Consider adding a test that explicitly verifies the delete statements are present, or that
sandboxEnvpassed tostreamSandboxCreateexcludes these keys.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1816 - 1819, Add a defensive test that ensures the credential keys are removed before creating the sandbox: call or simulate the same code path that builds sandboxEnv (the code that copies process.env into sandboxEnv and runs delete for NVIDIA_API_KEY, DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN) and assert that sandboxEnv no longer has those properties (or mock/spy the call to streamSandboxCreate and assert the passed env object excludes those keys). Locate the logic around sandboxEnv and the call to streamSandboxCreate and add a unit test that explicitly checks sandboxEnv[NVIDIA_API_KEY], sandboxEnv[DISCORD_BOT_TOKEN], and sandboxEnv[SLACK_BOT_TOKEN] are undefined (or that streamSandboxCreate received an env object without those keys).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1817-1819: The code deletes Discord and Slack tokens from the
sandboxEnv but misses removing TELEGRAM_BOT_TOKEN, so add a deletion for
sandboxEnv.TELEGRAM_BOT_TOKEN in the same block (the same place where delete
sandboxEnv.NVIDIA_API_KEY, delete sandboxEnv.DISCORD_BOT_TOKEN, delete
sandboxEnv.SLACK_BOT_TOKEN are called) before calling streamSandboxCreate;
ensure sandboxEnv no longer contains TELEGRAM_BOT_TOKEN when passed into
streamSandboxCreate to prevent leaking the token into the sandbox environment.
---
Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 1816-1819: Add a defensive test that ensures the credential keys
are removed before creating the sandbox: call or simulate the same code path
that builds sandboxEnv (the code that copies process.env into sandboxEnv and
runs delete for NVIDIA_API_KEY, DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN) and assert
that sandboxEnv no longer has those properties (or mock/spy the call to
streamSandboxCreate and assert the passed env object excludes those keys).
Locate the logic around sandboxEnv and the call to streamSandboxCreate and add a
unit test that explicitly checks sandboxEnv[NVIDIA_API_KEY],
sandboxEnv[DISCORD_BOT_TOKEN], and sandboxEnv[SLACK_BOT_TOKEN] are undefined (or
that streamSandboxCreate received an env object without those keys).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 56f004ea-2e71-4082-85da-58cba9169f7b
📒 Files selected for processing (1)
bin/lib/onboard.js
Add verifyProviderExists() check post-sandbox-creation to confirm messaging credential providers are actually in the gateway. Warns with remediation steps if a provider is missing.
…aging tokens Verify that messaging credentials flow through the provider system: - Provider create commands issued with correct credential key names - Sandbox create includes --provider flags for all detected tokens - Real token values never appear in sandbox create command or env
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
bin/lib/onboard.js (1)
1815-1824:⚠️ Potential issue | 🔴 Critical
sandboxEnvis still a secret leak path.
createSandbox()runs aftersetupInference(), andsetupInference()hydrates the selected inference credential back intoprocess.env. Starting from{ ...process.env }and deleting only three names still forwards rawOPENAI_API_KEY,ANTHROPIC_API_KEY,GEMINI_API_KEY,COMPATIBLE_API_KEY,COMPATIBLE_ANTHROPIC_API_KEY, andTELEGRAM_BOT_TOKENwhen present, which contradicts the “non-sensitive env vars only” comment.Proposed fix
- const sandboxEnv = { ...process.env }; - delete sandboxEnv.NVIDIA_API_KEY; - delete sandboxEnv.DISCORD_BOT_TOKEN; - delete sandboxEnv.SLACK_BOT_TOKEN; + const blockedSandboxEnvNames = new Set([ + "NVIDIA_API_KEY", + "OPENAI_API_KEY", + "ANTHROPIC_API_KEY", + "GEMINI_API_KEY", + "COMPATIBLE_API_KEY", + "COMPATIBLE_ANTHROPIC_API_KEY", + "DISCORD_BOT_TOKEN", + "SLACK_BOT_TOKEN", + "TELEGRAM_BOT_TOKEN", + ]); + const sandboxEnv = Object.fromEntries( + Object.entries(process.env).filter(([name]) => !blockedSandboxEnvNames.has(name)) + );An explicit allowlist would be safer long-term, but the current blocklist is incomplete even for credentials this file already hydrates itself.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1815 - 1824, The sandboxEnv assignment leaks secret keys because it clones process.env then deletes a few tokens; change it to build sandboxEnv from an explicit allowlist of non-sensitive variables instead of spreading process.env. In the block that currently defines envArgs and sandboxEnv (symbols: formatEnvAssignment, envArgs, sandboxEnv), replace the copy-and-delete approach with a whitelist array of safe env names and populate sandboxEnv only from those names (ensuring formatEnvAssignment still sets CHAT_UI_URL), and remove reliance on setupInference() hydrating secrets into process.env when constructing sandboxEnv.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1789-1791: The Telegram branch creates the provider by calling
upsertProvider() but doesn't populate the host env from stored credentials, so
legacy bridge won't see TELEGRAM_BOT_TOKEN; call hydrateCredentialEnv() (the
helper that writes stored credentials into process.env) before creating the
Telegram provider and before any early return that creates/upserts the provider
(i.e., add hydrateCredentialEnv() calls in the Telegram handling branches around
where upsertProvider() is invoked) so stored tokens are applied to process.env
prior to provider creation.
- Around line 480-483: The current verifyProviderExists function only checks
output text for "not found" which can hide command failures or an unbound
sandbox; change verifyProviderExists (and the duplicate call site around the
other provider-check code) to rely on the command exit status from
runCaptureOpenshell (or its underlying spawn API) instead of string matching,
and additionally validate the provider is attached to the target sandbox by
running a sandbox-scoped query (e.g., list/inspect the sandbox's providers) and
confirming the provider name appears in that result; update logic in
verifyProviderExists to return false on non-zero exit or when the
sandbox-attached provider list does not include the given name.
---
Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 1815-1824: The sandboxEnv assignment leaks secret keys because it
clones process.env then deletes a few tokens; change it to build sandboxEnv from
an explicit allowlist of non-sensitive variables instead of spreading
process.env. In the block that currently defines envArgs and sandboxEnv
(symbols: formatEnvAssignment, envArgs, sandboxEnv), replace the copy-and-delete
approach with a whitelist array of safe env names and populate sandboxEnv only
from those names (ensuring formatEnvAssignment still sets CHAT_UI_URL), and
remove reliance on setupInference() hydrating secrets into process.env when
constructing sandboxEnv.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: aa829268-dd6c-401d-87b5-6321a6d7a4e4
📒 Files selected for processing (1)
bin/lib/onboard.js
- Use exit code instead of string matching in verifyProviderExists() - Use hydrateCredentialEnv() for Telegram so host-side bridge can find the token from stored credentials - Replace individual delete statements with a blocklist that covers all credential env vars (NVIDIA, OpenAI, Anthropic, Gemini, compatible endpoints, Discord, Slack, Telegram)
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
bin/lib/onboard.js (1)
480-483:⚠️ Potential issue | 🟠 MajorThis still gives false confidence about provider attachment.
verifyProviderExists()only runsopenshell provider get <name>, so it proves the provider exists globally, not that the--providerflag actually attached it to this sandbox. The comment on Line 1920 says the opposite. Because this goes throughrunOpenshell(), the rawprovider getoutput/errors will also hit onboarding logs before your custom warning. Please make this probe silent and inspect sandbox-attached providers instead; if the CLI cannot do that yet, at least rename this to an existence check so it does not imply attachment validation.Also applies to: 1920-1927
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/onboard.test.js`:
- Around line 1032-1144: The test must also assert that sensitive env vars are
not forwarded in the sandbox env: after locating payload.commands and the
createCommand entry (as you already do), add assertions that createCommand.env
is either null/undefined or does not contain the keys/values for
DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN, TELEGRAM_BOT_TOKEN (and the new API-key
blocklist key if applicable); reference the existing payload.commands array and
the createCommand variable and assert createCommand.env does not include the
secret names or the literal token values used earlier in the harness.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 37c32b17-312a-4a82-85ed-8918760e22b7
📒 Files selected for processing (2)
bin/lib/onboard.jstest/onboard.test.js
The test pattern-matched on the old `{ ...process.env }` spread.
Update to verify the blocklist approach that strips all credential
env vars from sandboxEnv.
|
Reviewed the diff — this looks solid. A few observations from the OpenClaw side:
Happy to help test this against real Discord/Slack/Telegram bridges if useful. |
…tial-providers Signed-off-by: Aaron Erickson <aerickson@nvidia.com> # Conflicts: # bin/lib/onboard.js
Rename verifyProviderExists to providerExistsInGateway to clarify it only checks gateway-level existence, not sandbox attachment. Add BEDROCK_API_KEY to the credential blocklist (merged via #963). Add env var leakage assertions to the messaging provider test. Add JSDoc to buildProviderArgs, upsertProvider, and providerExistsInGateway. Increase messaging provider test timeout for post-merge code paths. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/credential-exposure.test.js (1)
79-85: Test assertions are loose but follow existing patterns.These regex checks verify the strings appear somewhere in the source file, not specifically within the
blockedSandboxEnvNamesSet. This could pass if a credential name only appears in a comment while removed from the blocklist.Consider tightening the assertions to verify the blocklist structure more precisely:
🔧 Suggested improvement
- // sandboxEnv must be built with a blocklist that strips all credential env vars - expect(src).toMatch(/blockedSandboxEnvNames/); - expect(src).toMatch(/NVIDIA_API_KEY/); - expect(src).toMatch(/BEDROCK_API_KEY/); - expect(src).toMatch(/DISCORD_BOT_TOKEN/); - expect(src).toMatch(/SLACK_BOT_TOKEN/); - expect(src).toMatch(/TELEGRAM_BOT_TOKEN/); + // sandboxEnv must be built with a blocklist that strips all credential env vars + expect(src).toMatch(/blockedSandboxEnvNames\s*=\s*new\s+Set\(\[/); + expect(src).toMatch(/blockedSandboxEnvNames.*"NVIDIA_API_KEY"/s); + expect(src).toMatch(/blockedSandboxEnvNames.*"BEDROCK_API_KEY"/s); + expect(src).toMatch(/blockedSandboxEnvNames.*"DISCORD_BOT_TOKEN"/s); + expect(src).toMatch(/blockedSandboxEnvNames.*"SLACK_BOT_TOKEN"/s); + expect(src).toMatch(/blockedSandboxEnvNames.*"TELEGRAM_BOT_TOKEN"/s);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/credential-exposure.test.js` around lines 79 - 85, The current assertions only check that credential names exist somewhere in the source; tighten them to verify those names are actually inside the blockedSandboxEnvNames set. Update the test around blockedSandboxEnvNames to assert the Set initializer/contents include each credential (e.g., match a pattern where "blockedSandboxEnvNames" is assigned to a Set/array literal and the literal text contains "NVIDIA_API_KEY", "BEDROCK_API_KEY", etc.), or parse the src to extract the blockedSandboxEnvNames value and assert the Set contains those keys; target the "blockedSandboxEnvNames" symbol so the credentials are validated inside the blocklist rather than anywhere in the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@test/credential-exposure.test.js`:
- Around line 79-85: The current assertions only check that credential names
exist somewhere in the source; tighten them to verify those names are actually
inside the blockedSandboxEnvNames set. Update the test around
blockedSandboxEnvNames to assert the Set initializer/contents include each
credential (e.g., match a pattern where "blockedSandboxEnvNames" is assigned to
a Set/array literal and the literal text contains "NVIDIA_API_KEY",
"BEDROCK_API_KEY", etc.), or parse the src to extract the blockedSandboxEnvNames
value and assert the Set contains those keys; target the
"blockedSandboxEnvNames" symbol so the credentials are validated inside the
blocklist rather than anywhere in the file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d8c09f02-0c95-4caa-993e-7bd03205125f
📒 Files selected for processing (3)
bin/lib/onboard.jstest/credential-exposure.test.jstest/onboard.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- test/onboard.test.js
…tial-providers Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Add direct in-process tests for buildProviderArgs (all branch paths), formatEnvAssignment, compactText, getNavigationChoice, parsePolicyPresetEnv, summarizeCurlFailure, and summarizeProbeFailure. Add subprocess tests for upsertProvider (create, update fallback, and both-fail error paths), providerExistsInGateway (true/false), and hydrateCredentialEnv (null input, stored credential, missing key). Export newly-tested functions from onboard.js. CLI function coverage: 33.24% (threshold 32%). Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
test/onboard.test.js (1)
1372-1504: Add a failing-upsert path test forcreateSandbox()provider attachment.Current coverage is happy-path only. Please add a case where one messaging provider upsert returns
{ ok: false }and assert sandbox creation either aborts or skips attaching that provider. This will guard against silent broken wiring regressions.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/onboard.test.js` around lines 1372 - 1504, Add a new test case (another it block) that calls createSandbox but mocks the provider upsert path to simulate a failure: in the injected script override runner.run or runner.runCapture (the same symbol used in the existing messaging test) so that the provider create/upsert command returns a payload indicating failure (e.g., a string or status that your production code interprets as { ok: false }), then assert that createSandbox either aborts (nonzero exit) or that the provider create command is not followed by attaching that provider to the sandbox (inspect the captured commands and spawn env like in the existing test). Reuse the same helpers in the script (runner.run, runner.runCapture, registry.registerSandbox, childProcess.spawn, createSandbox) and assert the expected behavior for the failing-upsert case (sandbox creation skipped or provider absent).bin/lib/onboard.js (1)
915-917: Suppress CLI probe noise in provider existence checks.On Line 916, this uses
runOpenshell()with default stdio, so negative probes can emit raw CLI errors unnecessarily. Consider piping stdio for cleaner onboarding output.💡 Proposed tweak
function providerExistsInGateway(name) { - const result = runOpenshell(["provider", "get", name], { ignoreError: true }); + const result = runOpenshell(["provider", "get", name], { + ignoreError: true, + stdio: ["ignore", "pipe", "pipe"], + }); return result.status === 0; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 915 - 917, The providerExistsInGateway function currently calls runOpenshell(["provider", "get", name], { ignoreError: true }) which leaves stdio at the default and allows CLI error output to leak; change the call in providerExistsInGateway to pass a stdio option that pipes/suppresses subprocess output (e.g., include stdio: "pipe" or equivalent in the options object) so negative probes don't emit raw CLI errors while preserving ignoreError: true and the existing status check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 2223-2240: The upsertProvider calls for "discord-bridge",
"slack-bridge", and "telegram-bridge" may fail but their results are ignored,
yet the code still pushes the provider names into messagingProviders and later
appends them to createArgs; modify the logic around upsertProvider (the three
calls) to capture its return/throw, verify success (e.g., truthy result or no
thrown error) before pushing the provider name into messagingProviders and
adding "--provider" to createArgs, and ensure errors from upsertProvider are
handled (log/throw/skip) so failed upserts do not result in invalid --provider
flags; relevant symbols: upsertProvider, messagingProviders, createArgs,
getCredential, hydrateCredentialEnv.
---
Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 915-917: The providerExistsInGateway function currently calls
runOpenshell(["provider", "get", name], { ignoreError: true }) which leaves
stdio at the default and allows CLI error output to leak; change the call in
providerExistsInGateway to pass a stdio option that pipes/suppresses subprocess
output (e.g., include stdio: "pipe" or equivalent in the options object) so
negative probes don't emit raw CLI errors while preserving ignoreError: true and
the existing status check.
In `@test/onboard.test.js`:
- Around line 1372-1504: Add a new test case (another it block) that calls
createSandbox but mocks the provider upsert path to simulate a failure: in the
injected script override runner.run or runner.runCapture (the same symbol used
in the existing messaging test) so that the provider create/upsert command
returns a payload indicating failure (e.g., a string or status that your
production code interprets as { ok: false }), then assert that createSandbox
either aborts (nonzero exit) or that the provider create command is not followed
by attaching that provider to the sandbox (inspect the captured commands and
spawn env like in the existing test). Reuse the same helpers in the script
(runner.run, runner.runCapture, registry.registerSandbox, childProcess.spawn,
createSandbox) and assert the expected behavior for the failing-upsert case
(sandbox creation skipped or provider absent).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8c66deaf-e3f2-435f-925a-6b11b2785952
📒 Files selected for processing (2)
bin/lib/onboard.jstest/onboard.test.js
…tial-providers Signed-off-by: Aaron Erickson <aerickson@nvidia.com> # Conflicts: # bin/lib/onboard.js # test/onboard.test.js
…tial-providers Signed-off-by: Aaron Erickson <aerickson@nvidia.com> # Conflicts: # bin/lib/onboard.js
When openshell sandbox create exits with SSH 255 after printing "Created sandbox:", NemoClaw previously hard-exited instead of checking whether the sandbox reached Ready state. This was a regression from the create-stream extraction (#1516) combined with the messaging provider migration path (#1081, #1527) that forces sandbox recreation. Two fixes: - streamSandboxCreate: do one final readyCheck on non-zero close to catch the race where the sandbox is already Ready when SSH dies. - onboard.js: when failure is sandbox_create_incomplete, fall through to the existing ready-wait loop (60s polling) instead of exiting. Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…1598) ## Summary - When `openshell sandbox create` exits with SSH 255 after printing "Created sandbox:", NemoClaw now treats this as recoverable instead of hard-exiting - Added a final `readyCheck` in `streamSandboxCreate` on non-zero close to catch the race where the sandbox is already Ready when SSH dies - In `onboard.js`, `sandbox_create_incomplete` failures now fall through to the existing 60s ready-wait loop instead of calling `process.exit()` ## Root cause Regression from #1516 (create-stream extraction) combined with #1081/#1527 (messaging provider migration forcing sandbox recreation). The create stream returns non-zero after "Created sandbox:" and NemoClaw exits before checking if the sandbox reached Ready state. ## Test plan - [x] New unit tests: stream recovers when readyCheck is true at close time (SSH 255) - [x] New unit test: stream still returns non-zero when readyCheck is false at close time - [x] All existing `sandbox-create-stream` tests pass (7/7) - [x] All onboard integration tests pass (83/83) - [x] All src unit tests pass (538/538) - [ ] Manual: trigger SSH 255 during sandbox create and verify recovery <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Enhanced sandbox creation error handling with improved failure classification and recovery messaging. * Improved sandbox readiness detection to prevent unnecessary creation failures when the sandbox is operational. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Auto-merged: openclaw-sandbox.yaml, package.json, scripts/debug.sh Conflict resolved: - scripts/telegram-bridge.js: NVIDIA deleted it in NVIDIA#1081 (provider-based messaging), but we keep our version because the host-side bridge is still actively running in our PM2 (uptime >100m) and referenced by scripts/monitor.js health checks. Incoming upstream highlights: - Security: SSRF validation coverage, shell-quoted sandboxName, DELETE wildcard restrictions, SSH 255 recovery - Runtime: Podman support (macOS/Linux), signal handlers in start script - CLI: nvapi- key validation, reuse-sandbox prompts, credentials reset - Policy presets: brew, HuggingFace router, sentry scope - Docs: security credential storage, OpenShell lifecycle split
Auto-merged: openclaw-sandbox.yaml, package.json, scripts/debug.sh Conflict resolved: - scripts/telegram-bridge.js: NVIDIA deleted it in NVIDIA#1081 (provider-based messaging), but we keep our version because the host-side bridge is still actively running in our PM2 (uptime >100m) and referenced by scripts/monitor.js health checks. Incoming upstream highlights: - Security: SSRF validation coverage, shell-quoted sandboxName, DELETE wildcard restrictions, SSH 255 recovery - Runtime: Podman support (macOS/Linux), signal handlers in start script - CLI: nvapi- key validation, reuse-sandbox prompts, credentials reset - Policy presets: brew, HuggingFace router, sentry scope - Docs: security credential storage, OpenShell lifecycle split
…IA#1081) ## Summary Use the OpenShell provider system for messaging credential injection instead of raw env var passthrough. Discord, Slack, and Telegram tokens now flow through the placeholder/proxy pipeline — sandbox processes never see real values. The host-side Telegram bridge is removed; messaging channels are baked into `openclaw.json` at image build time via `NEMOCLAW_MESSAGING_CHANNELS_B64`, and the L7 proxy rewrites placeholders with real secrets at egress — no runtime config patching needed. Signed-off-by: Aaron Erickson <aerickson@nvidia.com> ## Related Issues Fixes NVIDIA#1109 Fixes NVIDIA#616 Fixes NVIDIA#1310 Supersedes NVIDIA#617 ## Changes - **`bin/lib/onboard.js`** — Create `generic` providers for Discord, Slack, and Telegram tokens via `upsertProvider()`. Attach to sandbox via `--provider` flags. Replace individual env var deletes with a comprehensive blocklist. Bake messaging channel config into `openclaw.json` at build time. Collect Telegram user ID for DM allowlisting. - **`Dockerfile`** — Accept `NEMOCLAW_MESSAGING_CHANNELS_B64` build arg and inject channel config into `openclaw.json` at image build time. - **`scripts/nemoclaw-start.sh`** — Remove dead runtime `openclaw.json` patching from `configure_messaging_channels`. Allow CLI clients in auto-pair watcher. - **`nemoclaw/src/lib/services.ts`** — Remove stale `telegram-bridge` spawn. - **`scripts/telegram-bridge.js`** — Removed (replaced by native OpenClaw channels via providers). - **`test/onboard.test.js`** — Verify provider create commands, `--provider` flags on sandbox create, and that real token values never appear in the sandbox create command. - **`test/credential-exposure.test.js`** — Updated for expanded blocklist coverage. - **`test/e2e/messaging-providers.test.sh`** — New E2E test: provider creation, sandbox attachment, DM allowlisting. ## Thanks - @sayalinvidia — tested Discord end-to-end, diagnosed that Landlock makes `openclaw.json` immutable at runtime in non-root mode, and proposed the build-time bake approach via `NEMOCLAW_MESSAGING_CHANNELS_B64` that made this work (PR NVIDIA#1501) - @mercl-lau — found the stale `telegram-bridge` spawn in `services.ts` that silently crashed after the bridge script was removed - @stevenrick — tested Telegram on Brev, independently confirmed the Landlock issue, and found that the auto-pair watcher rejected CLI clients (also opened NVIDIA#1496) ## Type of Change - [x] Code change for a new feature, bug fix, or refactor. - [ ] Code change with doc updates. - [ ] Doc only. Prose changes without code sample modifications. - [ ] Doc only. Includes code sample changes. ## Testing - [x] `npx prek run --all-files` passes (or equivalently `make check`). - [x] `npm test` passes. - [ ] `make docs` builds without warnings. (for doc-only changes) - [x] E2E validated with real bot tokens on Brev instance ## Checklist ### General - [x] I have read and followed the [contributing guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md). - [ ] I have read and followed the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md). (for doc-only changes) ### Code Changes - [x] Formatters applied — `npx prek run --all-files` auto-fixes formatting (or `make format` for targeted runs). - [x] Tests added or updated for new or changed behavior. - [x] No secrets, API keys, or credentials committed. - [ ] Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs). ### Doc Changes N/A --------- Signed-off-by: Aaron Erickson <aerickson@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com> Co-authored-by: sayalinvidia <sayalinvidia@users.noreply.github.com>
## Summary - Reverts the unrelated `CLAUDE.md` modification that was inadvertently included in the security credential injection PR (NVIDIA#1081) - The added "Claude Behavior Rules" section was never reviewed as part of that security change and was not requested ## Test plan - [x] Verify `CLAUDE.md` matches its pre-NVIDIA#1081 state - [x] No other files affected <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Removed obsolete internal documentation entries. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
…VIDIA#1598) ## Summary - When `openshell sandbox create` exits with SSH 255 after printing "Created sandbox:", NemoClaw now treats this as recoverable instead of hard-exiting - Added a final `readyCheck` in `streamSandboxCreate` on non-zero close to catch the race where the sandbox is already Ready when SSH dies - In `onboard.js`, `sandbox_create_incomplete` failures now fall through to the existing 60s ready-wait loop instead of calling `process.exit()` ## Root cause Regression from NVIDIA#1516 (create-stream extraction) combined with NVIDIA#1081/NVIDIA#1527 (messaging provider migration forcing sandbox recreation). The create stream returns non-zero after "Created sandbox:" and NemoClaw exits before checking if the sandbox reached Ready state. ## Test plan - [x] New unit tests: stream recovers when readyCheck is true at close time (SSH 255) - [x] New unit test: stream still returns non-zero when readyCheck is false at close time - [x] All existing `sandbox-create-stream` tests pass (7/7) - [x] All onboard integration tests pass (83/83) - [x] All src unit tests pass (538/538) - [ ] Manual: trigger SSH 255 during sandbox create and verify recovery <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Enhanced sandbox creation error handling with improved failure classification and recovery messaging. * Improved sandbox readiness detection to prevent unnecessary creation failures when the sandbox is operational. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
`nemoclaw stop` only killed the host-side `cloudflared` process but left the OpenClaw gateway (and its Telegram/Discord/Slack messaging channels) running inside the sandbox. Bots continued responding to messages after stop. After PR #1081 moved messaging bridges into the sandbox, no stop mechanism was added to replace the old host-side process kill. Add `stopSandboxChannels()` to `services.ts` that execs into the sandbox via `openshell sandbox exec <name> -- pkill -TERM -f "openclaw gateway"`. The gateway's own signal handler (`cleanup()` in nemoclaw-start.sh) forwards SIGTERM to child processes, which stops all messaging channel polling. `stopAll()` now calls `stopSandboxChannels()` before stopping host-side services. When the sandbox name is unavailable or openshell is not installed, it warns and continues — host-side cleanup still proceeds. - [x] `npx vitest run --project cli src/lib/services.test.ts` — 14 existing tests pass - [x] `npx vitest run --project cli src/lib/services-sandbox.test.ts` — 9 new tests pass - [x] `npx vitest run --project cli src/lib/services-command.test.ts` — 4 tests pass - [x] `npm test` — full suite (1635 tests) passes - [x] `npm run typecheck:cli` — clean - [ ] Manual: `nemoclaw stop` → Telegram/Discord bots stop responding Signed-off-by: Claude <noreply@anthropic.com>
`nemoclaw stop` only killed the host-side `cloudflared` process but left the OpenClaw gateway (and its Telegram/Discord/Slack messaging channels) running inside the sandbox. Bots continued responding to messages after stop. After PR #1081 moved messaging bridges into the sandbox, no stop mechanism was added to replace the old host-side process kill. Add `stopSandboxChannels()` to `services.ts` that execs into the sandbox via `openshell sandbox exec <name> -- pkill -TERM -f "openclaw gateway"`. The gateway's own signal handler (`cleanup()` in nemoclaw-start.sh) forwards SIGTERM to child processes, which stops all messaging channel polling. `stopAll()` now calls `stopSandboxChannels()` before stopping host-side services. When the sandbox name is unavailable or openshell is not installed, it warns and continues — host-side cleanup still proceeds. - [x] `npx vitest run --project cli src/lib/services.test.ts` — 14 existing tests pass - [x] `npx vitest run --project cli src/lib/services-sandbox.test.ts` — 9 new tests pass - [x] `npx vitest run --project cli src/lib/services-command.test.ts` — 4 tests pass - [x] `npm test` — full suite (1635 tests) passes - [x] `npm run typecheck:cli` — clean - [ ] Manual: `nemoclaw stop` → Telegram/Discord bots stop responding Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com>
## Summary Fixes #2103 where `nemoclaw stop` lied — it only stopped the cloudflared tunnel, not Telegram/Discord bridges, which moved inside the sandbox in #1081 and couldn't be paused without destroying the sandbox. - **Rename** `nemoclaw stop/start` → `nemoclaw tunnel stop/start`. Legacy `stop`/`start` kept as deprecated aliases that print a warning and delegate. - **New** `nemoclaw <sandbox> channels stop <channel>` and `channels start <channel>`. Marks the channel disabled in the per-sandbox registry, rebuilds, and the onboard code skips registering that bridge with the gateway. Credentials stay in the keychain — non-destructive, unlike `channels remove`. ## Why rebuild and not `openshell sandbox exec` Aaron floated the shields-down exec-into-sandbox pattern in the #2103 thread. On closer look, that path doesn't work cleanly for this: - `openclaw.json` is read-only at runtime (Landlock + `root:root 444`), and NemoClaw's sandbox entrypoint explicitly intercepts `openclaw channels add/remove` from inside with a host-side-only error. - `openclaw gateway stop` is too coarse — kills every channel plus agent comms. - SIGTERM on the bridge process works but doesn't persist across a gateway restart. - No per-channel runtime CLI exists in OpenClaw today. So this PR mirrors `channels remove`'s config-patch + rebuild plumbing. A rebuild takes a few minutes and preserves workspace data, which is strictly better than the current "delete the sandbox" workaround. ## Note on #2103's policy-remove claim #2103's issue body said "No remove/delete/revoke logic exists anywhere in `policies.ts`." That's stale — `removePreset()` at `src/lib/policies.ts:278` and the `nemoclaw <sandbox> policy-remove <preset>` CLI dispatch at `src/nemoclaw.ts:3088` already exist. No code change needed for that line item. ## Test plan - [x] `npx vitest run --project cli` — 1740 passed, 7 skipped (3 new registry tests for `setChannelDisabled`/`getDisabledChannels`) - [x] `npm run typecheck:cli` clean - [x] Manual: help text renders new commands, `nemoclaw stop` prints deprecation + still works, `nemoclaw tunnel stop` works directly, `nemoclaw tunnel bogus` prints usage - [x] Manual: `nemoclaw <sandbox> channels stop telegram` with fake registry → registry gains `disabledChannels: ["telegram"]`; `channels start telegram` removes the entry; dry-run honored; unknown channel rejected - [ ] Not yet validated against a live sandbox rebuild (the onboard-side filter that excludes `disabledChannels` from `messagingTokenDefs` is a one-liner, but hasn't been E2E-tested end-to-end) 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added `tunnel start|stop` command for service management * Added `channels start|stop` commands to enable/disable specific messaging channels in sandboxes * **Deprecations** * Marked `nemoclaw start|stop` as deprecated; use `tunnel start|stop` instead * **Tests** * Added test coverage for channel enable/disable functionality <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
<!-- markdownlint-disable MD041 --> ## Summary `nemoclaw stop` only killed the host-side `cloudflared` process but left the OpenClaw gateway (and its Telegram/Discord/Slack messaging channels) running inside the sandbox. After PR #1081 moved bridges into the sandbox, no stop mechanism was added. This PR adds `stopSandboxChannels()` which execs into the sandbox via `openshell sandbox exec` to SIGTERM the gateway process, stopping all messaging channel polling. ## Related Issue Fixes #1825 ## Changes - Add `stopSandboxChannels()` to `src/lib/services.ts` — resolves `openshell`, execs `pkill -TERM -f "openclaw gateway"` inside the sandbox, and handles all pkill exit codes (0 = stopped, 1 = already stopped, >1 = unreachable) - Update `stopAll()` to call `stopSandboxChannels()` before stopping host-side services; reads sandbox name from opts, `NEMOCLAW_SANDBOX`, or `SANDBOX_NAME` env vars; warns when no sandbox name is available - Import `spawnSync` from `node:child_process` and `resolveOpenshell` from `./resolve-openshell` - Add `src/lib/services-sandbox.test.ts` with 9 tests covering: successful gateway stop, pkill exit 1 (no match), unreachable sandbox, null status (timeout), openshell not found, stopAll with sandbox name, stopAll without sandbox name, cloudflared cleanup on exec failure, and env var fallback ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Introduced sandbox-aware shutdown flow: attempts in-sandbox gateway cleanup when a validated sandbox name is available, prefers validated env-provided sandbox names (with correct precedence), and logs clearer success/warning outcomes for varied exit statuses and missing in-sandbox tooling. * **Tests** * Added tests for sandbox shutdown paths, missing/failed in-sandbox tooling, env-var vs registry name selection and validation (rejecting unsafe names), and ensuring host-side cleanup still runs. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com> Co-authored-by: Aaron Erickson 🦞 <aerickson@nvidia.com>
## Summary Wires the existing `e2e-branch-validation.yaml` reusable workflow into the nightly E2E pipeline as a new `brev-e2e` matrix job. This turns on end-to-end coverage of the **real Brev cloud provisioning + launchable bootstrap path**, which today is only exercised on manual `workflow_dispatch`. ## Why now Previously this couldn't run unattended because short-lived Brev refresh tokens would expire silently — the nightly would happily \`1 skipped, 0 passed\` every night until someone noticed (see the history in #1638). A **long-lived \`BREV_API_TOKEN\`** (CI service-account-style refresh token) is now available, which removes that failure mode. ## What changes ### \`.github/workflows/e2e-branch-validation.yaml\` - Re-enables the repo gate that was commented out during fork testing: \`\`\`yaml if: >- github.repository == 'NVIDIA/NemoClaw' || github.repository == 'jyaunches/NemoClaw' \`\`\` The fork is temporarily whitelisted so the long-lived token can be validated end-to-end before rotation into the upstream secret. ### \`.github/workflows/nightly-e2e.yaml\` - New \`brev-e2e\` job, invoked via \`uses: ./.github/workflows/e2e-branch-validation.yaml\` with a matrix of three test suites: - \`all\` — credential-sanitization + telegram-injection (security regression) - \`messaging-providers\` — Telegram + Discord L7 proxy chain (#1081) - \`full\` — install → onboard → inference → CLI (destroys sandbox on exit) - Added to \`notify-on-failure\` and \`report-to-pr\` \`needs:\` lists - Added to header comment + \`workflow_dispatch\` valid-jobs list - \`fail-fast: false\` so one suite's failure doesn't cancel the others ## Launchable coverage \`use_launchable: true\` (default) — the test harness uses \`scripts/brev-launchable-ci-cpu.sh\` as the Brev startup script. The cloud instance comes up pre-baked with Docker, Node 22, OpenShell CLI, and the NemoClaw repo; the E2E harness then rsyncs the target branch over the launchable's clone and runs the suite. This is distinct from the existing \`launchable-smoke-e2e\` job, which validates the launchable *script* on \`ubuntu-latest\` without any Brev cloud involvement. ## Cost ~\$0.10 per Brev CPU instance × 3 suites × 30 nights ≈ **~\$9/month**. ## Validation plan (before merge) 1. **Fork validation** — paste long-lived \`BREV_API_TOKEN\` + \`NVIDIA_API_KEY\` into \`jyaunches/NemoClaw\` repo secrets. 2. Dispatch \`nightly-e2e.yaml\` on the fork with \`jobs: brev-e2e\` to confirm the matrix works end-to-end on cheap Brev credits. 3. Rotate the long-lived token into \`NVIDIA/NemoClaw\` repo secrets. 4. Merge and monitor the first ~3 scheduled nightlies for token validity + flake rate. 5. Follow-up: remove the \`jyaunches/NemoClaw\` clause from the repo gate once stable. ## Related - #1638 — prior silent-skip footgun that a long-lived token now prevents - #1328 — Flavor 2 GPU launchable (orthogonal — this PR is CPU only) - #3263 — re-enable \`gpu-e2e\` (separate issue, not bundled here) --- ## Update (2026-05-11): switch to long-lived `--api-key` auth Brev CI/CD team issued a proper long-lived API key. Commit `07687a6ad` swaps the auth mechanism: - Brev CLI pinned **v0.6.322 → v0.6.324** (first release exposing `brev login --api-key` / `--org-id` flags) - Secret **`BREV_API_TOKEN` → `BREV_API_KEY`** (old name was misleading — `refresh_token` ≠ `api_key`) + new **`BREV_ORG_ID`** secret - Workflow auth step now calls `brev login --api-key "$BREV_API_KEY" --org-id "$BREV_ORG_ID"` instead of hand-writing `~/.brev/credentials.json` with a refresh_token (original workaround for brev login removal in #1470 is now obsolete) - `workflow_dispatch` override input `brev_token` → `brev_api_key` + new `brev_org_id` No test-harness changes required — the CLI still populates `credentials.json` at the same path, just with `api_key` + `api_key_org_id` fields alongside the existing `access_token` / `refresh_token`. ### Secret checklist (updated) | Secret | jyaunches/NemoClaw (fork) | NVIDIA/NemoClaw (upstream) | |---|---|---| | `BREV_API_KEY` | ✅ set | ⏳ pending | | `BREV_ORG_ID` | ✅ set | ⏳ pending | | `NVIDIA_API_KEY` | ✅ set | ✅ set | | ~~`BREV_API_TOKEN`~~ | obsolete (safe to delete) | obsolete (safe to delete after merge) | <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** - Switched CI auth to required long‑lived API key and org ID; pinned CLI and hardened auth verification with retries. - Improved workflow invocation and branch resolution logic; added inputs to toggle/use published launchable and specify its ID. - Restricted workflow execution scope to upstream repo and specific CI fork. * **Tests** - Added nightly Brev E2E job. - Enabled published-launchable provisioning with a startup‑script fallback. - Improved instance readiness checks and error handling; preserved teardown. [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3350) <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Use the OpenShell provider system for messaging credential injection instead of raw env var passthrough. Discord, Slack, and Telegram tokens now flow through the placeholder/proxy pipeline — sandbox processes never see real values. The host-side Telegram bridge is removed; messaging channels are baked into
openclaw.jsonat image build time viaNEMOCLAW_MESSAGING_CHANNELS_B64, and the L7 proxy rewrites placeholders with real secrets at egress — no runtime config patching needed.Signed-off-by: Aaron Erickson aerickson@nvidia.com
Related Issues
Fixes #1109
Fixes #616
Fixes #1310
Supersedes #617
Changes
bin/lib/onboard.js— Creategenericproviders for Discord, Slack, and Telegram tokens viaupsertProvider(). Attach to sandbox via--providerflags. Replace individual env var deletes with a comprehensive blocklist. Bake messaging channel config intoopenclaw.jsonat build time. Collect Telegram user ID for DM allowlisting.Dockerfile— AcceptNEMOCLAW_MESSAGING_CHANNELS_B64build arg and inject channel config intoopenclaw.jsonat image build time.scripts/nemoclaw-start.sh— Remove dead runtimeopenclaw.jsonpatching fromconfigure_messaging_channels. Allow CLI clients in auto-pair watcher.nemoclaw/src/lib/services.ts— Remove staletelegram-bridgespawn.scripts/telegram-bridge.js— Removed (replaced by native OpenClaw channels via providers).test/onboard.test.js— Verify provider create commands,--providerflags on sandbox create, and that real token values never appear in the sandbox create command.test/credential-exposure.test.js— Updated for expanded blocklist coverage.test/e2e/messaging-providers.test.sh— New E2E test: provider creation, sandbox attachment, DM allowlisting.Thanks
openclaw.jsonimmutable at runtime in non-root mode, and proposed the build-time bake approach viaNEMOCLAW_MESSAGING_CHANNELS_B64that made this work (PR fix(onboard): bake messaging channels into openclaw.json at build time #1501)telegram-bridgespawn inservices.tsthat silently crashed after the bridge script was removedType of Change
Testing
npx prek run --all-filespasses (or equivalentlymake check).npm testpasses.make docsbuilds without warnings. (for doc-only changes)Checklist
General
Code Changes
npx prek run --all-filesauto-fixes formatting (ormake formatfor targeted runs).Doc Changes
N/A
Summary by CodeRabbit
Release Notes
New Features
Tests
Chores