Skip to content

fix(security): use providers for messaging credential injection#1081

Merged
ericksoa merged 78 commits into
mainfrom
feat/messaging-credential-providers
Apr 5, 2026
Merged

fix(security): use providers for messaging credential injection#1081
ericksoa merged 78 commits into
mainfrom
feat/messaging-credential-providers

Conversation

@ericksoa

@ericksoa ericksoa commented Mar 30, 2026

Copy link
Copy Markdown
Contributor

Summary

Use the OpenShell provider system for messaging credential injection instead of raw env var passthrough. Discord, Slack, and Telegram tokens now flow through the placeholder/proxy pipeline — sandbox processes never see real values. The host-side Telegram bridge is removed; messaging channels are baked into openclaw.json at image build time via NEMOCLAW_MESSAGING_CHANNELS_B64, and the L7 proxy rewrites placeholders with real secrets at egress — no runtime config patching needed.

Signed-off-by: Aaron Erickson aerickson@nvidia.com

Related Issues

Fixes #1109
Fixes #616
Fixes #1310
Supersedes #617

Changes

  • bin/lib/onboard.js — Create generic providers for Discord, Slack, and Telegram tokens via upsertProvider(). Attach to sandbox via --provider flags. Replace individual env var deletes with a comprehensive blocklist. Bake messaging channel config into openclaw.json at build time. Collect Telegram user ID for DM allowlisting.
  • Dockerfile — Accept NEMOCLAW_MESSAGING_CHANNELS_B64 build arg and inject channel config into openclaw.json at image build time.
  • scripts/nemoclaw-start.sh — Remove dead runtime openclaw.json patching from configure_messaging_channels. Allow CLI clients in auto-pair watcher.
  • nemoclaw/src/lib/services.ts — Remove stale telegram-bridge spawn.
  • scripts/telegram-bridge.js — Removed (replaced by native OpenClaw channels via providers).
  • test/onboard.test.js — Verify provider create commands, --provider flags on sandbox create, and that real token values never appear in the sandbox create command.
  • test/credential-exposure.test.js — Updated for expanded blocklist coverage.
  • test/e2e/messaging-providers.test.sh — New E2E test: provider creation, sandbox attachment, DM allowlisting.

Thanks

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

  • npx prek run --all-files passes (or equivalently make check).
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)
  • E2E validated with real bot tokens on Brev instance

Checklist

General

Code Changes

  • Formatters applied — npx prek run --all-files auto-fixes formatting (or make format for targeted runs).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs).

Doc Changes

N/A

Summary by CodeRabbit

Release Notes

  • New Features

    • Added interactive messaging channel selection (Telegram, Discord, Slack) during onboarding with optional DM allowlist configuration.
    • Messaging now handled natively via OpenClaw provider system instead of separate bridge service.
  • Tests

    • Added end-to-end test suite for messaging provider credential isolation and network reachability validation.
  • Chores

    • Removed separate Telegram bridge service; simplified host service management to cloudflared only.

Create OpenShell providers for Discord, Slack, and Telegram tokens
during onboard instead of passing them as raw environment variables
into the sandbox. The L7 proxy rewrites Authorization headers
(Bearer/Bot) with real secrets at egress, so sandbox processes never
see the actual token values.

- Discord and Slack tokens flow through provider/placeholder system
- Telegram provider is created for credential storage but host-side
  bridge still reads from host env (Telegram uses URL-path auth that
  the proxy can't rewrite yet)
- Raw env var injection removed for Discord and Slack
@ericksoa ericksoa self-assigned this Mar 30, 2026
@coderabbitai

coderabbitai Bot commented Mar 30, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

This PR was authored by the user configured for CodeRabbit reviews. CodeRabbit does not review PRs authored by this user. It's recommended to use a dedicated user account to post CodeRabbit review feedback.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3210cdcf-8f74-44d2-a471-05249a018076

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough
📝 Walkthrough
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 35.71% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'fix(security): use providers for messaging credential injection' clearly and concisely describes the primary change: replacing direct plaintext environment-variable credential passthrough with OpenShell provider-based credential injection for messaging tokens.
Linked Issues check ✅ Passed The code changes comprehensively address requirements from both #616 and #1109: provider-based credential injection replaces plaintext env var passthrough, blockedSandboxEnvNames implements robust credential blocklisting, messaging providers are attached via --provider flags, token values are excluded from sandbox environment, and host-side Telegram bridge is preserved as documented interim solution.
Out of Scope Changes check ✅ Passed All changes directly support the core objective of migrating messaging credentials to provider-based injection. Minor scope adjustments (telegram-bridge.js removal, scripts/start-services.sh simplification, scripts/nemoclaw-start.sh channel configuration) are logical consequences of moving credential handling from host-side bridge to in-sandbox provider system.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@ericksoa ericksoa changed the title feat(onboard): use providers for messaging credential injection fix(onboard): use providers for messaging credential injection Mar 30, 2026
@ericksoa ericksoa changed the title fix(onboard): use providers for messaging credential injection fix(security): use providers for messaging credential injection Mar 30, 2026
@ericksoa ericksoa added the security Potential vulnerability, unsafe behavior, or access risk label Mar 30, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
bin/lib/onboard.js (1)

1816-1819: Consider adding a defensive test for credential deletion.

Per the test at test/credential-exposure.test.js:76-84, the current tests verify that credentials aren't pushed to envArgs, but don't assert that the delete statements exist. This means if the deletion logic were accidentally removed, the test would still pass.

Consider adding a test that explicitly verifies the delete statements are present, or that sandboxEnv passed to streamSandboxCreate excludes these keys.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1816 - 1819, Add a defensive test that
ensures the credential keys are removed before creating the sandbox: call or
simulate the same code path that builds sandboxEnv (the code that copies
process.env into sandboxEnv and runs delete for NVIDIA_API_KEY,
DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN) and assert that sandboxEnv no longer has
those properties (or mock/spy the call to streamSandboxCreate and assert the
passed env object excludes those keys). Locate the logic around sandboxEnv and
the call to streamSandboxCreate and add a unit test that explicitly checks
sandboxEnv[NVIDIA_API_KEY], sandboxEnv[DISCORD_BOT_TOKEN], and
sandboxEnv[SLACK_BOT_TOKEN] are undefined (or that streamSandboxCreate received
an env object without those keys).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1817-1819: The code deletes Discord and Slack tokens from the
sandboxEnv but misses removing TELEGRAM_BOT_TOKEN, so add a deletion for
sandboxEnv.TELEGRAM_BOT_TOKEN in the same block (the same place where delete
sandboxEnv.NVIDIA_API_KEY, delete sandboxEnv.DISCORD_BOT_TOKEN, delete
sandboxEnv.SLACK_BOT_TOKEN are called) before calling streamSandboxCreate;
ensure sandboxEnv no longer contains TELEGRAM_BOT_TOKEN when passed into
streamSandboxCreate to prevent leaking the token into the sandbox environment.

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 1816-1819: Add a defensive test that ensures the credential keys
are removed before creating the sandbox: call or simulate the same code path
that builds sandboxEnv (the code that copies process.env into sandboxEnv and
runs delete for NVIDIA_API_KEY, DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN) and assert
that sandboxEnv no longer has those properties (or mock/spy the call to
streamSandboxCreate and assert the passed env object excludes those keys).
Locate the logic around sandboxEnv and the call to streamSandboxCreate and add a
unit test that explicitly checks sandboxEnv[NVIDIA_API_KEY],
sandboxEnv[DISCORD_BOT_TOKEN], and sandboxEnv[SLACK_BOT_TOKEN] are undefined (or
that streamSandboxCreate received an env object without those keys).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 56f004ea-2e71-4082-85da-58cba9169f7b

📥 Commits

Reviewing files that changed from the base of the PR and between a146385 and 6aa13c8.

📒 Files selected for processing (1)
  • bin/lib/onboard.js

Comment thread bin/lib/onboard.js Outdated
Add verifyProviderExists() check post-sandbox-creation to confirm
messaging credential providers are actually in the gateway. Warns
with remediation steps if a provider is missing.
…aging tokens

Verify that messaging credentials flow through the provider system:
- Provider create commands issued with correct credential key names
- Sandbox create includes --provider flags for all detected tokens
- Real token values never appear in sandbox create command or env

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
bin/lib/onboard.js (1)

1815-1824: ⚠️ Potential issue | 🔴 Critical

sandboxEnv is still a secret leak path.

createSandbox() runs after setupInference(), and setupInference() hydrates the selected inference credential back into process.env. Starting from { ...process.env } and deleting only three names still forwards raw OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, COMPATIBLE_API_KEY, COMPATIBLE_ANTHROPIC_API_KEY, and TELEGRAM_BOT_TOKEN when present, which contradicts the “non-sensitive env vars only” comment.

Proposed fix
-  const sandboxEnv = { ...process.env };
-  delete sandboxEnv.NVIDIA_API_KEY;
-  delete sandboxEnv.DISCORD_BOT_TOKEN;
-  delete sandboxEnv.SLACK_BOT_TOKEN;
+  const blockedSandboxEnvNames = new Set([
+    "NVIDIA_API_KEY",
+    "OPENAI_API_KEY",
+    "ANTHROPIC_API_KEY",
+    "GEMINI_API_KEY",
+    "COMPATIBLE_API_KEY",
+    "COMPATIBLE_ANTHROPIC_API_KEY",
+    "DISCORD_BOT_TOKEN",
+    "SLACK_BOT_TOKEN",
+    "TELEGRAM_BOT_TOKEN",
+  ]);
+  const sandboxEnv = Object.fromEntries(
+    Object.entries(process.env).filter(([name]) => !blockedSandboxEnvNames.has(name))
+  );

An explicit allowlist would be safer long-term, but the current blocklist is incomplete even for credentials this file already hydrates itself.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1815 - 1824, The sandboxEnv assignment leaks
secret keys because it clones process.env then deletes a few tokens; change it
to build sandboxEnv from an explicit allowlist of non-sensitive variables
instead of spreading process.env. In the block that currently defines envArgs
and sandboxEnv (symbols: formatEnvAssignment, envArgs, sandboxEnv), replace the
copy-and-delete approach with a whitelist array of safe env names and populate
sandboxEnv only from those names (ensuring formatEnvAssignment still sets
CHAT_UI_URL), and remove reliance on setupInference() hydrating secrets into
process.env when constructing sandboxEnv.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1789-1791: The Telegram branch creates the provider by calling
upsertProvider() but doesn't populate the host env from stored credentials, so
legacy bridge won't see TELEGRAM_BOT_TOKEN; call hydrateCredentialEnv() (the
helper that writes stored credentials into process.env) before creating the
Telegram provider and before any early return that creates/upserts the provider
(i.e., add hydrateCredentialEnv() calls in the Telegram handling branches around
where upsertProvider() is invoked) so stored tokens are applied to process.env
prior to provider creation.
- Around line 480-483: The current verifyProviderExists function only checks
output text for "not found" which can hide command failures or an unbound
sandbox; change verifyProviderExists (and the duplicate call site around the
other provider-check code) to rely on the command exit status from
runCaptureOpenshell (or its underlying spawn API) instead of string matching,
and additionally validate the provider is attached to the target sandbox by
running a sandbox-scoped query (e.g., list/inspect the sandbox's providers) and
confirming the provider name appears in that result; update logic in
verifyProviderExists to return false on non-zero exit or when the
sandbox-attached provider list does not include the given name.

---

Duplicate comments:
In `@bin/lib/onboard.js`:
- Around line 1815-1824: The sandboxEnv assignment leaks secret keys because it
clones process.env then deletes a few tokens; change it to build sandboxEnv from
an explicit allowlist of non-sensitive variables instead of spreading
process.env. In the block that currently defines envArgs and sandboxEnv
(symbols: formatEnvAssignment, envArgs, sandboxEnv), replace the copy-and-delete
approach with a whitelist array of safe env names and populate sandboxEnv only
from those names (ensuring formatEnvAssignment still sets CHAT_UI_URL), and
remove reliance on setupInference() hydrating secrets into process.env when
constructing sandboxEnv.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa829268-dd6c-401d-87b5-6321a6d7a4e4

📥 Commits

Reviewing files that changed from the base of the PR and between 6aa13c8 and d898544.

📒 Files selected for processing (1)
  • bin/lib/onboard.js

Comment thread bin/lib/onboard.js Outdated
Comment thread bin/lib/onboard.js Outdated
- Use exit code instead of string matching in verifyProviderExists()
- Use hydrateCredentialEnv() for Telegram so host-side bridge can
  find the token from stored credentials
- Replace individual delete statements with a blocklist that covers
  all credential env vars (NVIDIA, OpenAI, Anthropic, Gemini,
  compatible endpoints, Discord, Slack, Telegram)
@ericksoa ericksoa marked this pull request as draft March 30, 2026 03:07

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
bin/lib/onboard.js (1)

480-483: ⚠️ Potential issue | 🟠 Major

This still gives false confidence about provider attachment.

verifyProviderExists() only runs openshell provider get <name>, so it proves the provider exists globally, not that the --provider flag actually attached it to this sandbox. The comment on Line 1920 says the opposite. Because this goes through runOpenshell(), the raw provider get output/errors will also hit onboarding logs before your custom warning. Please make this probe silent and inspect sandbox-attached providers instead; if the CLI cannot do that yet, at least rename this to an existence check so it does not imply attachment validation.

Also applies to: 1920-1927

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/onboard.test.js`:
- Around line 1032-1144: The test must also assert that sensitive env vars are
not forwarded in the sandbox env: after locating payload.commands and the
createCommand entry (as you already do), add assertions that createCommand.env
is either null/undefined or does not contain the keys/values for
DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN, TELEGRAM_BOT_TOKEN (and the new API-key
blocklist key if applicable); reference the existing payload.commands array and
the createCommand variable and assert createCommand.env does not include the
secret names or the literal token values used earlier in the harness.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 37c32b17-312a-4a82-85ed-8918760e22b7

📥 Commits

Reviewing files that changed from the base of the PR and between d898544 and 9d4d5e1.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • test/onboard.test.js

Comment thread test/onboard.test.js Outdated
The test pattern-matched on the old `{ ...process.env }` spread.
Update to verify the blocklist approach that strips all credential
env vars from sandboxEnv.
@WuKongAI-CMU

Copy link
Copy Markdown
Contributor

Reviewed the diff — this looks solid. A few observations from the OpenClaw side:

  1. Telegram URL-path auth caveat is well-handled — the comment about /bot{TOKEN}/ needing future proxy work is correct. The L7 proxy's header rewriting can't cover URL-path credentials today, so the host-side bridge approach is the right interim solution.

  2. blockedSandboxEnvNames is a great pattern — switching from delete of individual keys to a blocklist Set is more robust and easier to audit. Consider whether COMPATIBLE_ANTHROPIC_API_KEY and similar future credential env vars should be auto-discovered from the provider list rather than hardcoded, to prevent drift.

  3. Provider verification post-createverifyProviderExists only checks the provider exists, not that it's actually attached to the sandbox. If --provider silently fails during sandbox create, the provider might exist in the gateway but not be wired to the sandbox. Might be worth a openshell sandbox get <name> --json check to verify the provider attachment.

Happy to help test this against real Discord/Slack/Telegram bridges if useful.

…tial-providers

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

# Conflicts:
#	bin/lib/onboard.js
Rename verifyProviderExists to providerExistsInGateway to clarify it
only checks gateway-level existence, not sandbox attachment. Add
BEDROCK_API_KEY to the credential blocklist (merged via #963). Add
env var leakage assertions to the messaging provider test. Add JSDoc
to buildProviderArgs, upsertProvider, and providerExistsInGateway.
Increase messaging provider test timeout for post-merge code paths.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa marked this pull request as ready for review March 31, 2026 18:35

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/credential-exposure.test.js (1)

79-85: Test assertions are loose but follow existing patterns.

These regex checks verify the strings appear somewhere in the source file, not specifically within the blockedSandboxEnvNames Set. This could pass if a credential name only appears in a comment while removed from the blocklist.

Consider tightening the assertions to verify the blocklist structure more precisely:

🔧 Suggested improvement
-    // sandboxEnv must be built with a blocklist that strips all credential env vars
-    expect(src).toMatch(/blockedSandboxEnvNames/);
-    expect(src).toMatch(/NVIDIA_API_KEY/);
-    expect(src).toMatch(/BEDROCK_API_KEY/);
-    expect(src).toMatch(/DISCORD_BOT_TOKEN/);
-    expect(src).toMatch(/SLACK_BOT_TOKEN/);
-    expect(src).toMatch(/TELEGRAM_BOT_TOKEN/);
+    // sandboxEnv must be built with a blocklist that strips all credential env vars
+    expect(src).toMatch(/blockedSandboxEnvNames\s*=\s*new\s+Set\(\[/);
+    expect(src).toMatch(/blockedSandboxEnvNames.*"NVIDIA_API_KEY"/s);
+    expect(src).toMatch(/blockedSandboxEnvNames.*"BEDROCK_API_KEY"/s);
+    expect(src).toMatch(/blockedSandboxEnvNames.*"DISCORD_BOT_TOKEN"/s);
+    expect(src).toMatch(/blockedSandboxEnvNames.*"SLACK_BOT_TOKEN"/s);
+    expect(src).toMatch(/blockedSandboxEnvNames.*"TELEGRAM_BOT_TOKEN"/s);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/credential-exposure.test.js` around lines 79 - 85, The current
assertions only check that credential names exist somewhere in the source;
tighten them to verify those names are actually inside the
blockedSandboxEnvNames set. Update the test around blockedSandboxEnvNames to
assert the Set initializer/contents include each credential (e.g., match a
pattern where "blockedSandboxEnvNames" is assigned to a Set/array literal and
the literal text contains "NVIDIA_API_KEY", "BEDROCK_API_KEY", etc.), or parse
the src to extract the blockedSandboxEnvNames value and assert the Set contains
those keys; target the "blockedSandboxEnvNames" symbol so the credentials are
validated inside the blocklist rather than anywhere in the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/credential-exposure.test.js`:
- Around line 79-85: The current assertions only check that credential names
exist somewhere in the source; tighten them to verify those names are actually
inside the blockedSandboxEnvNames set. Update the test around
blockedSandboxEnvNames to assert the Set initializer/contents include each
credential (e.g., match a pattern where "blockedSandboxEnvNames" is assigned to
a Set/array literal and the literal text contains "NVIDIA_API_KEY",
"BEDROCK_API_KEY", etc.), or parse the src to extract the blockedSandboxEnvNames
value and assert the Set contains those keys; target the
"blockedSandboxEnvNames" symbol so the credentials are validated inside the
blocklist rather than anywhere in the file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d8c09f02-0c95-4caa-993e-7bd03205125f

📥 Commits

Reviewing files that changed from the base of the PR and between 9d4d5e1 and 6d23a16.

📒 Files selected for processing (3)
  • bin/lib/onboard.js
  • test/credential-exposure.test.js
  • test/onboard.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/onboard.test.js

…tial-providers

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Add direct in-process tests for buildProviderArgs (all branch paths),
formatEnvAssignment, compactText, getNavigationChoice,
parsePolicyPresetEnv, summarizeCurlFailure, and summarizeProbeFailure.
Add subprocess tests for upsertProvider (create, update fallback, and
both-fail error paths), providerExistsInGateway (true/false), and
hydrateCredentialEnv (null input, stored credential, missing key).
Export newly-tested functions from onboard.js.

CLI function coverage: 33.24% (threshold 32%).

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
test/onboard.test.js (1)

1372-1504: Add a failing-upsert path test for createSandbox() provider attachment.

Current coverage is happy-path only. Please add a case where one messaging provider upsert returns { ok: false } and assert sandbox creation either aborts or skips attaching that provider. This will guard against silent broken wiring regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/onboard.test.js` around lines 1372 - 1504, Add a new test case (another
it block) that calls createSandbox but mocks the provider upsert path to
simulate a failure: in the injected script override runner.run or
runner.runCapture (the same symbol used in the existing messaging test) so that
the provider create/upsert command returns a payload indicating failure (e.g., a
string or status that your production code interprets as { ok: false }), then
assert that createSandbox either aborts (nonzero exit) or that the provider
create command is not followed by attaching that provider to the sandbox
(inspect the captured commands and spawn env like in the existing test). Reuse
the same helpers in the script (runner.run, runner.runCapture,
registry.registerSandbox, childProcess.spawn, createSandbox) and assert the
expected behavior for the failing-upsert case (sandbox creation skipped or
provider absent).
bin/lib/onboard.js (1)

915-917: Suppress CLI probe noise in provider existence checks.

On Line 916, this uses runOpenshell() with default stdio, so negative probes can emit raw CLI errors unnecessarily. Consider piping stdio for cleaner onboarding output.

💡 Proposed tweak
 function providerExistsInGateway(name) {
-  const result = runOpenshell(["provider", "get", name], { ignoreError: true });
+  const result = runOpenshell(["provider", "get", name], {
+    ignoreError: true,
+    stdio: ["ignore", "pipe", "pipe"],
+  });
   return result.status === 0;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 915 - 917, The providerExistsInGateway
function currently calls runOpenshell(["provider", "get", name], { ignoreError:
true }) which leaves stdio at the default and allows CLI error output to leak;
change the call in providerExistsInGateway to pass a stdio option that
pipes/suppresses subprocess output (e.g., include stdio: "pipe" or equivalent in
the options object) so negative probes don't emit raw CLI errors while
preserving ignoreError: true and the existing status check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 2223-2240: The upsertProvider calls for "discord-bridge",
"slack-bridge", and "telegram-bridge" may fail but their results are ignored,
yet the code still pushes the provider names into messagingProviders and later
appends them to createArgs; modify the logic around upsertProvider (the three
calls) to capture its return/throw, verify success (e.g., truthy result or no
thrown error) before pushing the provider name into messagingProviders and
adding "--provider" to createArgs, and ensure errors from upsertProvider are
handled (log/throw/skip) so failed upserts do not result in invalid --provider
flags; relevant symbols: upsertProvider, messagingProviders, createArgs,
getCredential, hydrateCredentialEnv.

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 915-917: The providerExistsInGateway function currently calls
runOpenshell(["provider", "get", name], { ignoreError: true }) which leaves
stdio at the default and allows CLI error output to leak; change the call in
providerExistsInGateway to pass a stdio option that pipes/suppresses subprocess
output (e.g., include stdio: "pipe" or equivalent in the options object) so
negative probes don't emit raw CLI errors while preserving ignoreError: true and
the existing status check.

In `@test/onboard.test.js`:
- Around line 1372-1504: Add a new test case (another it block) that calls
createSandbox but mocks the provider upsert path to simulate a failure: in the
injected script override runner.run or runner.runCapture (the same symbol used
in the existing messaging test) so that the provider create/upsert command
returns a payload indicating failure (e.g., a string or status that your
production code interprets as { ok: false }), then assert that createSandbox
either aborts (nonzero exit) or that the provider create command is not followed
by attaching that provider to the sandbox (inspect the captured commands and
spawn env like in the existing test). Reuse the same helpers in the script
(runner.run, runner.runCapture, registry.registerSandbox, childProcess.spawn,
createSandbox) and assert the expected behavior for the failing-upsert case
(sandbox creation skipped or provider absent).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8c66deaf-e3f2-435f-925a-6b11b2785952

📥 Commits

Reviewing files that changed from the base of the PR and between 6d23a16 and 1824df4.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • test/onboard.test.js

Comment thread bin/lib/onboard.js Outdated
…tial-providers

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

# Conflicts:
#	bin/lib/onboard.js
#	test/onboard.test.js
…tial-providers

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

# Conflicts:
#	bin/lib/onboard.js
ericksoa added a commit that referenced this pull request Apr 8, 2026
When openshell sandbox create exits with SSH 255 after printing
"Created sandbox:", NemoClaw previously hard-exited instead of checking
whether the sandbox reached Ready state. This was a regression from
the create-stream extraction (#1516) combined with the messaging
provider migration path (#1081, #1527) that forces sandbox recreation.

Two fixes:
- streamSandboxCreate: do one final readyCheck on non-zero close to
  catch the race where the sandbox is already Ready when SSH dies.
- onboard.js: when failure is sandbox_create_incomplete, fall through
  to the existing ready-wait loop (60s polling) instead of exiting.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
ericksoa added a commit that referenced this pull request Apr 8, 2026
…1598)

## Summary
- When `openshell sandbox create` exits with SSH 255 after printing
"Created sandbox:", NemoClaw now treats this as recoverable instead of
hard-exiting
- Added a final `readyCheck` in `streamSandboxCreate` on non-zero close
to catch the race where the sandbox is already Ready when SSH dies
- In `onboard.js`, `sandbox_create_incomplete` failures now fall through
to the existing 60s ready-wait loop instead of calling `process.exit()`

## Root cause
Regression from #1516 (create-stream extraction) combined with
#1081/#1527 (messaging provider migration forcing sandbox recreation).
The create stream returns non-zero after "Created sandbox:" and NemoClaw
exits before checking if the sandbox reached Ready state.

## Test plan
- [x] New unit tests: stream recovers when readyCheck is true at close
time (SSH 255)
- [x] New unit test: stream still returns non-zero when readyCheck is
false at close time
- [x] All existing `sandbox-create-stream` tests pass (7/7)
- [x] All onboard integration tests pass (83/83)
- [x] All src unit tests pass (538/538)
- [ ] Manual: trigger SSH 255 during sandbox create and verify recovery

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Enhanced sandbox creation error handling with improved failure
classification and recovery messaging.
* Improved sandbox readiness detection to prevent unnecessary creation
failures when the sandbox is operational.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
dknos pushed a commit to dknos/NemoClaw that referenced this pull request Apr 8, 2026
Auto-merged: openclaw-sandbox.yaml, package.json, scripts/debug.sh

Conflict resolved:
- scripts/telegram-bridge.js: NVIDIA deleted it in NVIDIA#1081 (provider-based
  messaging), but we keep our version because the host-side bridge is
  still actively running in our PM2 (uptime >100m) and referenced by
  scripts/monitor.js health checks.

Incoming upstream highlights:
- Security: SSRF validation coverage, shell-quoted sandboxName, DELETE
  wildcard restrictions, SSH 255 recovery
- Runtime: Podman support (macOS/Linux), signal handlers in start script
- CLI: nvapi- key validation, reuse-sandbox prompts, credentials reset
- Policy presets: brew, HuggingFace router, sentry scope
- Docs: security credential storage, OpenShell lifecycle split
dknos added a commit to dknos/NemoClaw that referenced this pull request Apr 8, 2026
Auto-merged: openclaw-sandbox.yaml, package.json, scripts/debug.sh

Conflict resolved:
- scripts/telegram-bridge.js: NVIDIA deleted it in NVIDIA#1081 (provider-based
  messaging), but we keep our version because the host-side bridge is
  still actively running in our PM2 (uptime >100m) and referenced by
  scripts/monitor.js health checks.

Incoming upstream highlights:
- Security: SSRF validation coverage, shell-quoted sandboxName, DELETE
  wildcard restrictions, SSH 255 recovery
- Runtime: Podman support (macOS/Linux), signal handlers in start script
- CLI: nvapi- key validation, reuse-sandbox prompts, credentials reset
- Policy presets: brew, HuggingFace router, sentry scope
- Docs: security credential storage, OpenShell lifecycle split
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026
…IA#1081)

## Summary

Use the OpenShell provider system for messaging credential injection
instead of raw env var passthrough. Discord, Slack, and Telegram tokens
now flow through the placeholder/proxy pipeline — sandbox processes
never see real values. The host-side Telegram bridge is removed;
messaging channels are baked into `openclaw.json` at image build time
via `NEMOCLAW_MESSAGING_CHANNELS_B64`, and the L7 proxy rewrites
placeholders with real secrets at egress — no runtime config patching
needed.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

## Related Issues

Fixes NVIDIA#1109
Fixes NVIDIA#616
Fixes NVIDIA#1310
Supersedes NVIDIA#617

## Changes

- **`bin/lib/onboard.js`** — Create `generic` providers for Discord,
Slack, and Telegram tokens via `upsertProvider()`. Attach to sandbox via
`--provider` flags. Replace individual env var deletes with a
comprehensive blocklist. Bake messaging channel config into
`openclaw.json` at build time. Collect Telegram user ID for DM
allowlisting.
- **`Dockerfile`** — Accept `NEMOCLAW_MESSAGING_CHANNELS_B64` build arg
and inject channel config into `openclaw.json` at image build time.
- **`scripts/nemoclaw-start.sh`** — Remove dead runtime `openclaw.json`
patching from `configure_messaging_channels`. Allow CLI clients in
auto-pair watcher.
- **`nemoclaw/src/lib/services.ts`** — Remove stale `telegram-bridge`
spawn.
- **`scripts/telegram-bridge.js`** — Removed (replaced by native
OpenClaw channels via providers).
- **`test/onboard.test.js`** — Verify provider create commands,
`--provider` flags on sandbox create, and that real token values never
appear in the sandbox create command.
- **`test/credential-exposure.test.js`** — Updated for expanded
blocklist coverage.
- **`test/e2e/messaging-providers.test.sh`** — New E2E test: provider
creation, sandbox attachment, DM allowlisting.

## Thanks

- @sayalinvidia — tested Discord end-to-end, diagnosed that Landlock
makes `openclaw.json` immutable at runtime in non-root mode, and
proposed the build-time bake approach via
`NEMOCLAW_MESSAGING_CHANNELS_B64` that made this work (PR NVIDIA#1501)
- @mercl-lau — found the stale `telegram-bridge` spawn in `services.ts`
that silently crashed after the bridge script was removed
- @stevenrick — tested Telegram on Brev, independently confirmed the
Landlock issue, and found that the auto-pair watcher rejected CLI
clients (also opened NVIDIA#1496)

## Type of Change

- [x] Code change for a new feature, bug fix, or refactor.
- [ ] Code change with doc updates.
- [ ] Doc only. Prose changes without code sample modifications.
- [ ] Doc only. Includes code sample changes.

## Testing

- [x] `npx prek run --all-files` passes (or equivalently `make check`).
- [x] `npm test` passes.
- [ ] `make docs` builds without warnings. (for doc-only changes)
- [x] E2E validated with real bot tokens on Brev instance

## Checklist

### General

- [x] I have read and followed the [contributing
guide](https://github.com/NVIDIA/NemoClaw/blob/main/CONTRIBUTING.md).
- [ ] I have read and followed the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md).
(for doc-only changes)

### Code Changes

- [x] Formatters applied — `npx prek run --all-files` auto-fixes
formatting (or `make format` for targeted runs).
- [x] Tests added or updated for new or changed behavior.
- [x] No secrets, API keys, or credentials committed.
- [ ] Doc pages updated for any user-facing behavior changes (new
commands, changed defaults, new features, bug fixes that contradict
existing docs).

### Doc Changes

N/A

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: sayalinvidia <sayalinvidia@users.noreply.github.com>
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026
## Summary

- Reverts the unrelated `CLAUDE.md` modification that was inadvertently
included in the security credential injection PR (NVIDIA#1081)
- The added "Claude Behavior Rules" section was never reviewed as part
of that security change and was not requested

## Test plan

- [x] Verify `CLAUDE.md` matches its pre-NVIDIA#1081 state
- [x] No other files affected

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
  * Removed obsolete internal documentation entries.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
gemini2026 pushed a commit to gemini2026/NemoClaw that referenced this pull request Apr 14, 2026
…VIDIA#1598)

## Summary
- When `openshell sandbox create` exits with SSH 255 after printing
"Created sandbox:", NemoClaw now treats this as recoverable instead of
hard-exiting
- Added a final `readyCheck` in `streamSandboxCreate` on non-zero close
to catch the race where the sandbox is already Ready when SSH dies
- In `onboard.js`, `sandbox_create_incomplete` failures now fall through
to the existing 60s ready-wait loop instead of calling `process.exit()`

## Root cause
Regression from NVIDIA#1516 (create-stream extraction) combined with
NVIDIA#1081/NVIDIA#1527 (messaging provider migration forcing sandbox recreation).
The create stream returns non-zero after "Created sandbox:" and NemoClaw
exits before checking if the sandbox reached Ready state.

## Test plan
- [x] New unit tests: stream recovers when readyCheck is true at close
time (SSH 255)
- [x] New unit test: stream still returns non-zero when readyCheck is
false at close time
- [x] All existing `sandbox-create-stream` tests pass (7/7)
- [x] All onboard integration tests pass (83/83)
- [x] All src unit tests pass (538/538)
- [ ] Manual: trigger SSH 255 during sandbox create and verify recovery

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Enhanced sandbox creation error handling with improved failure
classification and recovery messaging.
* Improved sandbox readiness detection to prevent unnecessary creation
failures when the sandbox is operational.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
brandonpelfrey added a commit that referenced this pull request Apr 16, 2026
`nemoclaw stop` only killed the host-side `cloudflared` process but left
the OpenClaw gateway (and its Telegram/Discord/Slack messaging channels)
running inside the sandbox. Bots continued responding to messages after
stop.

After PR #1081 moved messaging bridges into the sandbox, no stop
mechanism was added to replace the old host-side process kill.

Add `stopSandboxChannels()` to `services.ts` that execs into the
sandbox via `openshell sandbox exec <name> -- pkill -TERM -f "openclaw
gateway"`. The gateway's own signal handler (`cleanup()` in
nemoclaw-start.sh) forwards SIGTERM to child processes, which stops all
messaging channel polling.

`stopAll()` now calls `stopSandboxChannels()` before stopping host-side
services. When the sandbox name is unavailable or openshell is not
installed, it warns and continues — host-side cleanup still proceeds.

- [x] `npx vitest run --project cli src/lib/services.test.ts` — 14 existing tests pass
- [x] `npx vitest run --project cli src/lib/services-sandbox.test.ts` — 9 new tests pass
- [x] `npx vitest run --project cli src/lib/services-command.test.ts` — 4 tests pass
- [x] `npm test` — full suite (1635 tests) passes
- [x] `npm run typecheck:cli` — clean
- [ ] Manual: `nemoclaw stop` → Telegram/Discord bots stop responding

Signed-off-by: Claude <noreply@anthropic.com>
brandonpelfrey added a commit that referenced this pull request Apr 16, 2026
`nemoclaw stop` only killed the host-side `cloudflared` process but left
the OpenClaw gateway (and its Telegram/Discord/Slack messaging channels)
running inside the sandbox. Bots continued responding to messages after
stop.

After PR #1081 moved messaging bridges into the sandbox, no stop
mechanism was added to replace the old host-side process kill.

Add `stopSandboxChannels()` to `services.ts` that execs into the
sandbox via `openshell sandbox exec <name> -- pkill -TERM -f "openclaw
gateway"`. The gateway's own signal handler (`cleanup()` in
nemoclaw-start.sh) forwards SIGTERM to child processes, which stops all
messaging channel polling.

`stopAll()` now calls `stopSandboxChannels()` before stopping host-side
services. When the sandbox name is unavailable or openshell is not
installed, it warns and continues — host-side cleanup still proceeds.

- [x] `npx vitest run --project cli src/lib/services.test.ts` — 14 existing tests pass
- [x] `npx vitest run --project cli src/lib/services-sandbox.test.ts` — 9 new tests pass
- [x] `npx vitest run --project cli src/lib/services-command.test.ts` — 4 tests pass
- [x] `npm test` — full suite (1635 tests) passes
- [x] `npm run typecheck:cli` — clean
- [ ] Manual: `nemoclaw stop` → Telegram/Discord bots stop responding

Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com>
ericksoa pushed a commit that referenced this pull request Apr 22, 2026
## Summary

Fixes #2103 where `nemoclaw stop` lied — it only stopped the cloudflared
tunnel, not Telegram/Discord bridges, which moved inside the sandbox in
#1081 and couldn't be paused without destroying the sandbox.

- **Rename** `nemoclaw stop/start` → `nemoclaw tunnel stop/start`.
Legacy `stop`/`start` kept as deprecated aliases that print a warning
and delegate.
- **New** `nemoclaw <sandbox> channels stop <channel>` and `channels
start <channel>`. Marks the channel disabled in the per-sandbox
registry, rebuilds, and the onboard code skips registering that bridge
with the gateway. Credentials stay in the keychain — non-destructive,
unlike `channels remove`.

## Why rebuild and not `openshell sandbox exec`

Aaron floated the shields-down exec-into-sandbox pattern in the #2103
thread. On closer look, that path doesn't work cleanly for this:
- `openclaw.json` is read-only at runtime (Landlock + `root:root 444`),
and NemoClaw's sandbox entrypoint explicitly intercepts `openclaw
channels add/remove` from inside with a host-side-only error.
- `openclaw gateway stop` is too coarse — kills every channel plus agent
comms.
- SIGTERM on the bridge process works but doesn't persist across a
gateway restart.
- No per-channel runtime CLI exists in OpenClaw today.

So this PR mirrors `channels remove`'s config-patch + rebuild plumbing.
A rebuild takes a few minutes and preserves workspace data, which is
strictly better than the current "delete the sandbox" workaround.

## Note on #2103's policy-remove claim

#2103's issue body said "No remove/delete/revoke logic exists anywhere
in `policies.ts`." That's stale — `removePreset()` at
`src/lib/policies.ts:278` and the `nemoclaw <sandbox> policy-remove
<preset>` CLI dispatch at `src/nemoclaw.ts:3088` already exist. No code
change needed for that line item.

## Test plan

- [x] `npx vitest run --project cli` — 1740 passed, 7 skipped (3 new
registry tests for `setChannelDisabled`/`getDisabledChannels`)
- [x] `npm run typecheck:cli` clean
- [x] Manual: help text renders new commands, `nemoclaw stop` prints
deprecation + still works, `nemoclaw tunnel stop` works directly,
`nemoclaw tunnel bogus` prints usage
- [x] Manual: `nemoclaw <sandbox> channels stop telegram` with fake
registry → registry gains `disabledChannels: ["telegram"]`; `channels
start telegram` removes the entry; dry-run honored; unknown channel
rejected
- [ ] Not yet validated against a live sandbox rebuild (the onboard-side
filter that excludes `disabledChannels` from `messagingTokenDefs` is a
one-liner, but hasn't been E2E-tested end-to-end)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
  * Added `tunnel start|stop` command for service management
* Added `channels start|stop` commands to enable/disable specific
messaging channels in sandboxes

* **Deprecations**
* Marked `nemoclaw start|stop` as deprecated; use `tunnel start|stop`
instead

* **Tests**
  * Added test coverage for channel enable/disable functionality

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
brandonpelfrey added a commit that referenced this pull request May 1, 2026
<!-- markdownlint-disable MD041 -->
## Summary

`nemoclaw stop` only killed the host-side `cloudflared` process but left
the OpenClaw gateway (and its Telegram/Discord/Slack messaging channels)
running inside the sandbox. After PR #1081 moved bridges into the
sandbox, no stop mechanism was added. This PR adds
`stopSandboxChannels()` which execs into the sandbox via `openshell
sandbox exec` to SIGTERM the gateway process, stopping all messaging
channel polling.

## Related Issue

Fixes #1825

## Changes

- Add `stopSandboxChannels()` to `src/lib/services.ts` — resolves
`openshell`, execs `pkill -TERM -f "openclaw gateway"` inside the
sandbox, and handles all pkill exit codes (0 = stopped, 1 = already
stopped, >1 = unreachable)
- Update `stopAll()` to call `stopSandboxChannels()` before stopping
host-side services; reads sandbox name from opts, `NEMOCLAW_SANDBOX`, or
`SANDBOX_NAME` env vars; warns when no sandbox name is available
- Import `spawnSync` from `node:child_process` and `resolveOpenshell`
from `./resolve-openshell`
- Add `src/lib/services-sandbox.test.ts` with 9 tests covering:
successful gateway stop, pkill exit 1 (no match), unreachable sandbox,
null status (timeout), openshell not found, stopAll with sandbox name,
stopAll without sandbox name, cloudflared cleanup on exec failure, and
env var fallback

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [ ] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

## AI Disclosure

- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Introduced sandbox-aware shutdown flow: attempts in-sandbox gateway
cleanup when a validated sandbox name is available, prefers validated
env-provided sandbox names (with correct precedence), and logs clearer
success/warning outcomes for varied exit statuses and missing in-sandbox
tooling.

* **Tests**
* Added tests for sandbox shutdown paths, missing/failed in-sandbox
tooling, env-var vs registry name selection and validation (rejecting
unsafe names), and ensuring host-side cleanup still runs.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com>
Co-authored-by: Aaron Erickson 🦞 <aerickson@nvidia.com>
ericksoa pushed a commit that referenced this pull request May 11, 2026
## Summary

Wires the existing `e2e-branch-validation.yaml` reusable workflow into
the nightly E2E pipeline as a new `brev-e2e` matrix job. This turns on
end-to-end coverage of the **real Brev cloud provisioning + launchable
bootstrap path**, which today is only exercised on manual
`workflow_dispatch`.

## Why now

Previously this couldn't run unattended because short-lived Brev refresh
tokens would expire silently — the nightly would happily \`1 skipped, 0
passed\` every night until someone noticed (see the history in #1638). A
**long-lived \`BREV_API_TOKEN\`** (CI service-account-style refresh
token) is now available, which removes that failure mode.

## What changes

### \`.github/workflows/e2e-branch-validation.yaml\`
- Re-enables the repo gate that was commented out during fork testing:
  \`\`\`yaml
  if: >-
    github.repository == 'NVIDIA/NemoClaw' ||
    github.repository == 'jyaunches/NemoClaw'
  \`\`\`
The fork is temporarily whitelisted so the long-lived token can be
validated end-to-end before rotation into the upstream secret.

### \`.github/workflows/nightly-e2e.yaml\`
- New \`brev-e2e\` job, invoked via \`uses:
./.github/workflows/e2e-branch-validation.yaml\` with a matrix of three
test suites:
- \`all\` — credential-sanitization + telegram-injection (security
regression)
  - \`messaging-providers\` — Telegram + Discord L7 proxy chain (#1081)
- \`full\` — install → onboard → inference → CLI (destroys sandbox on
exit)
- Added to \`notify-on-failure\` and \`report-to-pr\` \`needs:\` lists
- Added to header comment + \`workflow_dispatch\` valid-jobs list
- \`fail-fast: false\` so one suite's failure doesn't cancel the others

## Launchable coverage

\`use_launchable: true\` (default) — the test harness uses
\`scripts/brev-launchable-ci-cpu.sh\` as the Brev startup script. The
cloud instance comes up pre-baked with Docker, Node 22, OpenShell CLI,
and the NemoClaw repo; the E2E harness then rsyncs the target branch
over the launchable's clone and runs the suite.

This is distinct from the existing \`launchable-smoke-e2e\` job, which
validates the launchable *script* on \`ubuntu-latest\` without any Brev
cloud involvement.

## Cost

~\$0.10 per Brev CPU instance × 3 suites × 30 nights ≈ **~\$9/month**.

## Validation plan (before merge)

1. **Fork validation** — paste long-lived \`BREV_API_TOKEN\` +
\`NVIDIA_API_KEY\` into \`jyaunches/NemoClaw\` repo secrets.
2. Dispatch \`nightly-e2e.yaml\` on the fork with \`jobs: brev-e2e\` to
confirm the matrix works end-to-end on cheap Brev credits.
3. Rotate the long-lived token into \`NVIDIA/NemoClaw\` repo secrets.
4. Merge and monitor the first ~3 scheduled nightlies for token validity
+ flake rate.
5. Follow-up: remove the \`jyaunches/NemoClaw\` clause from the repo
gate once stable.

## Related

- #1638 — prior silent-skip footgun that a long-lived token now prevents
- #1328 — Flavor 2 GPU launchable (orthogonal — this PR is CPU only)
- #3263 — re-enable \`gpu-e2e\` (separate issue, not bundled here)

---

## Update (2026-05-11): switch to long-lived `--api-key` auth

Brev CI/CD team issued a proper long-lived API key. Commit `07687a6ad`
swaps the auth mechanism:

- Brev CLI pinned **v0.6.322 → v0.6.324** (first release exposing `brev
login --api-key` / `--org-id` flags)
- Secret **`BREV_API_TOKEN` → `BREV_API_KEY`** (old name was misleading
— `refresh_token` ≠ `api_key`) + new **`BREV_ORG_ID`** secret
- Workflow auth step now calls `brev login --api-key "$BREV_API_KEY"
--org-id "$BREV_ORG_ID"` instead of hand-writing
`~/.brev/credentials.json` with a refresh_token (original workaround for
brev login removal in #1470 is now obsolete)
- `workflow_dispatch` override input `brev_token` → `brev_api_key` + new
`brev_org_id`

No test-harness changes required — the CLI still populates
`credentials.json` at the same path, just with `api_key` +
`api_key_org_id` fields alongside the existing `access_token` /
`refresh_token`.

### Secret checklist (updated)

| Secret | jyaunches/NemoClaw (fork) | NVIDIA/NemoClaw (upstream) |
|---|---|---|
| `BREV_API_KEY` | ✅ set | ⏳ pending |
| `BREV_ORG_ID` | ✅ set | ⏳ pending |
| `NVIDIA_API_KEY` | ✅ set | ✅ set |
| ~~`BREV_API_TOKEN`~~ | obsolete (safe to delete) | obsolete (safe to
delete after merge) |


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
- Switched CI auth to required long‑lived API key and org ID; pinned CLI
and hardened auth verification with retries.
- Improved workflow invocation and branch resolution logic; added inputs
to toggle/use published launchable and specify its ID.
- Restricted workflow execution scope to upstream repo and specific CI
fork.

* **Tests**
  - Added nightly Brev E2E job.
- Enabled published-launchable provisioning with a startup‑script
fallback.
- Improved instance readiness checks and error handling; preserved
teardown.

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3350)
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added area: install Install, setup, prerequisites, or uninstall flow area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow bug-fix PR fixes a bug or regression and removed Getting Started labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: install Install, setup, prerequisites, or uninstall flow area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow bug-fix PR fixes a bug or regression security Potential vulnerability, unsafe behavior, or access risk

Projects

None yet

6 participants