fix(channels): detach provider + un-apply preset on channels remove#3672
Conversation
…annels-remove-operation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughapplyChannelRemoveToGatewayAndRegistry detaches bridge providers before deletion and captures structured detach/delete failures; removeSandboxChannel un-applies a matching built-in policy preset when present; E2E tests and CI job expanded to exercise stop/start/remove lifecycles across agents and messaging channels; onboarding now returns [] instead of null for an explicit empty resume channel set. ChangesChannel Removal Flow Improvements
Possibly related issues
Possibly Related PRs
Suggested labels
Suggested Reviewers
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Selective E2E Results — ✅ All requested jobs passedRun: 26010869634
|
Selective E2E Results — ✅ All requested jobs passedRun: 26014243607
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/e2e/test-channels-stop-start.sh (1)
436-447: ⚡ Quick winReduce brittle coupling to human-readable CLI text in assertions.
C7a/C7b/C7f depend on exact output phrases; that can fail on wording changes even when behavior is correct. Prefer exit-code + state-based checks (you already have strong state validation in C7c–C8c).
Proposed hardening
-if [ "$remove_rc" -eq 0 ] && grep -q "Removed telegram bridge" /tmp/nc-remove.log; then +if [ "$remove_rc" -eq 0 ]; then pass "C7a: channels remove telegram completed on a live sandbox" else fail "C7a: channels remove telegram failed" tail -30 /tmp/nc-remove.log 2>/dev/null || true fi -if grep -q "Change queued.*remove 'telegram'" /tmp/nc-remove.log; then - pass "C7b: channels remove queued rebuild" -else - fail "C7b: channels remove did not queue a rebuild" -fi +pass "C7b: rebuild requirement validated by Phase 8 post-remove rebuild checks" if nemoclaw "$SANDBOX_NAME" policy-list >/tmp/nc-policy-list-after-remove.log 2>&1; then if grep -q "● telegram" /tmp/nc-policy-list-after-remove.log; then fail "C7f: telegram policy preset still active after channels remove" grep "telegram" /tmp/nc-policy-list-after-remove.log | head -5 || true - elif grep -q "telegram" /tmp/nc-policy-list-after-remove.log; then - pass "C7f: telegram policy preset no longer active after channels remove" else - fail "C7f: policy-list did not include telegram preset status" - tail -30 /tmp/nc-policy-list-after-remove.log 2>/dev/null || true + pass "C7f: telegram policy preset not active after channels remove" fi else fail "C7f: policy-list failed after channels remove" tail -30 /tmp/nc-policy-list-after-remove.log 2>/dev/null || true fiAlso applies to: 471-480
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/test-channels-stop-start.sh` around lines 436 - 447, Replace brittle greps against human-readable CLI text in the C7a/C7b assertions: instead of checking for exact log phrases in /tmp/nc-remove.log, assert success via the command exit code (remove_rc == 0) and verify the actual system state or API (e.g., confirm the telegram channel is absent via the same state-checks used in C7c–C8c or query the service to ensure the removal and that a rebuild task is queued) so update the test to use remove_rc and the existing state-validation functions/commands rather than grep "Removed telegram bridge" or "Change queued.*remove 'telegram'".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/e2e/test-channels-stop-start.sh`:
- Around line 436-447: Replace brittle greps against human-readable CLI text in
the C7a/C7b assertions: instead of checking for exact log phrases in
/tmp/nc-remove.log, assert success via the command exit code (remove_rc == 0)
and verify the actual system state or API (e.g., confirm the telegram channel is
absent via the same state-checks used in C7c–C8c or query the service to ensure
the removal and that a rebuild task is queued) so update the test to use
remove_rc and the existing state-validation functions/commands rather than grep
"Removed telegram bridge" or "Change queued.*remove 'telegram'".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 109560e1-fafb-4a00-9521-a304a071c6cb
📒 Files selected for processing (1)
test/e2e/test-channels-stop-start.sh
Selective E2E Results — ✅ All requested jobs passedRun: 26014257036
|
Selective E2E Results — ✅ All requested jobs passedRun: 26014800525
|
Selective E2E Results — ❌ Some jobs failedRun: 26015011736
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/e2e/test-channels-stop-start.sh (1)
237-266: 💤 Low valueConsider extracting repeated grep patterns into variables.
The
channel_presence()function contains multiple similar grep patterns (lines 246, 249, 252, 255). Extracting these as variables would improve maintainability.Example refactor
channel_presence() { local channel="$1" local out if [ "$ACTIVE_AGENT" = "openclaw" ]; then out=$(sandbox_exec "python3 -c 'import json,sys; d=json.load(open(\"/sandbox/.openclaw/openclaw.json\")); print(\"yes\" if sys.argv[1] in d.get(\"channels\", {}) else \"no\")' '$channel'" | tail -1) || true else - local probe + local probe_telegram='grep -Eq "^TELEGRAM_BOT_TOKEN=openshell:resolve:env:TELEGRAM_BOT_TOKEN$" /sandbox/.hermes/.env' + local probe_discord='grep -Eq "^DISCORD_BOT_TOKEN=openshell:resolve:env:DISCORD_BOT_TOKEN$" /sandbox/.hermes/.env' + local probe_wechat='grep -Eq "^WEIXIN_TOKEN=openshell:resolve:env:WECHAT_BOT_TOKEN$" /sandbox/.hermes/.env' + local probe_slack='grep -Eq "^SLACK_BOT_TOKEN=xoxb-OPENSHELL-RESOLVE-ENV-SLACK_BOT_TOKEN$" /sandbox/.hermes/.env && grep -Eq "^SLACK_APP_TOKEN=xapp-OPENSHELL-RESOLVE-ENV-SLACK_APP_TOKEN$" /sandbox/.hermes/.env' + local probe case "$channel" in - telegram) - probe='grep -Eq "^TELEGRAM_BOT_TOKEN=openshell:resolve:env:TELEGRAM_BOT_TOKEN$" /sandbox/.hermes/.env' - ;; + telegram) probe="$probe_telegram" ;; + discord) probe="$probe_discord" ;; + wechat) probe="$probe_wechat" ;; + slack) probe="$probe_slack" ;; - discord) - probe='grep -Eq "^DISCORD_BOT_TOKEN=openshell:resolve:env:DISCORD_BOT_TOKEN$" /sandbox/.hermes/.env' - ;; - wechat) - probe='grep -Eq "^WEIXIN_TOKEN=openshell:resolve:env:WECHAT_BOT_TOKEN$" /sandbox/.hermes/.env' - ;; - slack) - probe='grep -Eq "^SLACK_BOT_TOKEN=xoxb-OPENSHELL-RESOLVE-ENV-SLACK_BOT_TOKEN$" /sandbox/.hermes/.env && grep -Eq "^SLACK_APP_TOKEN=xapp-OPENSHELL-RESOLVE-ENV-SLACK_APP_TOKEN$" /sandbox/.hermes/.env' - ;; esac out=$(sandbox_exec "if [ -r /sandbox/.hermes/.env ]; then if ${probe}; then echo yes; else echo no; fi; else echo missing; fi" | tail -1) || true fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/test-channels-stop-start.sh` around lines 237 - 266, The channel_presence() function repeats nearly identical grep patterns for each channel; refactor by defining reusable pattern variables (e.g., base_env="/sandbox/.hermes/.env", telegram_pat, discord_pat, wechat_pat, slack_pat) or a single template and then assign probe to that variable inside each case (cases: telegram, discord, wechat, slack) so sandbox_exec uses the probe variable as before; update the slack case to combine the two slack patterns into a single probe variable as well and keep the remaining logic (sandbox_exec call and out handling) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/test-channels-stop-start.sh`:
- Around line 64-73: Remove the duplicate helper functions pass_msg() and
fail_msg() and consolidate to use the existing pass() and fail() helpers
instead: delete the pass_msg and fail_msg function definitions (which are exact
copies of pass and fail) and replace all calls to pass_msg with pass and
fail_msg with fail throughout the script (e.g., the occurrences noted around
lines 291, 294, 297, 310, 313, 316). Ensure the PASS/FAIL/TOTAL counters and
formatting remain handled by the original pass() and fail() functions.
---
Nitpick comments:
In `@test/e2e/test-channels-stop-start.sh`:
- Around line 237-266: The channel_presence() function repeats nearly identical
grep patterns for each channel; refactor by defining reusable pattern variables
(e.g., base_env="/sandbox/.hermes/.env", telegram_pat, discord_pat, wechat_pat,
slack_pat) or a single template and then assign probe to that variable inside
each case (cases: telegram, discord, wechat, slack) so sandbox_exec uses the
probe variable as before; update the slack case to combine the two slack
patterns into a single probe variable as well and keep the remaining logic
(sandbox_exec call and out handling) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7817fe34-032c-4ab6-8077-4b740245b1ed
📒 Files selected for processing (6)
.github/workflows/nightly-e2e.yamlsrc/lib/onboard/messaging-reuse.test.tssrc/lib/onboard/messaging-reuse.tstest/e2e/docs/parity-inventory.generated.jsontest/e2e/docs/parity-map.yamltest/e2e/test-channels-stop-start.sh
| pass_msg() { | ||
| ((PASS++)) | ||
| ((TOTAL++)) | ||
| printf '\033[32m PASS: %s\033[0m\n' "$1" | ||
| } | ||
| fail_msg() { | ||
| ((FAIL++)) | ||
| ((TOTAL++)) | ||
| printf '\033[31m FAIL: %s\033[0m\n' "$1" | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Eliminate duplicate helper functions.
pass_msg() and fail_msg() (lines 64–73) are exact duplicates of pass() and fail() (lines 44–53). This violates DRY and creates maintenance overhead.
♻️ Refactor to remove duplicates
Remove the duplicate functions and use the original pass() and fail() throughout the script:
-pass_msg() {
- ((PASS++))
- ((TOTAL++))
- printf '\033[32m PASS: %s\033[0m\n' "$1"
-}
-fail_msg() {
- ((FAIL++))
- ((TOTAL++))
- printf '\033[31m FAIL: %s\033[0m\n' "$1"
-}
-Then replace all calls to pass_msg with pass and fail_msg with fail throughout the file (lines 291, 294, 297, 310, 313, 316, etc.).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/test-channels-stop-start.sh` around lines 64 - 73, Remove the
duplicate helper functions pass_msg() and fail_msg() and consolidate to use the
existing pass() and fail() helpers instead: delete the pass_msg and fail_msg
function definitions (which are exact copies of pass and fail) and replace all
calls to pass_msg with pass and fail_msg with fail throughout the script (e.g.,
the occurrences noted around lines 291, 294, 297, 310, 313, 316). Ensure the
PASS/FAIL/TOTAL counters and formatting remain handled by the original pass()
and fail() functions.
Selective E2E Results — ❌ Some jobs failedRun: 26019980084
|
Selective E2E Results — ❌ Some jobs failedRun: 26020026934
|
Selective E2E Results — ❌ Some jobs failedRun: 26030365203
|
Selective E2E Results — ✅ All requested jobs passedRun: 26034610108
|
Selective E2E Results — ✅ All requested jobs passedRun: 26034693378
|
ericksoa
left a comment
There was a problem hiding this comment.
Approved after the focused channels lifecycle nightly passed on head f86f13d.
Validation: channels-stop-start-e2e passed in https://github.com/NVIDIA/NemoClaw/actions/runs/26034610108. The job covered both OpenClaw and Hermes across telegram, discord, wechat, and slack, including stop/start/remove plus rebuild verification.
## Summary Updates the NemoClaw documentation for the v0.0.45 release by summarizing the user-facing changes merged since v0.0.44 and bumping the docs version metadata. Refreshes generated user skills so agent-facing references match the source docs. ## Changes - Added v0.0.45 release notes covering onboarding recovery, local inference, channel cleanup, share mount diagnostics, uninstall cleanup, and security redaction updates. - Updated command and troubleshooting docs for sandbox name limits, GPU gateway reuse, DNS preflight behavior, channel removal cleanup, and share mount path validation. - Bumped docs version metadata to 0.0.45 and regenerated NemoClaw user skills from the docs. - Source summary: #3672 -> `docs/reference/commands.md`: documented channel removal detaching bridge providers and un-applying channel policy presets. - Source summary: #3678 -> `docs/about/release-notes.md`: documented Ollama streamed usage accounting in the release notes. - Source summary: #3670 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`: documented safe GPU gateway replacement behavior. - Source summary: #3664 -> `docs/about/release-notes.md`: documented blueprint permission normalization in the release notes. - Source summary: #3181 -> `docs/reference/troubleshooting.md`: documented GPU toolkit guidance when host drivers work but passthrough is disabled. - Source summary: #3554 -> `docs/about/release-notes.md`: documented host `openshell-gateway` cleanup during uninstall. - Source summary: #3651 -> `docs/reference/troubleshooting.md`: documented the uncached `.invalid` DNS preflight probe. - Source summary: #3643 -> `docs/reference/commands.md`: included existing `NEMOCLAW_PROVIDER` interactive-mode behavior in generated docs. - Source summary: #3647 -> `docs/reference/commands.md`: documented remote sandbox path verification for `share mount`. - Source summary: #3646 -> `docs/reference/commands.md`: included existing local writable mount target guidance in generated docs. - Source summary: #3642 -> `docs/inference/use-local-inference.md`, `docs/reference/commands.md`: documented managed-vLLM model override and gated-model token checks. - Source summary: #3639 -> `docs/reference/commands.md`: documented the 63-character sandbox name limit. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Commit hooks passed for the staged files. A standalone `npx prek run --all-files` attempt was blocked by sandbox access to `/Users/miyoungc/.cache/prek/prek.log`, so that checkbox is left unchecked. --- <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Enhanced CLI command reference documentation with clearer guidance on onboarding, GPU passthrough, inference configuration, channel removal, and shared mounts. * Improved troubleshooting sections with better DNS resolution and GPU passthrough remediation steps. * Added documentation for overriding managed vLLM model selection. * Updated release notes for v0.0.45 reflecting infrastructure and workflow improvements. * **Version Bump** * Released v0.0.45. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3755?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- markdownlint-disable MD041 --> ## Summary Add Test 2 from #3462 — end-to-end coverage for the `onboard empty → channels add → channels remove` lifecycle. The test asserts against the baked sandbox image (`openclaw.json`) and the gateway (provider attachment, policy presets) so regressions cannot hide behind a correct host-side registry. Companion to Test 1 (`test-channels-stop-start.sh`) shipped via #3532. ## Related Issue Closes #3462 ## Changes - New script `test/e2e/test-channels-add-remove.sh` covering Test 2 spec from #3462: - **Phase 1**: install + onboard with no messaging channel; verifies baseline is clean (no bridge provider, no `telegram` in `openclaw.json`, no telegram preset applied). - **Phase 3 + 4**: `channels add telegram` + rebuild; asserts the matching preset auto-applied (#3437 regression gate), bridge provider exists, `openclaw.json` contains the channel, and the L7 proxy lets bridge-style traffic through to `api.telegram.org`. - **Phase 5 + 6**: `channels remove telegram` + rebuild; asserts the bridge is gone from `openclaw.json`, provider deleted from the gateway, and the matching preset un-applied (#3671 regression gate for the detach+delete + symmetric cleanup fix shipped in #3672). - Egress probe (`telegram_egress_open`) uses `node -e fetch` against a `/bot*/...` path so it matches the telegram preset's binary + path scoping. A naïve `curl /` probe would trip both guards and produce a misleading 403, hiding real regressions. - Helper `print_policy_list` snapshots the gateway's policy-list output before every `policy_list_has_preset` assertion so the test transcript records the actual gateway state alongside each pass/fail line. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification <!-- Check each item you ran and confirmed. Leave unchecked items you skipped. Doc-only changes do not require npm test unless you ran it. --> - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [ ] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Hung Le <hple@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added E2E test covering full add/remove lifecycle for the Telegram messaging channel with environment prechecks, sandbox setup/teardown, baseline absence checks, add+rebuild and remove+rebuild verification, policy/preset/config assertions, and live egress probing through the L7 proxy. * **Documentation** * Added parity-map and generated test documentation entries describing the new Telegram add/remove scenario and its assertions. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3696?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Make
nemoclaw <sandbox> channels remove <channel>a clean inverse ofchannels add. Two asymmetries surfaced together on the same path: the bridge provider delete fails withFailedPreconditionagainst any live sandbox because the code never detached the provider first, and the channel-named policy preset is never un-applied even thoughchannels addauto-applies it.Related Issue
Fixes #3671
Changes
applyChannelRemoveToGatewayAndRegistrynow callsopenshell sandbox provider detach <sandbox> <bridge>beforeopenshell provider delete, matching how openshell enforces attachment as a precondition for delete.NotFound/NotAttachedare treated as success-equivalent so the path stays idempotent across retries.removeChannelPresetIfPresentmirrors the existingapplyChannelPresetIfAvailable.removeSandboxChannelun-applies the channel-named policy preset after the bridge is gone, so the L7 proxy stops allow-listing its upstream API (defense-in-depth — bridge is gone, egress toapi.telegram.orgshould follow). Warns but does not abort if the un-apply fails; bridge teardown is already committed by then.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Hung Le hple@nvidia.com
Summary by CodeRabbit
Bug Fixes
Tests
Chores