fix(rebuild): preserve disabledChannels across destroy/recreate#3532
Conversation
When TC-DEPLOY-01a (URL never surfaces) or TC-DEPLOY-01b (URL serves non-200) fails, CI only logs the bare assertion. That makes it impossible to tell from artifacts whether cloudflared crashed, the local dashboard was unreachable, or Cloudflare's edge returned 502. Refs issue NVIDIA#3494. Dump on failure: cloudflared.log tail, pid file + ps state, full nemoclaw status, and a verbose origin curl against the dashboard port parsed from status output. All steps are best-effort so missing artifacts never break the rest of the dump. Signed-off-by: San Dang <sdang@nvidia.com>
Lets maintainers run this single E2E job against an arbitrary branch or SHA without dispatching the entire nightly-e2e matrix. Useful for iterating on the new tunnel diagnostics dump. Refs issue NVIDIA#3494. Signed-off-by: San Dang <sdang@nvidia.com>
Drop the `if: github.repository == 'NVIDIA/NemoClaw'` gate so the manual-dispatch workflow can run on forks for branch validation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`rebuildSandbox` deleted the registry entry before invoking
`onboard --resume`, so the `disabledChannels` filter in `createSandbox`
read back an empty list and the stopped channel was rewired into the
rebuilt image. Mirror the paused set through the onboard session — the
same pattern `messagingChannels` already uses to survive the
destroy/recreate window.
- Add `disabledChannels` to the Session type with full round-trip
support (createSession, normalizeSession, filterSafeUpdates).
- Snapshot the pre-destroy `disabledChannels` and stash it into the
session alongside `messagingChannels` before invoking
`onboard --resume`.
- In `createSandbox`, prefer `session.disabledChannels` over the
(now-empty) registry copy. Fresh-onboard path still reads from
the registry via the `??` fallback.
Adds `test/e2e/test-channels-stop-start.sh` covering Test 1 from
issue NVIDIA#3462 (telegram-only). Phase 4 contains the load-bearing
regression check: openclaw.json must exclude `telegram` after
`channels stop` + rebuild. Wired into `nightly-e2e.yaml` and the
three downstream aggregator jobs.
Fixes NVIDIA#3453.
Refs NVIDIA#3381, NVIDIA#3462.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: San Dang <sdang@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (9)
🚧 Files skipped from review as they are similar to previous changes (7)
📝 WalkthroughWalkthroughPreserves sandbox ChangesDisabled-channels preservation, onboarding, rebuild, tests, E2E, and CI
Sequence DiagramsequenceDiagram
participant Dev
participant Onboard
participant ChannelState
participant Session
participant Registry
participant Rebuild
participant Sandbox_openclaw
Dev->>Onboard: createSandbox(sandboxName)
Onboard->>ChannelState: resolveDisabledChannels(sandboxName)
ChannelState->>Session: read disabledChannels from session mirror
alt session has array (including empty)
ChannelState-->>Onboard: return session disabledChannels
else session missing or null
ChannelState->>Registry: getRegistryDisabledChannels(sandboxName)
Registry-->>ChannelState: return registry disabledChannels
ChannelState-->>Onboard: return registry value
end
Dev->>Rebuild: rebuildSandbox(sandboxName)
Rebuild->>Registry: read sb.disabledChannels and remove sandbox entry
Rebuild->>Rebuild: snapshot and filter to rebuildDisabledChannels
Rebuild->>Session: updateSession with rebuildDisabledChannels
Rebuild->>Registry: preserveFields update with rebuildDisabledChannels
Rebuild->>Sandbox_openclaw: recreate sandbox using preserved state
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
The first cut of the session-stash fix used a `registry ?? session ?? []` fallback chain to compute `rebuildDisabledChannels`. That works for stop (registry has data → wins) but breaks for start: after `channels start X` clears `registry.disabledChannels`, the chain falls through to the session value from the *prior* stop+rebuild and re-stashes ["X"] back into the session. The next `onboard --resume` then reads the stale session value, filters X out of the new image, and the bridge stays dormant — the exact opposite of what `channels start` is supposed to do. `sb` is loaded fresh from the registry at the top of rebuildSandbox, before any destroy, so it's authoritative on every invocation. Drop the session fallback and re-stash from `sb` directly. The session mirror is downstream of the registry; carrying a value from a prior cycle through it would always be wrong. Caught by the new e2e test's Phase 6 (C6a/C6b/C6c failed before this fix, pass after). Also drops the dead INSTALL_OK variable from test/e2e/test-channels-stop-start.sh that shellcheck flagged (SC2034). Cargo-culted from test-token-rotation.sh where it gates skip-phases; this script exits via print_summary immediately on install failure instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: San Dang <sdang@nvidia.com>
`INSTALL_OK` was removed from test-channels-stop-start.sh; line numbers for every assertion in that file shifted by 1. Regenerated inventory via `scripts/e2e/extract-legacy-assertions.ts`. No assertion text changed, so `parity-map.yaml` is unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: San Dang <sdang@nvidia.com>
Signed-off-by: San Dang <sdang@nvidia.com>
Signed-off-by: San Dang <sdang@nvidia.com>
Signed-off-by: San Dang <sdang@nvidia.com>
Signed-off-by: San Dang <sdang@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 379-385: The workflow adds a new dispatchable job
channels-stop-start-e2e but the workflow_dispatch.inputs.jobs valid-list wasn't
updated; update the inputs.jobs enumeration (the list that declares valid job
IDs for selective dispatch) to include "channels-stop-start-e2e" so the
contains() check and selective-dispatch contract remain in sync with the new job
id.
- Around line 379-412: Add a path_instructions mapping for the new GitHub
Actions job channels-stop-start-e2e in .coderabbit.yaml so the advisory check
(test/validate-e2e-coverage.test.ts) can find which source files this E2E job
covers; update the path_instructions section to include a key matching
channels-stop-start-e2e and list the relevant source globs such as
src/lib/messaging/**, src/agents/** (or any channel/telegram-related plugins and
configs) that this test exercises so reviewers and the validation test can
correlate changes to the correct E2E job.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 426a00fd-a536-4818-a09e-3f8e37a4f4df
📒 Files selected for processing (11)
.github/workflows/nightly-e2e.yamlsrc/lib/actions/inference-set.test.tssrc/lib/actions/sandbox/rebuild.tssrc/lib/onboard.tssrc/lib/onboard/channel-state.test.tssrc/lib/onboard/channel-state.tssrc/lib/state/onboard-session.test.tssrc/lib/state/onboard-session.tstest/e2e/docs/parity-inventory.generated.jsontest/e2e/docs/parity-map.yamltest/e2e/test-channels-stop-start.sh
# Conflicts: # src/lib/actions/sandbox/rebuild.ts
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/actions/sandbox/rebuild.ts (1)
487-487:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPersist
disabledChannelsbefore deleting the registry entry.
rebuildDisabledChannelsis derived from pre-delete state, but the session mirror is not written until after Line 487. If the process dies in that gap, a lateronboard --resumewill still read the old session and re-enable previously stopped channels. Please move thedisabledChannelssession write ahead of the destructive delete, or add a small pre-delete session update for this field.Also applies to: 529-547
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/actions/sandbox/rebuild.ts` at line 487, The code currently calls removeSandboxRegistryEntry(sandboxName) before persisting the updated disabled channels, which risks losing the pre-delete disabledChannels if the process dies; change the order so the session mirror is updated with rebuildDisabledChannels (or perform a small pre-delete session update for the disabledChannels field) prior to calling removeSandboxRegistryEntry(sandboxName). Locate the rebuildDisabledChannels variable/logic and the session write code that writes the session mirror in this file (and the similar block around the 529-547 area) and ensure you persist disabledChannels to the session store first, then perform the destructive registry delete.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/lib/actions/sandbox/rebuild.ts`:
- Line 487: The code currently calls removeSandboxRegistryEntry(sandboxName)
before persisting the updated disabled channels, which risks losing the
pre-delete disabledChannels if the process dies; change the order so the session
mirror is updated with rebuildDisabledChannels (or perform a small pre-delete
session update for the disabledChannels field) prior to calling
removeSandboxRegistryEntry(sandboxName). Locate the rebuildDisabledChannels
variable/logic and the session write code that writes the session mirror in this
file (and the similar block around the 529-547 area) and ensure you persist
disabledChannels to the session store first, then perform the destructive
registry delete.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b1385b12-0ff6-40ce-ab98-50c11f036dd8
📒 Files selected for processing (3)
src/lib/actions/sandbox/rebuild.tssrc/lib/onboard.tstest/e2e/docs/parity-inventory.generated.json
✅ Files skipped from review due to trivial changes (1)
- test/e2e/docs/parity-inventory.generated.json
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard.ts
<!-- markdownlint-disable MD041 --> ## Summary Add Test 2 from #3462 — end-to-end coverage for the `onboard empty → channels add → channels remove` lifecycle. The test asserts against the baked sandbox image (`openclaw.json`) and the gateway (provider attachment, policy presets) so regressions cannot hide behind a correct host-side registry. Companion to Test 1 (`test-channels-stop-start.sh`) shipped via #3532. ## Related Issue Closes #3462 ## Changes - New script `test/e2e/test-channels-add-remove.sh` covering Test 2 spec from #3462: - **Phase 1**: install + onboard with no messaging channel; verifies baseline is clean (no bridge provider, no `telegram` in `openclaw.json`, no telegram preset applied). - **Phase 3 + 4**: `channels add telegram` + rebuild; asserts the matching preset auto-applied (#3437 regression gate), bridge provider exists, `openclaw.json` contains the channel, and the L7 proxy lets bridge-style traffic through to `api.telegram.org`. - **Phase 5 + 6**: `channels remove telegram` + rebuild; asserts the bridge is gone from `openclaw.json`, provider deleted from the gateway, and the matching preset un-applied (#3671 regression gate for the detach+delete + symmetric cleanup fix shipped in #3672). - Egress probe (`telegram_egress_open`) uses `node -e fetch` against a `/bot*/...` path so it matches the telegram preset's binary + path scoping. A naïve `curl /` probe would trip both guards and produce a misleading 403, hiding real regressions. - Helper `print_policy_list` snapshots the gateway's policy-list output before every `policy_list_has_preset` assertion so the test transcript records the actual gateway state alongside each pass/fail line. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification <!-- Check each item you ran and confirmed. Leave unchecked items you skipped. Doc-only changes do not require npm test unless you ran it. --> - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [ ] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Hung Le <hple@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added E2E test covering full add/remove lifecycle for the Telegram messaging channel with environment prechecks, sandbox setup/teardown, baseline absence checks, add+rebuild and remove+rebuild verification, policy/preset/config assertions, and live egress probing through the L7 proxy. * **Documentation** * Added parity-map and generated test documentation entries describing the new Telegram add/remove scenario and its assertions. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3696?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
channels stop <ch>followed by a rebuild left the channel running: the rebuild path destroyed the registry entry beforeonboard --resumeread backdisabledChannels, so the filter that should have excluded the stopped channel from the new image was a silent no-op. This isthe complementary half of #3395 — that PR fixed the same destroy/recreate gap on the write side (preserving
messagingChannelssochannels startcould recover, #3381); this one fixes it on the read side (honoringdisabledChannelssochannels stopactuallydisables, #3453). After both fixes the registry, gateway, and image agree at rest.
Related Issue
Fixes #3453. Refs #3381, #3395, #3462.
Changes
src/lib/state/onboard-session.ts— adddisabledChannels: string[] | nulltoSession,SessionUpdates,createSession,normalizeSession, andfilterSafeUpdates(mirrors the existingmessagingChannelsround-trip).src/lib/actions/sandbox/rebuild.ts— snapshotsb.disabledChannels(with session fallback for cross-process resumes) and stash it into the session inside the sameupdateSessionblock that already mirrorsmessagingChannels, beforeremoveSandboxRegistryEntrywipes theregistry.
src/lib/onboard.ts—createSandboxpreferssession.disabledChannelsoverregistry.getDisabledChannels(sandboxName); the registry fallback preserves the fresh-onboard path that fix(onboard): preserve disabled channels through rebuild so channels start can recover #3395 already exercises.src/lib/state/onboard-session.test.ts— 4 new cases covering round-trip, JSON filter, default-null on fresh sessions, andfilterSafeUpdatesaccepting array + explicitnullclear.src/lib/actions/inference-set.test.ts— fixture updated for the new requiredSessionfield.test/e2e/test-channels-stop-start.sh— new e2e regression test covering Test 1 of test(messaging): add E2E coverage for channels stop/start and channels add/remove rebuild flows #3462. 47 stablePASS:/FAIL:assertions across 6 phases; Phase 4 C4a is the load-bearing check for bug(messaging)channels stop <name>does not stop the bridge through rebuild —openclaw.jsonstill bakes the channel in #3453 (openclaw.jsonexcludestelegramafter stop+rebuild), Phase 6 C6 assertionsguard against regression of fix(onboard): preserve disabled channels through rebuild so channels start can recover #3395's [Messaging] Cannot start a channel after a stop #3381 fix.
test/e2e/docs/parity-map.yaml+parity-inventory.generated.json— entry for the new test, regenerated inventory..github/workflows/nightly-e2e.yaml— newchannels-stop-start-e2ejob (mirrorsmessaging-providers-e2e), wired intonotify-on-failure,report-to-pr, andscorecardaggregatorneeds:lists.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: San Dang sdang@nvidia.com
Summary by CodeRabbit
New Features
Tests
Chores