fix: clean up Docker volumes after failed gateway start#337
fix: clean up Docker volumes after failed gateway start#337
Conversation
Closes #17. When openshell gateway start fails, it leaves behind a Docker volume (openshell-cluster-nemoclaw) in a corrupted state. Rerunning onboard hits "Corrupted cluster state" because openshell gateway destroy doesn't remove the volume. Extracts a destroyGateway() helper that removes both the gateway and its Docker volumes. Called before start (pre-cleanup), after start failure, and after health check failure. Adds the same volume cleanup to uninstall.sh and scripts/setup.sh.
Static analysis tests that verify Docker volume cleanup exists in all three code paths (onboard.js, uninstall.sh, setup.sh) and that destroyGateway() is called on pre-start, start failure, and health check failure. Will catch regressions in CI. Refs: #17
📝 WalkthroughWalkthroughAdds reusable gateway cleanup that destroys the named OpenShell gateway and removes related Docker volumes during gateway startup, start-failure, health-check failure, setup failure, and uninstall; adds a test verifying the presence of volume cleanup and updates preset tests to use Vitest assertions. Changes
Sequence Diagram(s)sequenceDiagram
participant Onboard as Onboard script (bin/lib/onboard.js)
participant GatewayCLI as "openshell gateway CLI"
participant Docker as Docker (volumes)
participant Health as Gateway Health Check
Onboard->>GatewayCLI: run destroyGateway() (pre-start cleanup)
GatewayCLI-->>Onboard: returns status
Onboard->>GatewayCLI: run "gateway start -g nemoclaw"
GatewayCLI-->>Onboard: startResult (status)
alt startResult.status != 0
Onboard->>GatewayCLI: run destroyGateway()
Onboard->>Docker: remove openshell-cluster-nemoclaw volume(s)
Onboard-->>User: "Stale state removed. Please rerun: nemoclaw onboard"
else start succeeded
Onboard->>Health: poll health (1..5)
alt health == Connected
Health-->>Onboard: Connected
else after retries fail
Onboard->>GatewayCLI: run destroyGateway()
Onboard->>Docker: remove openshell-cluster-nemoclaw volume(s)
Onboard-->>User: "Stale state removed. Please rerun: nemoclaw onboard"
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 125-130: The destroyGateway function currently calls run(...) with
xargs -r (GNU-only) and swallows all errors via { ignoreError: true } and ||
true; remove the -r flag so xargs is portable on BSD/macOS, and change
destroyGateway (and its two run invocations that reference GATEWAY_NAME) to
propagate success/failure (e.g., return boolean or throw) instead of always
ignoring errors so callers can detect cleanup failures and show the manual
recovery command; update call sites that assume cleanup always succeeded to
check destroyGateway's result and display the manual docker volume rm
instruction when it fails.
In `@scripts/setup.sh`:
- Around line 77-97: The script uses the GNU-only xargs -r flag in the docker
volume removal pipelines (the commands that start with "docker volume ls -q
--filter \"name=openshell-cluster-nemoclaw\" | xargs -r docker volume rm ...")
which breaks on macOS/BSD; remove the -r flag so the pipeline becomes portable,
keeping the surrounding stderr redirection and the existing cleanup flow that
calls openshell gateway destroy -g nemoclaw, docker volume rm, warn and fail
unchanged to preserve behavior when cleanup is needed.
- Around line 80-85: The current check uses the exit of grep on the piped output
to decide success; instead run the openshell gateway start command and capture
both its exit code and output (e.g., capture output into a variable or tee it to
a temp log), then use the actual exit code of openshell (saved into a variable
like start_status or via PIPESTATUS[0]) to decide whether to run the cleanup;
you may still grep the captured output for logging/diagnostics but ensure the
cleanup block (the openshell gateway destroy and docker volume rm calls) is
executed only when the openshell gateway start exit status indicates failure.
In `@test/gateway-cleanup.test.js`:
- Around line 17-62: Tests only check source text and call counts; add an
integration-style test that actually exercises a failed gateway
start/health-check and asserts the cleanup command runs. Create a new test that
launches the real onboarding process (bin/lib/onboard.js or the bin script) via
child_process.spawn or fork with an env var that forces startGateway to fail (or
put a fake docker binary early in PATH that records invocations), wait for the
failure path, and then assert that destroyGateway was invoked by checking
captured stdout/stderr for the docker volume removal command (or that the fake
docker binary was called with "volume rm openshell-cluster") and/or that
remove_related_docker_volumes / uninstall.sh cleanup logic was executed;
reference startGateway, destroyGateway, remove_related_docker_volumes,
onboard.js, uninstall.sh, and scripts/setup.sh so the test locates and exercises
the real cleanup code rather than just matching strings.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4efde2a0-67a9-4d4b-bfca-02f5d15477fe
📒 Files selected for processing (4)
bin/lib/onboard.jsscripts/setup.shtest/gateway-cleanup.test.jsuninstall.sh
| function destroyGateway() { | ||
| run(`openshell gateway destroy -g ${GATEWAY_NAME} 2>/dev/null || true`, { ignoreError: true }); | ||
| // openshell gateway destroy doesn't remove Docker volumes, which leaves | ||
| // corrupted cluster state that breaks the next gateway start. Clean them up. | ||
| run(`docker volume ls -q --filter "name=openshell-cluster-${GATEWAY_NAME}" | xargs -r docker volume rm 2>/dev/null || true`, { ignoreError: true }); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's read the entire onboard.js file to see the full context
wc -l bin/lib/onboard.jsRepository: NVIDIA/NemoClaw
Length of output: 81
🏁 Script executed:
# Read the file to see context around the flagged lines and downstream behavior
cat -n bin/lib/onboard.js | head -200Repository: NVIDIA/NemoClaw
Length of output: 9012
🏁 Script executed:
# Search for "Stale state removed" or similar messages that claim cleanup success
rg -i "stale|remove|clean" bin/lib/onboard.jsRepository: NVIDIA/NemoClaw
Length of output: 806
🌐 Web query:
xargs -r option BSD macOS compatibility GNU
💡 Result:
- GNU
xargs -r(--no-run-if-empty) means: don’t run the command if there are zero input items. (gnu.org) - BSD/macOS
/usr/bin/xargsdoes not support-r(you’ll getxargs: illegal option -- r) and generally already behaves like “no-run-if-empty”, so-ris unnecessary on macOS. (stackoverflow.com) - Portability advice (GNU + macOS):
- Usually: just drop
-rin scripts intended to run on macOS and GNU/Linux. - If you specifically need GNU
xargson macOS, install GNU findutils (e.g., Homebrew) and use the GNU-prefixed binary (commonlygxargs). (apple.stackexchange.com)
- Usually: just drop
Citations:
- 1: https://www.gnu.org/software/findutils/manual/find.pdf?utm_source=openai
- 2: https://stackoverflow.com/questions/8296710/how-to-ignore-xargs-commands-if-stdin-input-is-empty?utm_source=openai
- 3: https://apple.stackexchange.com/questions/286129/how-to-get-xargs-not-to-eat-double-spaces?utm_source=openai
Fix xargs portability and surface cleanup failures to the caller.
The xargs -r option (line 129) is GNU-specific and not available on BSD/macOS—it will fail with "illegal option -- r". Since all errors are silently suppressed with ignoreError: true and || true, Docker volumes left behind by the failed cleanup are never reported. Users on Colima/Docker Desktop see "Stale state removed" (lines 151, 165) even though volumes remain corrupted, causing the next gateway start to fail.
Remove the -r flag (BSD/macOS xargs already behaves like --no-run-if-empty by default) and let destroyGateway() return a success/failure signal so the caller can show the manual recovery command when cleanup fails.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bin/lib/onboard.js` around lines 125 - 130, The destroyGateway function
currently calls run(...) with xargs -r (GNU-only) and swallows all errors via {
ignoreError: true } and || true; remove the -r flag so xargs is portable on
BSD/macOS, and change destroyGateway (and its two run invocations that reference
GATEWAY_NAME) to propagate success/failure (e.g., return boolean or throw)
instead of always ignoring errors so callers can detect cleanup failures and
show the manual recovery command; update call sites that assume cleanup always
succeeded to check destroyGateway's result and display the manual docker volume
rm instruction when it fails.
| docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true | ||
| GATEWAY_ARGS=(--name nemoclaw) | ||
| command -v nvidia-smi > /dev/null 2>&1 && GATEWAY_ARGS+=(--gpu) | ||
| openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1 | grep -E "Gateway|✓|Error|error" || true | ||
| if ! openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1 | grep -E "Gateway|✓|Error|error"; then | ||
| warn "Gateway start failed. Cleaning up stale state..." | ||
| openshell gateway destroy -g nemoclaw > /dev/null 2>&1 || true | ||
| docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true | ||
| fail "Stale state removed. Please rerun: nemoclaw onboard" | ||
| fi | ||
|
|
||
| # Verify gateway is actually healthy (may need a moment after start) | ||
| for i in 1 2 3 4 5; do | ||
| if openshell status 2>&1 | grep -q "Connected"; then | ||
| break | ||
| fi | ||
| [ "$i" -eq 5 ] && fail "Gateway failed to start. Check 'openshell gateway info' and Docker logs." | ||
| if [ "$i" -eq 5 ]; then | ||
| warn "Gateway health check failed. Cleaning up stale state..." | ||
| openshell gateway destroy -g nemoclaw > /dev/null 2>&1 || true | ||
| docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true | ||
| fail "Stale state removed. Please rerun: nemoclaw onboard" | ||
| fi |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, find and examine the scripts/setup.sh file
find . -name "setup.sh" -path "*/scripts/*" 2>/dev/nullRepository: NVIDIA/NemoClaw
Length of output: 77
🏁 Script executed:
# Check the content at the specified lines
if [ -f "scripts/setup.sh" ]; then
echo "=== Lines 77-97 of scripts/setup.sh ===" && sed -n '77,97p' scripts/setup.sh
echo ""
echo "=== Full script context (first 100 lines) ===" && head -100 scripts/setup.sh | cat -n
fiRepository: NVIDIA/NemoClaw
Length of output: 5593
🏁 Script executed:
# Check for platform support documentation or shebang
if [ -f "scripts/setup.sh" ]; then
echo "=== First 30 lines ===" && head -30 scripts/setup.sh | cat -n
fiRepository: NVIDIA/NemoClaw
Length of output: 1265
🌐 Web query:
xargs -r flag macOS BSD portability GNU coreutils
💡 Result:
xargs -r/--no-run-if-emptyis a GNU extension (from GNU findutils’xargs), and GNUxargsotherwise runs the command once even with empty input. [1]- macOS ships BSD
xargs, which does not accept-r, and (usefully) it already avoids running the command when there’s no input, so-ris unnecessary on macOS. [2]
Portable patterns
- Most portable fix: just remove
-rwhen targeting macOS/BSD + GNU (BSD behavior already matches what you wanted). [2] - If you need GNU behavior/features consistently on macOS: install GNU findutils and use GNU
xargs(often asgxargs, or by prepending Homebrew’sgnubintoPATH). [3]
Sources
- GNU findutils manual (xargs
-rdescribed as GNU extension; default runs once with empty input). [1] - BSD/macOS behavior and lack of
-roption. [2] - Homebrew
findutilsformula (GNU find/xargs/locate +gnubinPATH hint). [3]
Replace xargs -r with a portable alternative for macOS/BSD support.
Lines 77, 83, and 95 use xargs -r, which is a GNU extension not available on macOS/BSD. Because stderr is redirected and the command is wrapped in || true, the script silently fails to remove the volume on Colima/Docker Desktop and still reports "Stale state removed." The script explicitly lists Colima and Docker Desktop as supported platforms (line 8).
Since BSD xargs already avoids running the command with empty input (matching the intended behavior of -r), remove the -r flag to make the cleanup portable across all documented platforms.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/setup.sh` around lines 77 - 97, The script uses the GNU-only xargs -r
flag in the docker volume removal pipelines (the commands that start with
"docker volume ls -q --filter \"name=openshell-cluster-nemoclaw\" | xargs -r
docker volume rm ...") which breaks on macOS/BSD; remove the -r flag so the
pipeline becomes portable, keeping the surrounding stderr redirection and the
existing cleanup flow that calls openshell gateway destroy -g nemoclaw, docker
volume rm, warn and fail unchanged to preserve behavior when cleanup is needed.
| if ! openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1 | grep -E "Gateway|✓|Error|error"; then | ||
| warn "Gateway start failed. Cleaning up stale state..." | ||
| openshell gateway destroy -g nemoclaw > /dev/null 2>&1 || true | ||
| docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true | ||
| fail "Stale state removed. Please rerun: nemoclaw onboard" | ||
| fi |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n scripts/setup.sh | head -100Repository: NVIDIA/NemoClaw
Length of output: 4399
🏁 Script executed:
cd /tmp && cat > test_pipefail.sh << 'EOF'
#!/bin/bash
set -euo pipefail
# Simulate gateway start that succeeds but has no keywords in output
test_case_1() {
echo "Test 1: Command succeeds (exit 0) but no keywords in output"
# Mock: openshell gateway start outputs something without the keywords
if ! (echo "Starting gateway..." && exit 0) 2>&1 | grep -E "Gateway|✓|Error|error"; then
echo " → Cleanup triggered (PROBLEM: success command treated as failure)"
else
echo " → No cleanup (correct)"
fi
}
# Simulate gateway start that fails
test_case_2() {
echo "Test 2: Command fails (exit 1) with keywords in output"
if ! (echo "Error: something" && exit 1) 2>&1 | grep -E "Gateway|✓|Error|error"; then
echo " → Cleanup triggered (correct)"
else
echo " → No cleanup (PROBLEM: failure not detected)"
fi
}
test_case_1
test_case_2
EOF
bash test_case_1.shRepository: NVIDIA/NemoClaw
Length of output: 108
🏁 Script executed:
bash << 'EOF'
set -euo pipefail
echo "=== Test 1: Command succeeds (exit 0) but output has no keywords ==="
if ! (echo "Starting gateway..." && exit 0) 2>&1 | grep -E "Gateway|✓|Error|error"; then
echo "Result: Cleanup TRIGGERED (bad - success treated as failure)"
else
echo "Result: No cleanup (good)"
fi
echo ""
echo "=== Test 2: Command fails (exit 1) but output HAS keywords ==="
if ! (echo "Error: something" && exit 1) 2>&1 | grep -E "Gateway|✓|Error|error"; then
echo "Result: Cleanup TRIGGERED (correct)"
else
echo "Result: No cleanup (bad - failure not detected)"
fi
EOFRepository: NVIDIA/NemoClaw
Length of output: 305
Gate gateway cleanup on the start command's exit code, not on log output matching.
With set -o pipefail, the current condition if ! ... | grep -E "Gateway|✓|Error|error" fails whenever grep finds no match—even if openshell gateway start exits 0. This causes a healthy gateway to be destroyed if the CLI's output format changes to omit those keywords. Capture the exit code separately and gate cleanup on that instead.
Suggested fix
-if ! openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1 | grep -E "Gateway|✓|Error|error"; then
+set +e
+START_OUTPUT=$(openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1)
+START_RC=$?
+set -e
+printf '%s\n' "$START_OUTPUT" | grep -E "Gateway|✓|Error|error" || true
+if [ "$START_RC" -ne 0 ]; then
warn "Gateway start failed. Cleaning up stale state..."
openshell gateway destroy -g nemoclaw > /dev/null 2>&1 || true
docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true
fail "Stale state removed. Please rerun: nemoclaw onboard"
fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if ! openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1 | grep -E "Gateway|✓|Error|error"; then | |
| warn "Gateway start failed. Cleaning up stale state..." | |
| openshell gateway destroy -g nemoclaw > /dev/null 2>&1 || true | |
| docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true | |
| fail "Stale state removed. Please rerun: nemoclaw onboard" | |
| fi | |
| set +e | |
| START_OUTPUT=$(openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1) | |
| START_RC=$? | |
| set -e | |
| printf '%s\n' "$START_OUTPUT" | grep -E "Gateway|✓|Error|error" || true | |
| if [ "$START_RC" -ne 0 ]; then | |
| warn "Gateway start failed. Cleaning up stale state..." | |
| openshell gateway destroy -g nemoclaw > /dev/null 2>&1 || true | |
| docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true | |
| fail "Stale state removed. Please rerun: nemoclaw onboard" | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/setup.sh` around lines 80 - 85, The current check uses the exit of
grep on the piped output to decide success; instead run the openshell gateway
start command and capture both its exit code and output (e.g., capture output
into a variable or tee it to a temp log), then use the actual exit code of
openshell (saved into a variable like start_status or via PIPESTATUS[0]) to
decide whether to run the cleanup; you may still grep the captured output for
logging/diagnostics but ensure the cleanup block (the openshell gateway destroy
and docker volume rm calls) is executed only when the openshell gateway start
exit status indicates failure.
| describe("gateway cleanup: Docker volumes removed on failure (#17)", () => { | ||
| it("onboard.js: destroyGateway() removes Docker volumes", () => { | ||
| const content = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8"); | ||
| assert.ok( | ||
| content.includes("docker volume") && content.includes("openshell-cluster"), | ||
| "onboard.js must contain Docker volume cleanup for openshell-cluster volumes" | ||
| ); | ||
| }); | ||
|
|
||
| it("onboard.js: volume cleanup runs on gateway start failure", () => { | ||
| const content = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8"); | ||
| // The startGateway function should call destroyGateway after a failed start | ||
| const startGwBlock = content.match(/async function startGateway[\s\S]*?^}/m); | ||
| assert.ok(startGwBlock, "Could not find startGateway function"); | ||
|
|
||
| // Count calls to destroyGateway — should be at least 3: | ||
| // 1. pre-cleanup before start | ||
| // 2. after start failure | ||
| // 3. after health check failure | ||
| const calls = (startGwBlock[0].match(/destroyGateway\(\)/g) || []).length; | ||
| assert.ok( | ||
| calls >= 3, | ||
| `startGateway must call destroyGateway() at least 3 times (pre-start, start failure, health failure), found ${calls}` | ||
| ); | ||
| }); | ||
|
|
||
| it("uninstall.sh: includes Docker volume cleanup", () => { | ||
| const content = fs.readFileSync(path.join(ROOT, "uninstall.sh"), "utf-8"); | ||
| assert.ok( | ||
| content.includes("docker volume") && content.includes("openshell-cluster"), | ||
| "uninstall.sh must remove openshell-cluster Docker volumes" | ||
| ); | ||
| assert.ok( | ||
| content.includes("remove_related_docker_volumes"), | ||
| "uninstall.sh must define and call remove_related_docker_volumes()" | ||
| ); | ||
| }); | ||
|
|
||
| it("setup.sh: includes Docker volume cleanup on failure", () => { | ||
| const content = fs.readFileSync(path.join(ROOT, "scripts/setup.sh"), "utf-8"); | ||
| assert.ok( | ||
| content.includes("docker volume") && content.includes("openshell-cluster"), | ||
| "setup.sh must remove openshell-cluster Docker volumes on failure" | ||
| ); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
These tests won't catch runtime cleanup regressions.
All four cases only read source files and assert on strings or call counts. That won't prove rerunning onboarding after an intentionally failed start works, and CI will still pass if destroyGateway() becomes a no-op, if the shell cleanup is non-portable, or if the failure path prints the wrong recovery command. Please add at least one behavior-level test that exercises a failed start/health-check path and asserts the cleanup command is actually invoked.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/gateway-cleanup.test.js` around lines 17 - 62, Tests only check source
text and call counts; add an integration-style test that actually exercises a
failed gateway start/health-check and asserts the cleanup command runs. Create a
new test that launches the real onboarding process (bin/lib/onboard.js or the
bin script) via child_process.spawn or fork with an env var that forces
startGateway to fail (or put a fake docker binary early in PATH that records
invocations), wait for the failure path, and then assert that destroyGateway was
invoked by checking captured stdout/stderr for the docker volume removal command
(or that the fake docker binary was called with "volume rm openshell-cluster")
and/or that remove_related_docker_volumes / uninstall.sh cleanup logic was
executed; reference startGateway, destroyGateway, remove_related_docker_volumes,
onboard.js, uninstall.sh, and scripts/setup.sh so the test locates and exercises
the real cleanup code rather than just matching strings.
cv
left a comment
There was a problem hiding this comment.
LGTM — clean fix for the corrupted volume problem. destroyGateway() extracts the cleanup well, all failure paths covered, and the regression tests verify it.
cv
left a comment
There was a problem hiding this comment.
Consistent fix across all three code paths, tests verify the right invariants.
xargs -r is not portable to macOS
The volume cleanup command uses xargs -r (GNU extension — don't run if stdin is empty). macOS xargs doesn't support -r. On macOS with no matching volumes, xargs will call docker volume rm with no arguments, which prints a usage error. Harmless due to || true, but noisy. This applies to onboard.js, setup.sh, and uninstall.sh — all three run on macOS.
Options: pipe through grep . first, use an if guard, or use xargs without -r and accept the usage noise (already suppressed by 2>/dev/null || true).
Minor: possible double-cleanup in uninstall.sh
The new explicit calls to remove_related_docker_containers, remove_related_docker_volumes, and remove_related_docker_images (lines 570-578) run immediately after remove_docker_resources (line 568). If remove_docker_resources already handles some of these, this results in redundant cleanup attempts. Not harmful, but worth verifying there's no overlap.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
scripts/setup.sh (2)
115-120:⚠️ Potential issue | 🟠 MajorGate cleanup on
openshell gateway startexit status, not grep match.Line 115 currently ties failure handling to
grepoutput. Withset -o pipefail, a successful start can still enter cleanup if output stops matching"Gateway|✓|Error|error".Suggested patch
-if ! openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1 | grep -E "Gateway|✓|Error|error"; then +set +e +START_OUTPUT="$(openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1)" +START_RC=$? +set -e +printf '%s\n' "$START_OUTPUT" | grep -E "Gateway|✓|Error|error" || true +if [ "$START_RC" -ne 0 ]; then warn "Gateway start failed. Cleaning up stale state..." openshell gateway destroy -g nemoclaw >/dev/null 2>&1 || true - docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true + docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs docker volume rm 2>/dev/null || true fail "Stale state removed. Please rerun: nemoclaw onboard" fi#!/usr/bin/env bash set -euo pipefail echo "Case 1: command succeeds, grep misses => currently treated as failure" if ! (echo "Starting gateway..." && exit 0) 2>&1 | grep -E "Gateway|✓|Error|error"; then echo "cleanup branch entered" else echo "no cleanup" fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/setup.sh` around lines 115 - 120, The cleanup branch currently triggers based on the grep match rather than the actual exit status of the starts command; change the logic so you run openshell gateway start "${GATEWAY_ARGS[@]}" and check its exit code (e.g., if ! openshell gateway start ...; then cleanup) instead of relying on the pipe to grep for "Gateway|✓|Error|error". If you still need to inspect stdout/stderr for error patterns, capture the command output into a variable (or a temp file), examine that with grep to log or decide additional actions, but always use the process exit status of openshell gateway start as the primary condition for running the cleanup steps that reference openshell gateway destroy and docker volume rm.
112-112:⚠️ Potential issue | 🟠 MajorUse portable
xargsinvocation for volume cleanup.Line 112, Line 118, and Line 130 use GNU-only
xargs -r. On macOS/BSD this can fail silently here (stderr redirected +|| true), leaving stale volumes behind.Suggested patch
-docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true +docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs docker volume rm 2>/dev/null || trueIs `xargs -r` supported on macOS/BSD `xargs`, and is removing `-r` the portable fix when input may be empty?Also applies to: 118-118, 130-130
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/setup.sh` at line 112, The xargs -r usage in the docker volume cleanup commands (the lines invoking "docker volume ls -q --filter \"name=openshell-cluster-nemoclaw\" | xargs -r docker volume rm") is GNU-only and not portable to macOS/BSD; replace the pipeline with a portability-safe check: capture the output of docker volume ls -q --filter "...", test if it's non-empty (e.g. ids="$(docker volume ls -q --filter 'name=...')" and if [ -n "$ids" ]; then echo "$ids" | xargs docker volume rm; fi) for each occurrence (the three lines that currently use xargs -r) so you avoid relying on -r and don't attempt to remove when there are no volumes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/setup.sh`:
- Around line 117-119: The script currently runs best-effort cleanup commands
(openshell gateway destroy, docker volume rm via docker volume ls | xargs) with
"|| true" then unconditionally calls fail("Stale state removed. Please rerun:
nemoclaw onboard"), which falsely claims success; update the cleanup blocks
around the openshell gateway destroy / docker volume ls | xargs sequences and
the duplicate block later (same pattern) to detect failures instead of
swallowing them: capture each command's exit status (or run them without "||
true"), aggregate any failures, and only emit a success message if all cleanup
steps succeeded — otherwise print a clear "Attempted to remove stale state;
manual recovery steps:" message listing the exact commands that failed (or the
full commands to run manually) and avoid calling fail with the misleading "Stale
state removed..." text.
---
Duplicate comments:
In `@scripts/setup.sh`:
- Around line 115-120: The cleanup branch currently triggers based on the grep
match rather than the actual exit status of the starts command; change the logic
so you run openshell gateway start "${GATEWAY_ARGS[@]}" and check its exit code
(e.g., if ! openshell gateway start ...; then cleanup) instead of relying on the
pipe to grep for "Gateway|✓|Error|error". If you still need to inspect
stdout/stderr for error patterns, capture the command output into a variable (or
a temp file), examine that with grep to log or decide additional actions, but
always use the process exit status of openshell gateway start as the primary
condition for running the cleanup steps that reference openshell gateway destroy
and docker volume rm.
- Line 112: The xargs -r usage in the docker volume cleanup commands (the lines
invoking "docker volume ls -q --filter \"name=openshell-cluster-nemoclaw\" |
xargs -r docker volume rm") is GNU-only and not portable to macOS/BSD; replace
the pipeline with a portability-safe check: capture the output of docker volume
ls -q --filter "...", test if it's non-empty (e.g. ids="$(docker volume ls -q
--filter 'name=...')" and if [ -n "$ids" ]; then echo "$ids" | xargs docker
volume rm; fi) for each occurrence (the three lines that currently use xargs -r)
so you avoid relying on -r and don't attempt to remove when there are no
volumes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3aa03a49-975d-4e56-826f-886a576f758b
📒 Files selected for processing (3)
bin/lib/onboard.jsscripts/setup.shuninstall.sh
🚧 Files skipped from review as they are similar to previous changes (2)
- uninstall.sh
- bin/lib/onboard.js
| openshell gateway destroy -g nemoclaw >/dev/null 2>&1 || true | ||
| docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true | ||
| fail "Stale state removed. Please rerun: nemoclaw onboard" |
There was a problem hiding this comment.
Don’t claim cleanup succeeded when cleanup is best-effort.
Line 119 and Line 131 always say "Stale state removed...", but cleanup commands are || true. If cleanup fails, the user gets a misleading success message and no exact manual recovery command.
Suggested patch direction
- fail "Stale state removed. Please rerun: nemoclaw onboard"
+ if docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | grep -q .; then
+ fail "Automatic cleanup failed. Run: openshell gateway destroy -g nemoclaw && docker volume rm openshell-cluster-nemoclaw"
+ fi
+ fail "Stale state removed. Please rerun: nemoclaw onboard"Also applies to: 129-131
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/setup.sh` around lines 117 - 119, The script currently runs
best-effort cleanup commands (openshell gateway destroy, docker volume rm via
docker volume ls | xargs) with "|| true" then unconditionally calls fail("Stale
state removed. Please rerun: nemoclaw onboard"), which falsely claims success;
update the cleanup blocks around the openshell gateway destroy / docker volume
ls | xargs sequences and the duplicate block later (same pattern) to detect
failures instead of swallowing them: capture each command's exit status (or run
them without "|| true"), aggregate any failures, and only emit a success message
if all cleanup steps succeeded — otherwise print a clear "Attempted to remove
stale state; manual recovery steps:" message listing the exact commands that
failed (or the full commands to run manually) and avoid calling fail with the
misleading "Stale state removed..." text.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/gateway-cleanup.test.js (1)
17-22: Assertion is weak—strings could appear anywhere in the file.The test passes if
"docker volume"and"openshell-cluster"appear anywhere inonboard.js, not necessarily withindestroyGateway(). Consider tightening the assertion to match within the actual function body.♻️ Suggested improvement
it("onboard.js: destroyGateway() removes Docker volumes", () => { const content = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8"); - expect( - content.includes("docker volume") && content.includes("openshell-cluster"), - ).toBe(true); + // Extract destroyGateway function and verify volume cleanup is inside it + const destroyGwMatch = content.match(/(?:async\s+)?function\s+destroyGateway[\s\S]*?^}/m); + expect(destroyGwMatch).toBeTruthy(); + expect(destroyGwMatch[0]).toMatch(/docker volume/); + expect(destroyGwMatch[0]).toMatch(/openshell-cluster/); });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/gateway-cleanup.test.js` around lines 17 - 22, The current test checks for literal strings anywhere in the file which is too broad; update the test in gateway-cleanup.test.js to assert those strings appear inside the destroyGateway function body specifically by extracting that function (e.g., locate the function declaration/assignment for destroyGateway and capture its source via a regex or an AST parse) and then assert the captured body contains "docker volume" and "openshell-cluster"; ensure the test fails if the function body does not include both strings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/gateway-cleanup.test.js`:
- Around line 24-36: The regex used to extract the async function startGateway
is brittle and can stop at the first unindented closing brace, truncating the
function and missing destroyGateway() calls; replace the regex with a more
robust parser approach (or a balanced-brace search) to reliably capture the
entire startGateway function body when scanning onboard.js, then recount
destroyGateway() occurrences inside that full body; also avoid reading
onboard.js twice by reusing the previously read content variable (content)
instead of calling fs.readFileSync a second time.
---
Nitpick comments:
In `@test/gateway-cleanup.test.js`:
- Around line 17-22: The current test checks for literal strings anywhere in the
file which is too broad; update the test in gateway-cleanup.test.js to assert
those strings appear inside the destroyGateway function body specifically by
extracting that function (e.g., locate the function declaration/assignment for
destroyGateway and capture its source via a regex or an AST parse) and then
assert the captured body contains "docker volume" and "openshell-cluster";
ensure the test fails if the function body does not include both strings.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e48ef391-c346-438c-a8b6-6b90f1b1a31c
📒 Files selected for processing (1)
test/gateway-cleanup.test.js
| it("onboard.js: volume cleanup runs on gateway start failure", () => { | ||
| const content = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8"); | ||
| // The startGateway function should call destroyGateway after a failed start | ||
| const startGwBlock = content.match(/async function startGateway[\s\S]*?^}/m); | ||
| expect(startGwBlock).toBeTruthy(); | ||
|
|
||
| // Count calls to destroyGateway — should be at least 3: | ||
| // 1. pre-cleanup before start | ||
| // 2. after start failure | ||
| // 3. after health check failure | ||
| const calls = (startGwBlock[0].match(/destroyGateway\(\)/g) || []).length; | ||
| expect(calls).toBeGreaterThanOrEqual(3); | ||
| }); |
There was a problem hiding this comment.
Regex may capture an incomplete function body if nested blocks exist.
The non-greedy [\s\S]*? with ^} stops at the first closing brace at line start. If startGateway contains nested blocks where a } happens to be unindented, the regex will truncate early and miss later destroyGateway() calls—causing a false failure.
Also, onboard.js is read twice (here and line 18); consider reading once and reusing.
♻️ Suggested improvement
+const onboardContent = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8");
+
describe("gateway cleanup: Docker volumes removed on failure (`#17`)", () => {
it("onboard.js: destroyGateway() removes Docker volumes", () => {
- const content = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8");
expect(
- content.includes("docker volume") && content.includes("openshell-cluster"),
+ onboardContent.includes("docker volume") && onboardContent.includes("openshell-cluster"),
).toBe(true);
});
it("onboard.js: volume cleanup runs on gateway start failure", () => {
- const content = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8");
- // The startGateway function should call destroyGateway after a failed start
- const startGwBlock = content.match(/async function startGateway[\s\S]*?^}/m);
+ // Match function start and count braces to find the complete body
+ const fnStart = onboardContent.indexOf("async function startGateway");
+ expect(fnStart).toBeGreaterThan(-1);
+
+ let depth = 0, started = false, end = fnStart;
+ for (let i = fnStart; i < onboardContent.length; i++) {
+ if (onboardContent[i] === "{") { depth++; started = true; }
+ else if (onboardContent[i] === "}") { depth--; }
+ if (started && depth === 0) { end = i + 1; break; }
+ }
+ const startGwBlock = onboardContent.slice(fnStart, end);
expect(startGwBlock).toBeTruthy();
// Count calls to destroyGateway — should be at least 3:
- const calls = (startGwBlock[0].match(/destroyGateway\(\)/g) || []).length;
+ const calls = (startGwBlock.match(/destroyGateway\(\)/g) || []).length;
expect(calls).toBeGreaterThanOrEqual(3);
});🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/gateway-cleanup.test.js` around lines 24 - 36, The regex used to extract
the async function startGateway is brittle and can stop at the first unindented
closing brace, truncating the function and missing destroyGateway() calls;
replace the regex with a more robust parser approach (or a balanced-brace
search) to reliably capture the entire startGateway function body when scanning
onboard.js, then recount destroyGateway() occurrences inside that full body;
also avoid reading onboard.js twice by reusing the previously read content
variable (content) instead of calling fs.readFileSync a second time.
Closes #17.
Summary
destroyGateway()helper that removes both the gateway and its Docker volumes (openshell-cluster-nemoclaw)uninstall.shandscripts/setup.shProblem
When
openshell gateway startfails, it leaves behind a Docker volume (openshell-cluster-nemoclaw) in a corrupted state.openshell gateway destroydoesn't remove volumes. Rerunning onboard hits "Corrupted cluster state" with no guidance — users must manually rundocker volume rm openshell-cluster-nemoclaw.Fix
destroyGateway()callsopenshell gateway destroythen removes matching Docker volumes viadocker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm. On failure, it prints "Stale state removed. Please rerun: nemoclaw onboard".Test plan
uninstall.sh: volume cleanup worksSummary by CodeRabbit
Improvements
Bug Fixes
Tests