Skip to content

fix: clean up Docker volumes after failed gateway start#337

Open
ericksoa wants to merge 5 commits intomainfrom
fix/issue-17-gateway-volume-cleanup
Open

fix: clean up Docker volumes after failed gateway start#337
ericksoa wants to merge 5 commits intomainfrom
fix/issue-17-gateway-volume-cleanup

Conversation

@ericksoa
Copy link
Contributor

@ericksoa ericksoa commented Mar 18, 2026

Closes #17.

Summary

  • Extract destroyGateway() helper that removes both the gateway and its Docker volumes (openshell-cluster-nemoclaw)
  • Call it pre-start (existing cleanup), on start failure (new), and on health check failure (new)
  • Add volume cleanup to uninstall.sh and scripts/setup.sh
  • Add 4 regression tests to CI

Problem

When openshell gateway start fails, it leaves behind a Docker volume (openshell-cluster-nemoclaw) in a corrupted state. openshell gateway destroy doesn't remove volumes. Rerunning onboard hits "Corrupted cluster state" with no guidance — users must manually run docker volume rm openshell-cluster-nemoclaw.

Fix

destroyGateway() calls openshell gateway destroy then removes matching Docker volumes via docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm. On failure, it prints "Stale state removed. Please rerun: nemoclaw onboard".

Test plan

  • 56/56 unit tests pass (52 existing + 4 new gateway cleanup regression tests)
  • Simulated corrupted volume: onboard cleans it up automatically, no "Corrupted cluster state"
  • Gateway start failure: volumes cleaned up before exit
  • Health check failure: volumes cleaned up before exit
  • uninstall.sh: volume cleanup works
  • Code review: all three files have volume cleanup on all failure paths

Summary by CodeRabbit

  • Improvements

    • Gateway startup and health checks now automatically remove stale gateway state and print guidance to rerun onboarding ("Stale state removed. Please rerun: nemoclaw onboard").
    • Setup and uninstall flows now explicitly remove related Docker volumes with per-volume reporting.
  • Bug Fixes

    • Consistent cleanup on startup, start-failure, and health-check failures to avoid lingering state.
  • Tests

    • Added tests verifying gateway/setup/uninstall cleanup behavior and updated preset-validation assertions.

Closes #17.

When openshell gateway start fails, it leaves behind a Docker volume
(openshell-cluster-nemoclaw) in a corrupted state. Rerunning onboard
hits "Corrupted cluster state" because openshell gateway destroy
doesn't remove the volume.

Extracts a destroyGateway() helper that removes both the gateway and
its Docker volumes. Called before start (pre-cleanup), after start
failure, and after health check failure. Adds the same volume cleanup
to uninstall.sh and scripts/setup.sh.
Static analysis tests that verify Docker volume cleanup exists in
all three code paths (onboard.js, uninstall.sh, setup.sh) and that
destroyGateway() is called on pre-start, start failure, and health
check failure. Will catch regressions in CI.

Refs: #17
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 18, 2026

📝 Walkthrough

Walkthrough

Adds reusable gateway cleanup that destroys the named OpenShell gateway and removes related Docker volumes during gateway startup, start-failure, health-check failure, setup failure, and uninstall; adds a test verifying the presence of volume cleanup and updates preset tests to use Vitest assertions.

Changes

Cohort / File(s) Summary
Gateway startup & recovery
bin/lib/onboard.js, scripts/setup.sh
Introduce GATEWAY_NAME = "nemoclaw" and a destroyGateway() helper; replace direct destroy calls with the helper; on gateway start failure, start-result non-zero, setup failure, or final health-check failure, run gateway destroy and remove openshell-cluster-nemoclaw Docker volume(s), then exit with a consistent "Stale state removed. Please rerun: nemoclaw onboard" message.
Uninstall Docker volume cleanup
uninstall.sh
Add remove_related_docker_volumes() which checks Docker availability, lists openshell-cluster* volumes, and removes them with per-volume logging; integrate this step into main() between container and image removal.
Tests for cleanup
test/gateway-cleanup.test.js
New Vitest suite that reads bin/lib/onboard.js, uninstall.sh, and scripts/setup.sh as text and asserts presence of docker volume / openshell-cluster references and that destroyGateway() appears in startup, start-failure, and health-check failure paths.
Tests / assertions update
test/policies.test.js
Replace Node assert checks with Vitest expect assertions for preset-validation tests.

Sequence Diagram(s)

sequenceDiagram
    participant Onboard as Onboard script (bin/lib/onboard.js)
    participant GatewayCLI as "openshell gateway CLI"
    participant Docker as Docker (volumes)
    participant Health as Gateway Health Check

    Onboard->>GatewayCLI: run destroyGateway() (pre-start cleanup)
    GatewayCLI-->>Onboard: returns status
    Onboard->>GatewayCLI: run "gateway start -g nemoclaw"
    GatewayCLI-->>Onboard: startResult (status)
    alt startResult.status != 0
        Onboard->>GatewayCLI: run destroyGateway()
        Onboard->>Docker: remove openshell-cluster-nemoclaw volume(s)
        Onboard-->>User: "Stale state removed. Please rerun: nemoclaw onboard"
    else start succeeded
        Onboard->>Health: poll health (1..5)
        alt health == Connected
            Health-->>Onboard: Connected
        else after retries fail
            Onboard->>GatewayCLI: run destroyGateway()
            Onboard->>Docker: remove openshell-cluster-nemoclaw volume(s)
            Onboard-->>User: "Stale state removed. Please rerun: nemoclaw onboard"
        end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 I hopped in with scripts and a tiny plan,

Cleared stale volumes as only a rabbit can,
Destroyed the gateway, tidied Docker’s mess,
Rerun onboard — fresh state, no stress! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: clean up Docker volumes after failed gateway start' accurately summarizes the primary change and directly matches the main objective of the PR.
Linked Issues check ✅ Passed All code changes comprehensively implement the acceptance criteria from issue #17: destroyGateway() removes Docker volumes, cleanup occurs on startup/health-check failures, error messages guide recovery, and regression tests validate the solution.
Out of Scope Changes check ✅ Passed All changes directly address issue #17's objectives: gateway/volume cleanup logic in onboard.js, uninstall.sh, and setup.sh; regression tests; and test migration to Vitest are all scope-aligned.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/issue-17-gateway-volume-cleanup

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 125-130: The destroyGateway function currently calls run(...) with
xargs -r (GNU-only) and swallows all errors via { ignoreError: true } and ||
true; remove the -r flag so xargs is portable on BSD/macOS, and change
destroyGateway (and its two run invocations that reference GATEWAY_NAME) to
propagate success/failure (e.g., return boolean or throw) instead of always
ignoring errors so callers can detect cleanup failures and show the manual
recovery command; update call sites that assume cleanup always succeeded to
check destroyGateway's result and display the manual docker volume rm
instruction when it fails.

In `@scripts/setup.sh`:
- Around line 77-97: The script uses the GNU-only xargs -r flag in the docker
volume removal pipelines (the commands that start with "docker volume ls -q
--filter \"name=openshell-cluster-nemoclaw\" | xargs -r docker volume rm ...")
which breaks on macOS/BSD; remove the -r flag so the pipeline becomes portable,
keeping the surrounding stderr redirection and the existing cleanup flow that
calls openshell gateway destroy -g nemoclaw, docker volume rm, warn and fail
unchanged to preserve behavior when cleanup is needed.
- Around line 80-85: The current check uses the exit of grep on the piped output
to decide success; instead run the openshell gateway start command and capture
both its exit code and output (e.g., capture output into a variable or tee it to
a temp log), then use the actual exit code of openshell (saved into a variable
like start_status or via PIPESTATUS[0]) to decide whether to run the cleanup;
you may still grep the captured output for logging/diagnostics but ensure the
cleanup block (the openshell gateway destroy and docker volume rm calls) is
executed only when the openshell gateway start exit status indicates failure.

In `@test/gateway-cleanup.test.js`:
- Around line 17-62: Tests only check source text and call counts; add an
integration-style test that actually exercises a failed gateway
start/health-check and asserts the cleanup command runs. Create a new test that
launches the real onboarding process (bin/lib/onboard.js or the bin script) via
child_process.spawn or fork with an env var that forces startGateway to fail (or
put a fake docker binary early in PATH that records invocations), wait for the
failure path, and then assert that destroyGateway was invoked by checking
captured stdout/stderr for the docker volume removal command (or that the fake
docker binary was called with "volume rm openshell-cluster") and/or that
remove_related_docker_volumes / uninstall.sh cleanup logic was executed;
reference startGateway, destroyGateway, remove_related_docker_volumes,
onboard.js, uninstall.sh, and scripts/setup.sh so the test locates and exercises
the real cleanup code rather than just matching strings.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4efde2a0-67a9-4d4b-bfca-02f5d15477fe

📥 Commits

Reviewing files that changed from the base of the PR and between 241ffb2 and b1f1b4d.

📒 Files selected for processing (4)
  • bin/lib/onboard.js
  • scripts/setup.sh
  • test/gateway-cleanup.test.js
  • uninstall.sh

Comment on lines +125 to +130
function destroyGateway() {
run(`openshell gateway destroy -g ${GATEWAY_NAME} 2>/dev/null || true`, { ignoreError: true });
// openshell gateway destroy doesn't remove Docker volumes, which leaves
// corrupted cluster state that breaks the next gateway start. Clean them up.
run(`docker volume ls -q --filter "name=openshell-cluster-${GATEWAY_NAME}" | xargs -r docker volume rm 2>/dev/null || true`, { ignoreError: true });
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's read the entire onboard.js file to see the full context
wc -l bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 81


🏁 Script executed:

# Read the file to see context around the flagged lines and downstream behavior
cat -n bin/lib/onboard.js | head -200

Repository: NVIDIA/NemoClaw

Length of output: 9012


🏁 Script executed:

# Search for "Stale state removed" or similar messages that claim cleanup success
rg -i "stale|remove|clean" bin/lib/onboard.js

Repository: NVIDIA/NemoClaw

Length of output: 806


🌐 Web query:

xargs -r option BSD macOS compatibility GNU

💡 Result:

  • GNU xargs -r (--no-run-if-empty) means: don’t run the command if there are zero input items. (gnu.org)
  • BSD/macOS /usr/bin/xargs does not support -r (you’ll get xargs: illegal option -- r) and generally already behaves like “no-run-if-empty”, so -r is unnecessary on macOS. (stackoverflow.com)
  • Portability advice (GNU + macOS):
    • Usually: just drop -r in scripts intended to run on macOS and GNU/Linux.
    • If you specifically need GNU xargs on macOS, install GNU findutils (e.g., Homebrew) and use the GNU-prefixed binary (commonly gxargs). (apple.stackexchange.com)

Citations:


Fix xargs portability and surface cleanup failures to the caller.

The xargs -r option (line 129) is GNU-specific and not available on BSD/macOS—it will fail with "illegal option -- r". Since all errors are silently suppressed with ignoreError: true and || true, Docker volumes left behind by the failed cleanup are never reported. Users on Colima/Docker Desktop see "Stale state removed" (lines 151, 165) even though volumes remain corrupted, causing the next gateway start to fail.

Remove the -r flag (BSD/macOS xargs already behaves like --no-run-if-empty by default) and let destroyGateway() return a success/failure signal so the caller can show the manual recovery command when cleanup fails.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 125 - 130, The destroyGateway function
currently calls run(...) with xargs -r (GNU-only) and swallows all errors via {
ignoreError: true } and || true; remove the -r flag so xargs is portable on
BSD/macOS, and change destroyGateway (and its two run invocations that reference
GATEWAY_NAME) to propagate success/failure (e.g., return boolean or throw)
instead of always ignoring errors so callers can detect cleanup failures and
show the manual recovery command; update call sites that assume cleanup always
succeeded to check destroyGateway's result and display the manual docker volume
rm instruction when it fails.

Comment on lines +77 to +97
docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true
GATEWAY_ARGS=(--name nemoclaw)
command -v nvidia-smi > /dev/null 2>&1 && GATEWAY_ARGS+=(--gpu)
openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1 | grep -E "Gateway|✓|Error|error" || true
if ! openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1 | grep -E "Gateway|✓|Error|error"; then
warn "Gateway start failed. Cleaning up stale state..."
openshell gateway destroy -g nemoclaw > /dev/null 2>&1 || true
docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true
fail "Stale state removed. Please rerun: nemoclaw onboard"
fi

# Verify gateway is actually healthy (may need a moment after start)
for i in 1 2 3 4 5; do
if openshell status 2>&1 | grep -q "Connected"; then
break
fi
[ "$i" -eq 5 ] && fail "Gateway failed to start. Check 'openshell gateway info' and Docker logs."
if [ "$i" -eq 5 ]; then
warn "Gateway health check failed. Cleaning up stale state..."
openshell gateway destroy -g nemoclaw > /dev/null 2>&1 || true
docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true
fail "Stale state removed. Please rerun: nemoclaw onboard"
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find and examine the scripts/setup.sh file
find . -name "setup.sh" -path "*/scripts/*" 2>/dev/null

Repository: NVIDIA/NemoClaw

Length of output: 77


🏁 Script executed:

# Check the content at the specified lines
if [ -f "scripts/setup.sh" ]; then
  echo "=== Lines 77-97 of scripts/setup.sh ===" && sed -n '77,97p' scripts/setup.sh
  echo ""
  echo "=== Full script context (first 100 lines) ===" && head -100 scripts/setup.sh | cat -n
fi

Repository: NVIDIA/NemoClaw

Length of output: 5593


🏁 Script executed:

# Check for platform support documentation or shebang
if [ -f "scripts/setup.sh" ]; then
  echo "=== First 30 lines ===" && head -30 scripts/setup.sh | cat -n
fi

Repository: NVIDIA/NemoClaw

Length of output: 1265


🌐 Web query:

xargs -r flag macOS BSD portability GNU coreutils

💡 Result:

  • xargs -r / --no-run-if-empty is a GNU extension (from GNU findutils’ xargs), and GNU xargs otherwise runs the command once even with empty input. [1]
  • macOS ships BSD xargs, which does not accept -r, and (usefully) it already avoids running the command when there’s no input, so -r is unnecessary on macOS. [2]

Portable patterns

  • Most portable fix: just remove -r when targeting macOS/BSD + GNU (BSD behavior already matches what you wanted). [2]
  • If you need GNU behavior/features consistently on macOS: install GNU findutils and use GNU xargs (often as gxargs, or by prepending Homebrew’s gnubin to PATH). [3]

Sources

  1. GNU findutils manual (xargs -r described as GNU extension; default runs once with empty input). [1]
  2. BSD/macOS behavior and lack of -r option. [2]
  3. Homebrew findutils formula (GNU find/xargs/locate + gnubin PATH hint). [3]

Replace xargs -r with a portable alternative for macOS/BSD support.

Lines 77, 83, and 95 use xargs -r, which is a GNU extension not available on macOS/BSD. Because stderr is redirected and the command is wrapped in || true, the script silently fails to remove the volume on Colima/Docker Desktop and still reports "Stale state removed." The script explicitly lists Colima and Docker Desktop as supported platforms (line 8).

Since BSD xargs already avoids running the command with empty input (matching the intended behavior of -r), remove the -r flag to make the cleanup portable across all documented platforms.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup.sh` around lines 77 - 97, The script uses the GNU-only xargs -r
flag in the docker volume removal pipelines (the commands that start with
"docker volume ls -q --filter \"name=openshell-cluster-nemoclaw\" | xargs -r
docker volume rm ...") which breaks on macOS/BSD; remove the -r flag so the
pipeline becomes portable, keeping the surrounding stderr redirection and the
existing cleanup flow that calls openshell gateway destroy -g nemoclaw, docker
volume rm, warn and fail unchanged to preserve behavior when cleanup is needed.

Comment on lines +80 to +85
if ! openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1 | grep -E "Gateway|✓|Error|error"; then
warn "Gateway start failed. Cleaning up stale state..."
openshell gateway destroy -g nemoclaw > /dev/null 2>&1 || true
docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true
fail "Stale state removed. Please rerun: nemoclaw onboard"
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n scripts/setup.sh | head -100

Repository: NVIDIA/NemoClaw

Length of output: 4399


🏁 Script executed:

cd /tmp && cat > test_pipefail.sh << 'EOF'
#!/bin/bash
set -euo pipefail

# Simulate gateway start that succeeds but has no keywords in output
test_case_1() {
  echo "Test 1: Command succeeds (exit 0) but no keywords in output"
  # Mock: openshell gateway start outputs something without the keywords
  if ! (echo "Starting gateway..." && exit 0) 2>&1 | grep -E "Gateway|✓|Error|error"; then
    echo "  → Cleanup triggered (PROBLEM: success command treated as failure)"
  else
    echo "  → No cleanup (correct)"
  fi
}

# Simulate gateway start that fails
test_case_2() {
  echo "Test 2: Command fails (exit 1) with keywords in output"
  if ! (echo "Error: something" && exit 1) 2>&1 | grep -E "Gateway|✓|Error|error"; then
    echo "  → Cleanup triggered (correct)"
  else
    echo "  → No cleanup (PROBLEM: failure not detected)"
  fi
}

test_case_1
test_case_2
EOF
bash test_case_1.sh

Repository: NVIDIA/NemoClaw

Length of output: 108


🏁 Script executed:

bash << 'EOF'
set -euo pipefail

echo "=== Test 1: Command succeeds (exit 0) but output has no keywords ==="
if ! (echo "Starting gateway..." && exit 0) 2>&1 | grep -E "Gateway|✓|Error|error"; then
  echo "Result: Cleanup TRIGGERED (bad - success treated as failure)"
else
  echo "Result: No cleanup (good)"
fi

echo ""
echo "=== Test 2: Command fails (exit 1) but output HAS keywords ==="
if ! (echo "Error: something" && exit 1) 2>&1 | grep -E "Gateway|✓|Error|error"; then
  echo "Result: Cleanup TRIGGERED (correct)"
else
  echo "Result: No cleanup (bad - failure not detected)"
fi
EOF

Repository: NVIDIA/NemoClaw

Length of output: 305


Gate gateway cleanup on the start command's exit code, not on log output matching.

With set -o pipefail, the current condition if ! ... | grep -E "Gateway|✓|Error|error" fails whenever grep finds no match—even if openshell gateway start exits 0. This causes a healthy gateway to be destroyed if the CLI's output format changes to omit those keywords. Capture the exit code separately and gate cleanup on that instead.

Suggested fix
-if ! openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1 | grep -E "Gateway|✓|Error|error"; then
+set +e
+START_OUTPUT=$(openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1)
+START_RC=$?
+set -e
+printf '%s\n' "$START_OUTPUT" | grep -E "Gateway|✓|Error|error" || true
+if [ "$START_RC" -ne 0 ]; then
   warn "Gateway start failed. Cleaning up stale state..."
   openshell gateway destroy -g nemoclaw > /dev/null 2>&1 || true
   docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true
   fail "Stale state removed. Please rerun: nemoclaw onboard"
 fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if ! openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1 | grep -E "Gateway|✓|Error|error"; then
warn "Gateway start failed. Cleaning up stale state..."
openshell gateway destroy -g nemoclaw > /dev/null 2>&1 || true
docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true
fail "Stale state removed. Please rerun: nemoclaw onboard"
fi
set +e
START_OUTPUT=$(openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1)
START_RC=$?
set -e
printf '%s\n' "$START_OUTPUT" | grep -E "Gateway|✓|Error|error" || true
if [ "$START_RC" -ne 0 ]; then
warn "Gateway start failed. Cleaning up stale state..."
openshell gateway destroy -g nemoclaw > /dev/null 2>&1 || true
docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true
fail "Stale state removed. Please rerun: nemoclaw onboard"
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup.sh` around lines 80 - 85, The current check uses the exit of
grep on the piped output to decide success; instead run the openshell gateway
start command and capture both its exit code and output (e.g., capture output
into a variable or tee it to a temp log), then use the actual exit code of
openshell (saved into a variable like start_status or via PIPESTATUS[0]) to
decide whether to run the cleanup; you may still grep the captured output for
logging/diagnostics but ensure the cleanup block (the openshell gateway destroy
and docker volume rm calls) is executed only when the openshell gateway start
exit status indicates failure.

Comment on lines +17 to +62
describe("gateway cleanup: Docker volumes removed on failure (#17)", () => {
it("onboard.js: destroyGateway() removes Docker volumes", () => {
const content = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8");
assert.ok(
content.includes("docker volume") && content.includes("openshell-cluster"),
"onboard.js must contain Docker volume cleanup for openshell-cluster volumes"
);
});

it("onboard.js: volume cleanup runs on gateway start failure", () => {
const content = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8");
// The startGateway function should call destroyGateway after a failed start
const startGwBlock = content.match(/async function startGateway[\s\S]*?^}/m);
assert.ok(startGwBlock, "Could not find startGateway function");

// Count calls to destroyGateway — should be at least 3:
// 1. pre-cleanup before start
// 2. after start failure
// 3. after health check failure
const calls = (startGwBlock[0].match(/destroyGateway\(\)/g) || []).length;
assert.ok(
calls >= 3,
`startGateway must call destroyGateway() at least 3 times (pre-start, start failure, health failure), found ${calls}`
);
});

it("uninstall.sh: includes Docker volume cleanup", () => {
const content = fs.readFileSync(path.join(ROOT, "uninstall.sh"), "utf-8");
assert.ok(
content.includes("docker volume") && content.includes("openshell-cluster"),
"uninstall.sh must remove openshell-cluster Docker volumes"
);
assert.ok(
content.includes("remove_related_docker_volumes"),
"uninstall.sh must define and call remove_related_docker_volumes()"
);
});

it("setup.sh: includes Docker volume cleanup on failure", () => {
const content = fs.readFileSync(path.join(ROOT, "scripts/setup.sh"), "utf-8");
assert.ok(
content.includes("docker volume") && content.includes("openshell-cluster"),
"setup.sh must remove openshell-cluster Docker volumes on failure"
);
});
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

These tests won't catch runtime cleanup regressions.

All four cases only read source files and assert on strings or call counts. That won't prove rerunning onboarding after an intentionally failed start works, and CI will still pass if destroyGateway() becomes a no-op, if the shell cleanup is non-portable, or if the failure path prints the wrong recovery command. Please add at least one behavior-level test that exercises a failed start/health-check path and asserts the cleanup command is actually invoked.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/gateway-cleanup.test.js` around lines 17 - 62, Tests only check source
text and call counts; add an integration-style test that actually exercises a
failed gateway start/health-check and asserts the cleanup command runs. Create a
new test that launches the real onboarding process (bin/lib/onboard.js or the
bin script) via child_process.spawn or fork with an env var that forces
startGateway to fail (or put a fake docker binary early in PATH that records
invocations), wait for the failure path, and then assert that destroyGateway was
invoked by checking captured stdout/stderr for the docker volume removal command
(or that the fake docker binary was called with "volume rm openshell-cluster")
and/or that remove_related_docker_volumes / uninstall.sh cleanup logic was
executed; reference startGateway, destroyGateway, remove_related_docker_volumes,
onboard.js, uninstall.sh, and scripts/setup.sh so the test locates and exercises
the real cleanup code rather than just matching strings.

Copy link
Contributor

@cv cv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — clean fix for the corrupted volume problem. destroyGateway() extracts the cleanup well, all failure paths covered, and the regression tests verify it.

Copy link
Contributor

@cv cv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistent fix across all three code paths, tests verify the right invariants.

xargs -r is not portable to macOS

The volume cleanup command uses xargs -r (GNU extension — don't run if stdin is empty). macOS xargs doesn't support -r. On macOS with no matching volumes, xargs will call docker volume rm with no arguments, which prints a usage error. Harmless due to || true, but noisy. This applies to onboard.js, setup.sh, and uninstall.sh — all three run on macOS.

Options: pipe through grep . first, use an if guard, or use xargs without -r and accept the usage noise (already suppressed by 2>/dev/null || true).

Minor: possible double-cleanup in uninstall.sh

The new explicit calls to remove_related_docker_containers, remove_related_docker_volumes, and remove_related_docker_images (lines 570-578) run immediately after remove_docker_resources (line 568). If remove_docker_resources already handles some of these, this results in redundant cleanup attempts. Not harmful, but worth verifying there's no overlap.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
scripts/setup.sh (2)

115-120: ⚠️ Potential issue | 🟠 Major

Gate cleanup on openshell gateway start exit status, not grep match.

Line 115 currently ties failure handling to grep output. With set -o pipefail, a successful start can still enter cleanup if output stops matching "Gateway|✓|Error|error".

Suggested patch
-if ! openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1 | grep -E "Gateway|✓|Error|error"; then
+set +e
+START_OUTPUT="$(openshell gateway start "${GATEWAY_ARGS[@]}" 2>&1)"
+START_RC=$?
+set -e
+printf '%s\n' "$START_OUTPUT" | grep -E "Gateway|✓|Error|error" || true
+if [ "$START_RC" -ne 0 ]; then
   warn "Gateway start failed. Cleaning up stale state..."
   openshell gateway destroy -g nemoclaw >/dev/null 2>&1 || true
-  docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true
+  docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs docker volume rm 2>/dev/null || true
   fail "Stale state removed. Please rerun: nemoclaw onboard"
 fi
#!/usr/bin/env bash
set -euo pipefail

echo "Case 1: command succeeds, grep misses => currently treated as failure"
if ! (echo "Starting gateway..." && exit 0) 2>&1 | grep -E "Gateway|✓|Error|error"; then
  echo "cleanup branch entered"
else
  echo "no cleanup"
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup.sh` around lines 115 - 120, The cleanup branch currently
triggers based on the grep match rather than the actual exit status of the
starts command; change the logic so you run openshell gateway start
"${GATEWAY_ARGS[@]}" and check its exit code (e.g., if ! openshell gateway start
...; then cleanup) instead of relying on the pipe to grep for
"Gateway|✓|Error|error". If you still need to inspect stdout/stderr for error
patterns, capture the command output into a variable (or a temp file), examine
that with grep to log or decide additional actions, but always use the process
exit status of openshell gateway start as the primary condition for running the
cleanup steps that reference openshell gateway destroy and docker volume rm.

112-112: ⚠️ Potential issue | 🟠 Major

Use portable xargs invocation for volume cleanup.

Line 112, Line 118, and Line 130 use GNU-only xargs -r. On macOS/BSD this can fail silently here (stderr redirected + || true), leaving stale volumes behind.

Suggested patch
-docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true
+docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs docker volume rm 2>/dev/null || true
Is `xargs -r` supported on macOS/BSD `xargs`, and is removing `-r` the portable fix when input may be empty?

Also applies to: 118-118, 130-130

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup.sh` at line 112, The xargs -r usage in the docker volume
cleanup commands (the lines invoking "docker volume ls -q --filter
\"name=openshell-cluster-nemoclaw\" | xargs -r docker volume rm") is GNU-only
and not portable to macOS/BSD; replace the pipeline with a portability-safe
check: capture the output of docker volume ls -q --filter "...", test if it's
non-empty (e.g. ids="$(docker volume ls -q --filter 'name=...')" and if [ -n
"$ids" ]; then echo "$ids" | xargs docker volume rm; fi) for each occurrence
(the three lines that currently use xargs -r) so you avoid relying on -r and
don't attempt to remove when there are no volumes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/setup.sh`:
- Around line 117-119: The script currently runs best-effort cleanup commands
(openshell gateway destroy, docker volume rm via docker volume ls | xargs) with
"|| true" then unconditionally calls fail("Stale state removed. Please rerun:
nemoclaw onboard"), which falsely claims success; update the cleanup blocks
around the openshell gateway destroy / docker volume ls | xargs sequences and
the duplicate block later (same pattern) to detect failures instead of
swallowing them: capture each command's exit status (or run them without "||
true"), aggregate any failures, and only emit a success message if all cleanup
steps succeeded — otherwise print a clear "Attempted to remove stale state;
manual recovery steps:" message listing the exact commands that failed (or the
full commands to run manually) and avoid calling fail with the misleading "Stale
state removed..." text.

---

Duplicate comments:
In `@scripts/setup.sh`:
- Around line 115-120: The cleanup branch currently triggers based on the grep
match rather than the actual exit status of the starts command; change the logic
so you run openshell gateway start "${GATEWAY_ARGS[@]}" and check its exit code
(e.g., if ! openshell gateway start ...; then cleanup) instead of relying on the
pipe to grep for "Gateway|✓|Error|error". If you still need to inspect
stdout/stderr for error patterns, capture the command output into a variable (or
a temp file), examine that with grep to log or decide additional actions, but
always use the process exit status of openshell gateway start as the primary
condition for running the cleanup steps that reference openshell gateway destroy
and docker volume rm.
- Line 112: The xargs -r usage in the docker volume cleanup commands (the lines
invoking "docker volume ls -q --filter \"name=openshell-cluster-nemoclaw\" |
xargs -r docker volume rm") is GNU-only and not portable to macOS/BSD; replace
the pipeline with a portability-safe check: capture the output of docker volume
ls -q --filter "...", test if it's non-empty (e.g. ids="$(docker volume ls -q
--filter 'name=...')" and if [ -n "$ids" ]; then echo "$ids" | xargs docker
volume rm; fi) for each occurrence (the three lines that currently use xargs -r)
so you avoid relying on -r and don't attempt to remove when there are no
volumes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3aa03a49-975d-4e56-826f-886a576f758b

📥 Commits

Reviewing files that changed from the base of the PR and between b1f1b4d and 12bc293.

📒 Files selected for processing (3)
  • bin/lib/onboard.js
  • scripts/setup.sh
  • uninstall.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • uninstall.sh
  • bin/lib/onboard.js

Comment on lines +117 to +119
openshell gateway destroy -g nemoclaw >/dev/null 2>&1 || true
docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | xargs -r docker volume rm 2>/dev/null || true
fail "Stale state removed. Please rerun: nemoclaw onboard"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t claim cleanup succeeded when cleanup is best-effort.

Line 119 and Line 131 always say "Stale state removed...", but cleanup commands are || true. If cleanup fails, the user gets a misleading success message and no exact manual recovery command.

Suggested patch direction
-  fail "Stale state removed. Please rerun: nemoclaw onboard"
+  if docker volume ls -q --filter "name=openshell-cluster-nemoclaw" | grep -q .; then
+    fail "Automatic cleanup failed. Run: openshell gateway destroy -g nemoclaw && docker volume rm openshell-cluster-nemoclaw"
+  fi
+  fail "Stale state removed. Please rerun: nemoclaw onboard"

Also applies to: 129-131

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup.sh` around lines 117 - 119, The script currently runs
best-effort cleanup commands (openshell gateway destroy, docker volume rm via
docker volume ls | xargs) with "|| true" then unconditionally calls fail("Stale
state removed. Please rerun: nemoclaw onboard"), which falsely claims success;
update the cleanup blocks around the openshell gateway destroy / docker volume
ls | xargs sequences and the duplicate block later (same pattern) to detect
failures instead of swallowing them: capture each command's exit status (or run
them without "|| true"), aggregate any failures, and only emit a success message
if all cleanup steps succeeded — otherwise print a clear "Attempted to remove
stale state; manual recovery steps:" message listing the exact commands that
failed (or the full commands to run manually) and avoid calling fail with the
misleading "Stale state removed..." text.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/gateway-cleanup.test.js (1)

17-22: Assertion is weak—strings could appear anywhere in the file.

The test passes if "docker volume" and "openshell-cluster" appear anywhere in onboard.js, not necessarily within destroyGateway(). Consider tightening the assertion to match within the actual function body.

♻️ Suggested improvement
   it("onboard.js: destroyGateway() removes Docker volumes", () => {
     const content = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8");
-    expect(
-      content.includes("docker volume") && content.includes("openshell-cluster"),
-    ).toBe(true);
+    // Extract destroyGateway function and verify volume cleanup is inside it
+    const destroyGwMatch = content.match(/(?:async\s+)?function\s+destroyGateway[\s\S]*?^}/m);
+    expect(destroyGwMatch).toBeTruthy();
+    expect(destroyGwMatch[0]).toMatch(/docker volume/);
+    expect(destroyGwMatch[0]).toMatch(/openshell-cluster/);
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/gateway-cleanup.test.js` around lines 17 - 22, The current test checks
for literal strings anywhere in the file which is too broad; update the test in
gateway-cleanup.test.js to assert those strings appear inside the destroyGateway
function body specifically by extracting that function (e.g., locate the
function declaration/assignment for destroyGateway and capture its source via a
regex or an AST parse) and then assert the captured body contains "docker
volume" and "openshell-cluster"; ensure the test fails if the function body does
not include both strings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/gateway-cleanup.test.js`:
- Around line 24-36: The regex used to extract the async function startGateway
is brittle and can stop at the first unindented closing brace, truncating the
function and missing destroyGateway() calls; replace the regex with a more
robust parser approach (or a balanced-brace search) to reliably capture the
entire startGateway function body when scanning onboard.js, then recount
destroyGateway() occurrences inside that full body; also avoid reading
onboard.js twice by reusing the previously read content variable (content)
instead of calling fs.readFileSync a second time.

---

Nitpick comments:
In `@test/gateway-cleanup.test.js`:
- Around line 17-22: The current test checks for literal strings anywhere in the
file which is too broad; update the test in gateway-cleanup.test.js to assert
those strings appear inside the destroyGateway function body specifically by
extracting that function (e.g., locate the function declaration/assignment for
destroyGateway and capture its source via a regex or an AST parse) and then
assert the captured body contains "docker volume" and "openshell-cluster";
ensure the test fails if the function body does not include both strings.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e48ef391-c346-438c-a8b6-6b90f1b1a31c

📥 Commits

Reviewing files that changed from the base of the PR and between c6548fb and 5e4a1b6.

📒 Files selected for processing (1)
  • test/gateway-cleanup.test.js

Comment on lines +24 to +36
it("onboard.js: volume cleanup runs on gateway start failure", () => {
const content = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8");
// The startGateway function should call destroyGateway after a failed start
const startGwBlock = content.match(/async function startGateway[\s\S]*?^}/m);
expect(startGwBlock).toBeTruthy();

// Count calls to destroyGateway — should be at least 3:
// 1. pre-cleanup before start
// 2. after start failure
// 3. after health check failure
const calls = (startGwBlock[0].match(/destroyGateway\(\)/g) || []).length;
expect(calls).toBeGreaterThanOrEqual(3);
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Regex may capture an incomplete function body if nested blocks exist.

The non-greedy [\s\S]*? with ^} stops at the first closing brace at line start. If startGateway contains nested blocks where a } happens to be unindented, the regex will truncate early and miss later destroyGateway() calls—causing a false failure.

Also, onboard.js is read twice (here and line 18); consider reading once and reusing.

♻️ Suggested improvement
+const onboardContent = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8");
+
 describe("gateway cleanup: Docker volumes removed on failure (`#17`)", () => {
   it("onboard.js: destroyGateway() removes Docker volumes", () => {
-    const content = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8");
     expect(
-      content.includes("docker volume") && content.includes("openshell-cluster"),
+      onboardContent.includes("docker volume") && onboardContent.includes("openshell-cluster"),
     ).toBe(true);
   });

   it("onboard.js: volume cleanup runs on gateway start failure", () => {
-    const content = fs.readFileSync(path.join(ROOT, "bin/lib/onboard.js"), "utf-8");
-    // The startGateway function should call destroyGateway after a failed start
-    const startGwBlock = content.match(/async function startGateway[\s\S]*?^}/m);
+    // Match function start and count braces to find the complete body
+    const fnStart = onboardContent.indexOf("async function startGateway");
+    expect(fnStart).toBeGreaterThan(-1);
+
+    let depth = 0, started = false, end = fnStart;
+    for (let i = fnStart; i < onboardContent.length; i++) {
+      if (onboardContent[i] === "{") { depth++; started = true; }
+      else if (onboardContent[i] === "}") { depth--; }
+      if (started && depth === 0) { end = i + 1; break; }
+    }
+    const startGwBlock = onboardContent.slice(fnStart, end);
     expect(startGwBlock).toBeTruthy();

     // Count calls to destroyGateway — should be at least 3:
-    const calls = (startGwBlock[0].match(/destroyGateway\(\)/g) || []).length;
+    const calls = (startGwBlock.match(/destroyGateway\(\)/g) || []).length;
     expect(calls).toBeGreaterThanOrEqual(3);
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/gateway-cleanup.test.js` around lines 24 - 36, The regex used to extract
the async function startGateway is brittle and can stop at the first unindented
closing brace, truncating the function and missing destroyGateway() calls;
replace the regex with a more robust parser approach (or a balanced-brace
search) to reliably capture the entire startGateway function body when scanning
onboard.js, then recount destroyGateway() occurrences inside that full body;
also avoid reading onboard.js twice by reusing the previously read content
variable (content) instead of calling fs.readFileSync a second time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Docker Support for Docker containerization

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ensure failed gateway startup is fully cleaned up so reruns do not require manual Docker volume deletion

3 participants