fix(uninstall): kill leftover host openshell-gateway on cleanup (#3516)#4223
Conversation
…VIDIA#3516) PR NVIDIA#3554 stopped the host gateway process during `nemoclaw uninstall`, but direct `openshell gateway destroy` (and the sandbox-destroy `--cleanup-gateway` flow) on Linux Docker-driver mode can still leave `/usr/local/bin/openshell-gateway` running on port 8080. Consolidate the cleanup logic in a shared helper used by uninstall, sandbox destroy, and onboard, and update recovery hints to surface a `sudo pkill -f openshell-gateway` remediation alongside the existing `openshell gateway remove` / `destroy` verbs. Key changes: - New src/lib/onboard/host-gateway-process.ts: shared stopper with PID-file primary path, opt-out pgrep fallback for orphans, Docker compat parent recognition (argv0 == docker AND mount-path token), bounded SIGTERM/SIGKILL with verified exit via `ps`, sudo remediation log when a privileged PID survives, and a warning when pgrep is unavailable so an uninstaller never claims success while an orphan is still bound. - src/lib/actions/uninstall/run-plan.ts: invoke the shared helper during the "Stopping services" step. - src/lib/actions/sandbox/destroy.ts: replace the local PID-file probe with the shared helper and disable the pgrep sweep so a same-host gateway run for an unrelated project is not torn down by `destroy --cleanup-gateway`. - src/lib/onboard.ts: route `stopDockerDriverGatewayProcess` and the drift restart through the shared helper (drift restart targets only the supplied PID); update the final-failure recovery banner to add the sudo pkill line on Linux while keeping both lifecycle verbs. - Recovery hint helpers (gpu-recovery.ts, gateway-gpu-passthrough.ts) and user-facing docs (`docs/manage-sandboxes/lifecycle.mdx`, `docs/reference/commands.mdx`, `docs/reference/troubleshooting.mdx`, `scripts/install.sh`) tell users to run `gateway remove` first with the legacy `gateway destroy -g` fallback, then `sudo pkill -f openshell-gateway` for stuck privileged processes. Tests: - New unit tests cover the PID-file path, pgrep fallback, false-positive rejection (random `vim /opt/nemoclaw/openshell-gateway`, unrelated cmdlines mentioning openshell-gateway), Docker compat parent matching, pgrep-skip behaviour when explicit PIDs are passed, and the pgrep-unavailable warning. - Existing uninstall + onboard tests updated for the new copy. - E2E verified by spawning a fake `/tmp/.../openshell-gateway` that binds a unique port and confirming both PID-file-only and pgrep-only branches stop it via the built dist/ entry point. Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3ae6a65 to
1e834bd
Compare
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughCentralizes host-side openshell-gateway termination into stopHostGatewayProcesses, updates uninstall/destroy/onboard to use it, adds sudo pkill remediation guidance in docs and error messages, and extends tests/CLI to validate deterministic gateway-cleanup behavior. ChangesHost gateway process management consolidation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
…VIDIA#3516) The previous mock PIDs (7777, 99887, 42, 111, 222, 123, 456, 5151, 6262) could collide with real /proc entries on a busy CI runner. When that happens, `processArgs` reads the real `/proc/<pid>/cmdline` (its first choice, before the ps mock) and `hostGatewayCmdlineMatches` rejects the foreign cmdline, so the candidate gets routed to `skippedNonMatchingPids` instead of `stopped`. The GitHub "basic checks" runner hit this on PID 7777 and the drift-restart test failed with `expected [] to deeply equal [ 7777 ]`. Use PIDs above the default kernel `pid_max` (4194304) so `/proc/<pid>/cmdline` never exists and the ps mock wins deterministically. Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/host-gateway-process.test.ts`:
- Around line 226-253: The test "skips pgrep sweep when explicit PIDs are passed
(drift restart)" is non-deterministic because real /proc/<pid>/cmdline can be
read for PID 7777; update the test so stopHostGatewayProcesses always uses the
mocked ps response by stubbing FS access for that proc path: before calling
stopHostGatewayProcesses, stub fs.readFileSync (or fs.existsSync/statSync) for
`/proc/7777/cmdline` to throw ENOENT (or return an empty/non-matching value),
ensuring the code path that queries psResponses (via run/tracedRun and
psResponses) is exercised; reference symbols: the test name, tracedRun, run,
psResponses and stopHostGatewayProcesses so you change only the test setup to
stub FS access for that specific proc path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4cb7b1dc-6097-479f-baa1-b7844c0e8d24
📒 Files selected for processing (15)
docs/manage-sandboxes/lifecycle.mdxdocs/reference/commands.mdxdocs/reference/troubleshooting.mdxscripts/install.shsrc/lib/actions/sandbox/destroy.tssrc/lib/actions/uninstall/run-plan.test.tssrc/lib/actions/uninstall/run-plan.tssrc/lib/onboard.tssrc/lib/onboard/gateway-gpu-passthrough.tssrc/lib/onboard/gpu-recovery.test.tssrc/lib/onboard/gpu-recovery.tssrc/lib/onboard/host-gateway-process.test.tssrc/lib/onboard/host-gateway-process.tstest/cli.test.tstest/gateway-final-failure-cleanup.test.ts
Summary
Consolidates host-side
openshell-gatewaycleanup into a single shared helper used by uninstall, sandbox destroy, and onboard so directopenshell gateway destroy(and the destroy--cleanup-gatewayflow) can no longer leave/usr/local/bin/openshell-gatewayrunning on port 8080 on Linux Docker-driver hosts. Surfaces an actionablesudo pkill -f openshell-gatewayremediation when a privileged PID survives both lifecycle verbs.Related Issue
Closes #3516.
Changes
src/lib/onboard/host-gateway-process.ts— shared stopper with PID-file primary path, opt-outpgrepfallback for orphans, Docker-compat parent recognition (argv0 == dockerAND mount-path token), bounded SIGTERM/SIGKILL with verified exit viaps, sudo remediation log on root-owned survivors, and a warning whenpgrepis unavailable so an uninstaller never reports clean teardown while an orphan is still bound.src/lib/actions/uninstall/run-plan.ts— invokes the shared helper during the "Stopping services" step (the broader pgrep sweep is intentional for uninstall).src/lib/actions/sandbox/destroy.ts— replaces the local PID-file probe with the shared helper and disables the pgrep sweep sodestroy --cleanup-gatewaycannot terminate an unrelated host's openshell-gateway.src/lib/onboard.ts— routesstopDockerDriverGatewayProcessand the drift restart through the shared helper (drift restart targets only the supplied PID); adds thesudo pkill -f openshell-gatewayline to the final-failure recovery banner on Linux while keeping both lifecycle verbs.gpu-recovery.ts,gateway-gpu-passthrough.ts) and user-facing docs (docs/manage-sandboxes/lifecycle.mdx,docs/reference/commands.mdx,docs/reference/troubleshooting.mdx,scripts/install.sh) — instruct users to runopenshell gateway removefirst with the legacyopenshell gateway destroy -gfallback, thensudo pkill -f openshell-gatewayfor stuck privileged processes.Type of Change
Verification
vitest run src/lib/onboard/host-gateway-process.test.ts src/lib/actions/uninstall/run-plan.test.ts test/gateway-final-failure-cleanup.test.ts src/lib/onboard/gpu-recovery.test.ts src/lib/onboard/gateway-gpu-passthrough.test.ts test/cli.test.ts— 42 affected tests pass).npm run typecheck:cliclean.npm run build:cliclean.vim /opt/nemoclaw/openshell-gatewayand node cmdline mentioning openshell-gateway), Docker-compat parent matching, pgrep-skip when explicit PIDs are passed, pgrep-unavailable warning, and the existing uninstall-[Ubuntu 24.04][Install] nemoclaw uninstall does not kill running openshell-gateway host process — port 8080 leaks after uninstall #3516 path./tmp/.../openshell-gatewayPython listener on a unique port and confirming both PID-file-only (with pgrep disabled) and pgrep-only branches stop it via the builtdist/lib/onboard/host-gateway-process.js.Full
npm testwas attempted but produced environment-induced timeouts (system load 24 on 4 cores from parallel sessions); individual reruns of every test file in the change set pass.Signed-off-by: Yimo Jiang yimoj@nvidia.com
🤖 Generated with Claude Code
Summary by CodeRabbit
Documentation
Improvements
Tests