Skip to content

fix(uninstall): kill leftover host openshell-gateway on cleanup (#3516)#4223

Merged
cv merged 2 commits into
NVIDIA:mainfrom
yimoj:fix/3516-host-gateway-cleanup
May 29, 2026
Merged

fix(uninstall): kill leftover host openshell-gateway on cleanup (#3516)#4223
cv merged 2 commits into
NVIDIA:mainfrom
yimoj:fix/3516-host-gateway-cleanup

Conversation

@yimoj

@yimoj yimoj commented May 26, 2026

Copy link
Copy Markdown
Contributor

Summary

Consolidates host-side openshell-gateway cleanup into a single shared helper used by uninstall, sandbox destroy, and onboard so direct openshell gateway destroy (and the destroy --cleanup-gateway flow) can no longer leave /usr/local/bin/openshell-gateway running on port 8080 on Linux Docker-driver hosts. Surfaces an actionable sudo pkill -f openshell-gateway remediation when a privileged PID survives both lifecycle verbs.

Related Issue

Closes #3516.

Changes

  • New src/lib/onboard/host-gateway-process.ts — shared stopper with PID-file primary path, opt-out pgrep fallback for orphans, Docker-compat parent recognition (argv0 == docker AND mount-path token), bounded SIGTERM/SIGKILL with verified exit via ps, sudo remediation log on root-owned survivors, and a warning when pgrep is unavailable so an uninstaller never reports clean teardown while an orphan is still bound.
  • src/lib/actions/uninstall/run-plan.ts — invokes the shared helper during the "Stopping services" step (the broader pgrep sweep is intentional for uninstall).
  • src/lib/actions/sandbox/destroy.ts — replaces the local PID-file probe with the shared helper and disables the pgrep sweep so destroy --cleanup-gateway cannot terminate an unrelated host's openshell-gateway.
  • src/lib/onboard.ts — routes stopDockerDriverGatewayProcess and the drift restart through the shared helper (drift restart targets only the supplied PID); adds the sudo pkill -f openshell-gateway line to the final-failure recovery banner on Linux while keeping both lifecycle verbs.
  • Recovery hints (gpu-recovery.ts, gateway-gpu-passthrough.ts) and user-facing docs (docs/manage-sandboxes/lifecycle.mdx, docs/reference/commands.mdx, docs/reference/troubleshooting.mdx, scripts/install.sh) — instruct users to run openshell gateway remove first with the legacy openshell gateway destroy -g fallback, then sudo pkill -f openshell-gateway for stuck privileged processes.

Type of Change

  • Code change with doc updates

Verification

  • Targeted unit tests pass (vitest run src/lib/onboard/host-gateway-process.test.ts src/lib/actions/uninstall/run-plan.test.ts test/gateway-final-failure-cleanup.test.ts src/lib/onboard/gpu-recovery.test.ts src/lib/onboard/gateway-gpu-passthrough.test.ts test/cli.test.ts — 42 affected tests pass).
  • npm run typecheck:cli clean.
  • npm run build:cli clean.
  • Tests added for: PID-file path, pgrep fallback, false-positive rejection (vim /opt/nemoclaw/openshell-gateway and node cmdline mentioning openshell-gateway), Docker-compat parent matching, pgrep-skip when explicit PIDs are passed, pgrep-unavailable warning, and the existing uninstall-[Ubuntu 24.04][Install] nemoclaw uninstall does not kill running openshell-gateway host process — port 8080 leaks after uninstall #3516 path.
  • E2E verified by spawning a fake /tmp/.../openshell-gateway Python listener on a unique port and confirming both PID-file-only (with pgrep disabled) and pgrep-only branches stop it via the built dist/lib/onboard/host-gateway-process.js.
  • No secrets committed.
  • Docs updated for user-facing behavior changes.

Full npm test was attempted but produced environment-induced timeouts (system load 24 on 4 cores from parallel sessions); individual reruns of every test file in the change set pass.


Signed-off-by: Yimo Jiang yimoj@nvidia.com

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation

    • Updated upgrade, onboarding, and troubleshooting guidance with clearer gateway cleanup steps and an added privileged-process remediation suggestion.
  • Improvements

    • More robust and conservative host gateway shutdown and cleanup behavior with clearer user-facing messages.
  • Tests

    • Added and adjusted tests to validate gateway shutdown paths and related messaging.

Review Change Stack

…VIDIA#3516)

PR NVIDIA#3554 stopped the host gateway process during `nemoclaw uninstall`, but
direct `openshell gateway destroy` (and the sandbox-destroy `--cleanup-gateway`
flow) on Linux Docker-driver mode can still leave `/usr/local/bin/openshell-gateway`
running on port 8080. Consolidate the cleanup logic in a shared helper used by
uninstall, sandbox destroy, and onboard, and update recovery hints to surface a
`sudo pkill -f openshell-gateway` remediation alongside the existing
`openshell gateway remove` / `destroy` verbs.

Key changes:
- New src/lib/onboard/host-gateway-process.ts: shared stopper with PID-file
  primary path, opt-out pgrep fallback for orphans, Docker compat parent
  recognition (argv0 == docker AND mount-path token), bounded
  SIGTERM/SIGKILL with verified exit via `ps`, sudo remediation log when a
  privileged PID survives, and a warning when pgrep is unavailable so an
  uninstaller never claims success while an orphan is still bound.
- src/lib/actions/uninstall/run-plan.ts: invoke the shared helper during the
  "Stopping services" step.
- src/lib/actions/sandbox/destroy.ts: replace the local PID-file probe with
  the shared helper and disable the pgrep sweep so a same-host gateway run
  for an unrelated project is not torn down by `destroy --cleanup-gateway`.
- src/lib/onboard.ts: route `stopDockerDriverGatewayProcess` and the drift
  restart through the shared helper (drift restart targets only the supplied
  PID); update the final-failure recovery banner to add the sudo pkill line
  on Linux while keeping both lifecycle verbs.
- Recovery hint helpers (gpu-recovery.ts, gateway-gpu-passthrough.ts) and
  user-facing docs (`docs/manage-sandboxes/lifecycle.mdx`,
  `docs/reference/commands.mdx`, `docs/reference/troubleshooting.mdx`,
  `scripts/install.sh`) tell users to run `gateway remove` first with the
  legacy `gateway destroy -g` fallback, then `sudo pkill -f openshell-gateway`
  for stuck privileged processes.

Tests:
- New unit tests cover the PID-file path, pgrep fallback, false-positive
  rejection (random `vim /opt/nemoclaw/openshell-gateway`, unrelated
  cmdlines mentioning openshell-gateway), Docker compat parent matching,
  pgrep-skip behaviour when explicit PIDs are passed, and the
  pgrep-unavailable warning.
- Existing uninstall + onboard tests updated for the new copy.
- E2E verified by spawning a fake `/tmp/.../openshell-gateway` that binds a
  unique port and confirming both PID-file-only and pgrep-only branches
  stop it via the built dist/ entry point.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yimoj yimoj force-pushed the fix/3516-host-gateway-cleanup branch from 3ae6a65 to 1e834bd Compare May 26, 2026 07:06
@coderabbitai

coderabbitai Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 297d2555-7a5d-4286-ab5b-ae487385925e

📥 Commits

Reviewing files that changed from the base of the PR and between 1e834bd and 4a15468.

📒 Files selected for processing (2)
  • src/lib/actions/uninstall/run-plan.test.ts
  • src/lib/onboard/host-gateway-process.test.ts

📝 Walkthrough

Walkthrough

Centralizes host-side openshell-gateway termination into stopHostGatewayProcesses, updates uninstall/destroy/onboard to use it, adds sudo pkill remediation guidance in docs and error messages, and extends tests/CLI to validate deterministic gateway-cleanup behavior.

Changes

Host gateway process management consolidation

Layer / File(s) Summary
Host gateway process helper and test suite
src/lib/onboard/host-gateway-process.ts, src/lib/onboard/host-gateway-process.test.ts
New module exports typed stopHostGatewayProcesses; resolves state dir/pidfile, discovers PIDs via file or pgrep, verifies cmdline/docker-compat, attempts SIGTERM→SIGKILL with polling, clears stale markers, and emits sudo-remediation guidance. Comprehensive Vitest suite validates behaviors.
Uninstall and sandbox destroy integration
src/lib/actions/uninstall/run-plan.ts, src/lib/actions/uninstall/run-plan.test.ts, src/lib/actions/sandbox/destroy.ts
Uninstall "Stopping services" and sandbox destroy's cleanup now call stopHostGatewayProcesses; tests use a shared exited set and ps stubs, and expected logs updated. Removed local PID-file helpers and node:os import from destroy.ts.
Onboard gateway stop and restart paths
src/lib/onboard.ts
Onboard's Docker-driver gateway stop and drift-restart now delegate to hostGatewayProcess.stopHostGatewayProcesses; final-start-failure diagnostics include a Linux-only sudo pkill -f openshell-gateway suggestion.
GPU recovery messages and manual cleanup documentation
src/lib/onboard/gpu-recovery.ts, src/lib/onboard/gpu-recovery.test.ts, src/lib/onboard/gateway-gpu-passthrough.ts, docs/manage-sandboxes/lifecycle.mdx, docs/reference/commands.mdx, docs/reference/troubleshooting.mdx, scripts/install.sh
Recovery hints, troubleshooting, and install manual-upgrade guidance now include sudo pkill -f openshell-gateway and recommend openshell gateway remove nemoclaw before destroy fallbacks; tests updated to assert presence/absence of the pkill guidance.
CLI and platform-specific test strengthening
test/cli.test.ts, test/gateway-final-failure-cleanup.test.ts
Destroy tests stub pgrep/lsof to ensure deterministic gateway cleanup, add explicit 30s timeouts to runs, and conditionally assert the sudo pkill remediation on Linux only.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3965: Related GPU passthrough recovery message changes and branching logic (both touch gpu-recovery message generation).

Suggested labels

fix, Sandbox, OpenShell, NV QA, v0.0.51

Suggested reviewers

  • ericksoa

Poem

🐰 I hopped through logs and PID-file trails,

I nudged the docs and fixed the rails,
A pkill hint, a gentler stop—
Now orphaned gateways cease to hop,
Port 8080 sleeps without fails.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 2.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main change: fixing uninstall to properly kill leftover host openshell-gateway processes on cleanup, directly addressing issue #3516.
Linked Issues check ✅ Passed The PR comprehensively addresses all coding objectives from issue #3516: it implements host-process gateway detection and termination via a new shared helper module, integrates it into uninstall/destroy/onboard flows, provides sudo remediation messaging, and includes extensive test coverage.
Out of Scope Changes check ✅ Passed All changes are tightly scoped to addressing issue #3516: new host-gateway-process module, updates to lifecycle commands, test coverage, and user documentation—all directly supporting the core objective of reliably detecting and stopping host-mode openshell-gateway processes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

…VIDIA#3516)

The previous mock PIDs (7777, 99887, 42, 111, 222, 123, 456, 5151, 6262)
could collide with real /proc entries on a busy CI runner. When that
happens, `processArgs` reads the real `/proc/<pid>/cmdline` (its first
choice, before the ps mock) and `hostGatewayCmdlineMatches` rejects the
foreign cmdline, so the candidate gets routed to `skippedNonMatchingPids`
instead of `stopped`. The GitHub "basic checks" runner hit this on PID
7777 and the drift-restart test failed with `expected [] to deeply equal
[ 7777 ]`.

Use PIDs above the default kernel `pid_max` (4194304) so `/proc/<pid>/cmdline`
never exists and the ps mock wins deterministically.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/host-gateway-process.test.ts`:
- Around line 226-253: The test "skips pgrep sweep when explicit PIDs are passed
(drift restart)" is non-deterministic because real /proc/<pid>/cmdline can be
read for PID 7777; update the test so stopHostGatewayProcesses always uses the
mocked ps response by stubbing FS access for that proc path: before calling
stopHostGatewayProcesses, stub fs.readFileSync (or fs.existsSync/statSync) for
`/proc/7777/cmdline` to throw ENOENT (or return an empty/non-matching value),
ensuring the code path that queries psResponses (via run/tracedRun and
psResponses) is exercised; reference symbols: the test name, tracedRun, run,
psResponses and stopHostGatewayProcesses so you change only the test setup to
stub FS access for that specific proc path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4cb7b1dc-6097-479f-baa1-b7844c0e8d24

📥 Commits

Reviewing files that changed from the base of the PR and between 206737f and 1e834bd.

📒 Files selected for processing (15)
  • docs/manage-sandboxes/lifecycle.mdx
  • docs/reference/commands.mdx
  • docs/reference/troubleshooting.mdx
  • scripts/install.sh
  • src/lib/actions/sandbox/destroy.ts
  • src/lib/actions/uninstall/run-plan.test.ts
  • src/lib/actions/uninstall/run-plan.ts
  • src/lib/onboard.ts
  • src/lib/onboard/gateway-gpu-passthrough.ts
  • src/lib/onboard/gpu-recovery.test.ts
  • src/lib/onboard/gpu-recovery.ts
  • src/lib/onboard/host-gateway-process.test.ts
  • src/lib/onboard/host-gateway-process.ts
  • test/cli.test.ts
  • test/gateway-final-failure-cleanup.test.ts

Comment thread src/lib/onboard/host-gateway-process.test.ts
@wscurran

Copy link
Copy Markdown
Contributor

@cv cv added v0.0.52 Release target v0.0.53 Release target and removed v0.0.51 Release target v0.0.52 Release target labels May 26, 2026
@ericksoa ericksoa added v0.0.55 and removed v0.0.53 Release target labels May 27, 2026
@jyaunches jyaunches added R1 v0.0.56 Release target and removed v0.0.55 labels May 29, 2026
@cv cv merged commit 0d7c6d5 into NVIDIA:main May 29, 2026
23 checks passed
@wscurran wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression and removed OpenShell labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression v0.0.56 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][Install] nemoclaw uninstall does not kill running openshell-gateway host process — port 8080 leaks after uninstall

5 participants