fix(onboard): clean up stale openclaw-gateway dashboard listeners#3405
Conversation
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds a stale gateway cleanup module that detects and terminates orphaned ChangesStale Gateway Cleanup
Sequence Diagram(s)sequenceDiagram
participant CLI
participant Stopper
participant LSOF
participant PS
participant Kernel
CLI->>Stopper: invoke stopStaleDashboardListeners({protectedPorts})
Stopper->>LSOF: run "lsof -iTCP:PORT_RANGE -sTCP:LISTEN -n -P"
LSOF-->>Stopper: return ports and PIDs
Stopper->>PS: run "ps -o user= -o args= -p PID"
PS-->>Stopper: return user and cmdline
Stopper->>Kernel: send SIGTERM to PID
Kernel-->>Stopper: signal delivered
Stopper->>Kernel: send SIGKILL if still alive
Kernel-->>Stopper: final exit status
Stopper-->>CLI: return CleanupResult
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryPi Semantic E2E AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/destroy.ts`:
- Around line 116-123: The sweep currently bails out when lsof is missing and
the subsequent pid-file cleanup only recognizes "openshell-gateway", which can
leave "openclaw-gateway" processes running; update the cleanup logic (the code
path after stopStaleDashboardListeners and stopDockerDriverGatewayProcess that
calls runOpenshell(["gateway","remove", NEMOCLAW_GATEWAY_NAME], ...)) to also
detect and remove openclaw-gateway PID files and/or attempt to stop any running
openclaw-gateway process when lsof is unavailable, ensuring both gateway names
("openshell-gateway" and "openclaw-gateway") are handled in the pid-file cleanup
routine so re-exec’d openclaw-gateway PIDs cannot survive and hold dashboard
ports.
In `@src/lib/onboard/stale-gateway-cleanup.ts`:
- Around line 161-164: The code currently swallows unexpected lsof exit codes by
returning [] when result.status is not 0/1; change this to surface the failure
by throwing or logging a clear error instead. Update the branch that checks
result.status (referencing result.status) to throw a descriptive Error (or call
the module's error logger) that includes result.status plus result.stderr and/or
result.stdout so unexpected lsof failures are not silently ignored and callers
can react per the function's contract.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 879fe512-a171-4e68-bbda-140fdc05532f
📒 Files selected for processing (5)
src/lib/actions/sandbox/destroy.tssrc/lib/actions/uninstall/run-plan.tssrc/lib/onboard.tssrc/lib/onboard/stale-gateway-cleanup.test.tssrc/lib/onboard/stale-gateway-cleanup.ts
…tion Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
… errors Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
… listening sockets Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 25747630830
|
|
Targeted E2E validation requested by the E2E Advisor has passed. Run: https://github.com/NVIDIA/NemoClaw/actions/runs/25747630830 Passed targeted jobs: PR review can proceed with that E2E signal in mind. |
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
ericksoa
left a comment
There was a problem hiding this comment.
Reviewed the current head. Required checks are green, CodeRabbit follow-ups are resolved in code, and the remaining cancelled Pi semantic recommendation check is non-required. Approving for squash merge.
Summary
When
openshell forward startis killed unexpectedly (failed onboard, container crash mid-build, upgrade across versions), the host-side process that holds the NemoClaw dashboard port can survive. The nextnemoclaw onboarddetects the conflict, falls back to a different port, and bakes the wrong port intoCHAT_UI_URL— the new sandbox builds but is never reachable. This PR adds a conservative, ownership- and cmdline-checked sweep of the dashboard port range and wires it into the three lifecycle entry points where it can recover safely, with aprotectedPortsopt-out so live forwards of other sandboxes are never disrupted.Related Issue
Fixes #3397.
Fixes #3398.
Changes
src/lib/onboard/stale-gateway-cleanup.ts: ownership- and cmdline-checked scan of the dashboard port range (18789-18799) with two-phase TERM-then-KILL and bounded waits. Mirrors the proventryStopOllamaProxyPidpattern from the uninstaller. Exposes aprotectedPortsoption so callers can opt-out specific ports.lsofprobe is scoped to listening sockets via-sTCP:LISTENto avoid signalling processes that merely have an in-flight client connection to the port.lsofexit statuses (anything other than 0/1) now surface throughdeps.warnso cleanup failures are visible instead of being swallowed.cleanupGatewayAfterLastSandboxinsrc/lib/actions/sandbox/destroy.tsafter the cooperativeopenshell forward stop, into theStopping servicesstep insrc/lib/actions/uninstall/run-plan.tsbetweenstopMatchingPidsandstopOrphanedOpenShell, and into the--freshbranch ofsrc/lib/onboard.tsbefore preflight so a fresh install never has to fight a leftover process from the previous attempt.--freshcall passes the dashboard ports of OTHER registered sandboxes asprotectedPorts, only unprotecting the explicit target when the operator passes--name(orNEMOCLAW_SANDBOX_NAME). Without an explicit target name we cannot tell which sandbox the user means to rebuild, so all registered ports stay protected and the upgrade/same-name recovery becomes opt-in via--name <existing>.isDockerDriverGatewayPidinsrc/lib/actions/sandbox/destroy.tsto matchopenclaw-gatewaycmdlines as well asopenshell-gateway, so the pid-file cleanup path covers the re-exec'd process name whenlsofis unavailable.openclaw-gateway/openshell forwardmarkers are killed; the scan degrades silently whenlsofis unavailable.src/lib/onboard/stale-gateway-cleanup.test.ts:lsofmissing -> no-op; no listeners in range -> no-op; user-ownedopenclaw-gateway-> killed; SIGTERM ignored -> SIGKILL escalation; foreign-owned -> skipped; cmdline mismatch -> skipped; same PID on multiple ports -> looked up once; protected port -> skipped without probing; protected PID on unprotected port -> still skipped.Port already in usesection ofdocs/reference/troubleshooting.md.Type of Change
Verification
Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
Bug Fixes
Tests
Documentation