fix(status): explain cloudflared-stopped reason and surface Connected/Inference fields#3534
fix(status): explain cloudflared-stopped reason and surface Connected/Inference fields#3534prekshivyas wants to merge 2 commits into
Conversation
…/Inference fields `nemoclaw status` previously printed `● cloudflared (stopped)` in three distinct failure modes (no PID file, garbage PID, dead/wrong-process PID) with no cause and no remediation — exactly the symptom #2604 reported. The doctor already distinguished the three modes; the status renderer just never picked up the same logic. Extract the shared check into a new `readCloudflaredState(pidDir)` in src/lib/tunnel/services.ts that returns a discriminated union, and have both `showStatus()` and the doctor's `cloudflaredDoctorCheck` consume it. showStatus now emits a yellow/red marker plus a one-line remediation: ● cloudflared (stopped) start when needed with `nemoclaw tunnel start` ● cloudflared (stale PID file) run `nemoclaw tunnel stop` and start it again if you need a public tunnel ● cloudflared (stale PID 999999999) run `nemoclaw tunnel stop` to clean up the service state Bare `nemoclaw status` also surfaces the configured Inference (provider / model) and Connected (active-session count) as labeled fields under each sandbox row, matching what was previously only available via the per- sandbox `nemoclaw <name> status`. `getActiveSessionCount` is wired through `buildStatusCommandDeps`, mirroring the cached SSH-process probe already used by `buildListCommandDeps`. Fixes #2604 Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
📝 WalkthroughWalkthroughAdds a discriminated CloudflaredState and readCloudflaredState(), refactors showStatus()/doctor to use it with targeted remediation hints, and extends nemoclaw status output with per-sandbox Inference (provider/model) and Connected (SSH session count) lines backed by a cached SSH-process probe. ChangesStatus Diagnostics and Display
Sequence DiagramsequenceDiagram
participant User
participant StatusCmd as showStatusCommand
participant CloudState as readCloudflaredState
participant SSHProbe as getActiveSessionCount (cached)
participant Output
User->>StatusCmd: nemoclaw status
StatusCmd->>CloudState: readCloudflaredState(pidDir)
CloudState-->>StatusCmd: CloudflaredState (running/stopped/stale-pid-file/stale-pid-process)
StatusCmd->>SSHProbe: getActiveSessionCount(sandboxName)
SSHProbe-->>StatusCmd: session count | null
StatusCmd->>Output: emit Inference: provider / model
alt session count available
StatusCmd->>Output: emit Connected: yes/no (count)
end
StatusCmd-->>User: formatted status rows with hints
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Selective E2E Results — ✅ All requested jobs passedRun: 25877325009
|
…s; run tunnel start to restart' Initial PR copied doctor's existing hint wording, which says "run `nemoclaw tunnel stop`" for stale-pid states. In the bare `nemoclaw status` context after `pkill cloudflared`, the user wants the tunnel back — they don't want a "stop" command for something already not running. Reviewer comments on #2604 (wangericnv 2026-05-11, cv 2026- 05-14) both expected the shape `no cloudflared process; restart with ...` — a cause phrase plus a single-command recovery. Reword all three failure modes to that shape and point at `nemoclaw tunnel start` (which already handles stale PID files — `isRunning()` returns false, `startService()` proceeds and overwrites the file). Apply the same wording in doctor.ts so the two diagnostic paths stay consistent. Update services.test.ts assertions to lock in the new phrasing; doctor JSON tests in cli.test.ts only assert status/detail and remain unaffected. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/tunnel/services.ts`:
- Around line 134-156: readCloudflaredState can detect a live but
non-cloudflared process (returns "stale-pid-process"/"stale-pid-file") but
stopService currently only uses isAlive(pid) before signaling, which risks
killing an unrelated process; update stopService to call
readCloudflaredState(pidDir) and only send SIGTERM/SIGKILL when the result.kind
=== "running" (use the returned pid), and treat "stale-pid-process" and
"stale-pid-file" as no-op/failure (do not signal); replace any direct
isAlive(pid) checks in stopService with this state-based guard to prevent
terminating recycled PIDs.
- Around line 127-131: The predicate commandLineNamesCloudflared currently scans
the entire argv string and can match incidental tokens; change it to match the
actual executable name (comm) instead of any argv token: update call sites
(including readCloudflaredState) to pass the process comm/command-name (or parse
the `comm` column from the ps fallback) and then compare basename(comm) ===
"cloudflared"; if you must still inspect args as a fallback, keep that as a
secondary check but prefer comm for deciding running vs stale-pid-process.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 46a52743-73cb-4501-9463-8896b74b9a5f
📒 Files selected for processing (3)
src/lib/actions/sandbox/doctor.tssrc/lib/tunnel/services.test.tssrc/lib/tunnel/services.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- src/lib/tunnel/services.test.ts
- src/lib/actions/sandbox/doctor.ts
| function commandLineNamesCloudflared(commandLine: string): boolean { | ||
| return commandLine | ||
| .split(/\0|\s+/) | ||
| .filter(Boolean) | ||
| .some((token) => basename(token) === "cloudflared"); |
There was a problem hiding this comment.
Match the executable, not any argv token.
This predicate returns true for unrelated processes whose args happen to contain the word cloudflared (for example python app.py cloudflared or sh -lc 'echo cloudflared'), so readCloudflaredState() can incorrectly report running instead of stale-pid-process.
Suggested fix
function commandLineNamesCloudflared(commandLine: string): boolean {
- return commandLine
- .split(/\0|\s+/)
- .filter(Boolean)
- .some((token) => basename(token) === "cloudflared");
+ const [argv0] = commandLine.split(/\0/).filter(Boolean);
+ if (!argv0) return false;
+ return basename(argv0) === "cloudflared";
}If you still need the ps fallback, parse comm separately and compare that column instead of scanning the whole args string.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| function commandLineNamesCloudflared(commandLine: string): boolean { | |
| return commandLine | |
| .split(/\0|\s+/) | |
| .filter(Boolean) | |
| .some((token) => basename(token) === "cloudflared"); | |
| function commandLineNamesCloudflared(commandLine: string): boolean { | |
| const [argv0] = commandLine.split(/\0/).filter(Boolean); | |
| if (!argv0) return false; | |
| return basename(argv0) === "cloudflared"; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/tunnel/services.ts` around lines 127 - 131, The predicate
commandLineNamesCloudflared currently scans the entire argv string and can match
incidental tokens; change it to match the actual executable name (comm) instead
of any argv token: update call sites (including readCloudflaredState) to pass
the process comm/command-name (or parse the `comm` column from the ps fallback)
and then compare basename(comm) === "cloudflared"; if you must still inspect
args as a fallback, keep that as a secondary check but prefer comm for deciding
running vs stale-pid-process.
| export function readCloudflaredState(pidDir: string): CloudflaredState { | ||
| const pidFile = join(pidDir, "cloudflared.pid"); | ||
| if (!existsSync(pidFile)) return { kind: "stopped" }; | ||
| let raw: string; | ||
| try { | ||
| raw = readFileSync(pidFile, "utf-8").trim(); | ||
| } catch { | ||
| return { kind: "stopped" }; | ||
| } | ||
| if (raw.length === 0) return { kind: "stopped" }; | ||
| const pid = Number(raw); | ||
| if (!Number.isFinite(pid) || pid <= 0) return { kind: "stale-pid-file" }; | ||
| try { | ||
| process.kill(pid, 0); | ||
| } catch { | ||
| return { kind: "stale-pid-process", pid }; | ||
| } | ||
| const cmdline = readProcessCommandLine(pid); | ||
| if (cmdline !== null && !commandLineNamesCloudflared(cmdline)) { | ||
| return { kind: "stale-pid-process", pid }; | ||
| } | ||
| return { kind: "running", pid }; | ||
| } |
There was a problem hiding this comment.
Use this stale-PID classification before signaling the process.
readCloudflaredState() can now prove that the pidfile points at a live non-cloudflared process, but stopService() still only checks isAlive(pid) before sending SIGTERM/SIGKILL. A recycled PID can therefore still terminate an unrelated process when the user runs nemoclaw stop.
Suggested direction
function stopService(pidDir: string, name: ServiceName): void {
+ if (name === "cloudflared") {
+ const state = readCloudflaredState(pidDir);
+ if (state.kind !== "running") {
+ if (state.kind !== "stopped") {
+ removePid(pidDir, name);
+ }
+ info(`${name} was not running`);
+ return;
+ }
+ }
+
const pid = readPid(pidDir, name);
if (pid === null) {
info(`${name} was not running`);
return;
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/tunnel/services.ts` around lines 134 - 156, readCloudflaredState can
detect a live but non-cloudflared process (returns
"stale-pid-process"/"stale-pid-file") but stopService currently only uses
isAlive(pid) before signaling, which risks killing an unrelated process; update
stopService to call readCloudflaredState(pidDir) and only send SIGTERM/SIGKILL
when the result.kind === "running" (use the returned pid), and treat
"stale-pid-process" and "stale-pid-file" as no-op/failure (do not signal);
replace any direct isAlive(pid) checks in stopService with this state-based
guard to prevent terminating recycled PIDs.
|
Superseded by #3537 — same code, one squashed commit, this time properly SSH-signed (my local clone had |
Selective E2E Results — ✅ All requested jobs passedRun: 25878386121
|
…/Inference fields (#3537) > Re-signed replacement for #3534. Same code, single squashed commit, this time SSH-signed (the original branch had `commit.gpgsign=false` set as a local override on my clone — the prior PR's commits ended up unverified). Force-push is blocked on the original branch, so this is a fresh branch + new PR; #3534 will be closed in favour of this one. ## Summary Fixes #2604. Both @wangericnv (2026-05-11) and @cv (2026-05-14) re-reported the same symptom on v0.0.38 / v0.0.41 — `nemoclaw status` prints `● cloudflared (stopped)` in all three failure modes (no PID file, garbage PID, dead/wrong-process PID) with no cause and no remediation. The bare command also omits the `Connected:` / `Inference:` labeled fields the original bug requested. Both symptoms are addressed here. ## Root cause - **Cloudflared diagnostic.** `src/lib/actions/sandbox/doctor.ts` already distinguished three states and emitted matching hints. `showStatus()` in `src/lib/tunnel/services.ts` was written before that and only had `isRunning() ? ok : "(stopped)"` — every failure mode collapsed into the same un-actionable line. - **Bare-status fields.** PR #2884 added `Inference:` / `Connected:` labels to the per-sandbox `nemoclaw <name> status` but left bare `nemoclaw status` showing only the model in parens. ## What this PR does 1. **Share the cloudflared state machine.** New `readCloudflaredState(pidDir)` in `src/lib/tunnel/services.ts` returns a discriminated union `{ kind: "running" | "stopped" | "stale-pid-file" | "stale-pid-process" }`. `showStatus()` switches on it and emits a coloured marker + one-line remediation. `cloudflaredDoctorCheck()` consumes the same function and translates each state into its `DoctorCheck`, removing duplicated PID-file / `/proc/<pid>/cmdline` logic. 2. **Remediation wording — `no cloudflared process; restart with ...`.** All three failure modes lead with the cause and point at `nemoclaw tunnel start`. Both reporters asked for that exact shape. `nemoclaw tunnel start` already handles all three states because `isRunning()` returns false and `startService` proceeds and overwrites any stale PID file — so one command recovers in every case. The same wording is used in `doctor.ts` for consistency. 3. **Bare-status `Inference:` and `Connected:` lines.** `showStatusCommand` in `src/lib/inventory/index.ts` now prints labeled `Inference: <provider> / <model>` and `Connected: yes (N session) / no` under each sandbox row. Provider/model prefer live gateway values for the default sandbox (consistent with the existing `(model)` rendering, #2369). `getActiveSessionCount` is wired through `buildStatusCommandDeps`, mirroring the cached SSH-process probe already used by `buildListCommandDeps`. 4. **Tests** (15 new). Three cases each for `showStatus` failure-mode rendering, five cases for `readCloudflaredState`, seven for bare-status `Inference:` / `Connected:` rendering across live/stored/missing dep variants. All existing cli-doctor tests still pass against the refactored shared function. ## Behavioural diff Before (v0.0.41 baseline): ``` ● cloudflared (stopped) # identical output for all three failure modes ``` After: ``` ● cloudflared (stopped) no cloudflared process; run `nemoclaw tunnel start` to start it ● cloudflared (stale PID file) no cloudflared process (stored PID is invalid); run `nemoclaw tunnel start` to restart it ● cloudflared (stale PID 999999999) no cloudflared process (PID 999999999 is dead or not cloudflared); run `nemoclaw tunnel start` to restart it ``` Bare status sandbox row: ``` # Before test-sandbox * (qwen2.5:7b) :18789 # After test-sandbox * (qwen2.5:7b) :18789 Inference: ollama-local / qwen2.5:7b Connected: no ``` ## Brev reproduction — 3× per case, baseline and fix (Run on the prior PR #3534 branch, same code as here.) Fresh `n2d-standard-2` (Ubuntu 22.04 / Linux 6.8 GCP), v0.0.41 from tag, faked registry, three PID-dir states 3× each. Re-installed from this branch. <details> <summary><b>Baseline v0.0.41 — 9/9 runs reproduce the bug</b></summary> ``` ### Case: stopped — no PID file --- Run 1 --- ● cloudflared (stopped) --- Run 2 --- ● cloudflared (stopped) --- Run 3 --- ● cloudflared (stopped) ### Case: stale-pid-file — garbage PID contents --- Run 1 --- ● cloudflared (stopped) --- Run 2 --- ● cloudflared (stopped) --- Run 3 --- ● cloudflared (stopped) ### Case: stale-pid-process — PID is dead --- Run 1 --- ● cloudflared (stopped) --- Run 2 --- ● cloudflared (stopped) --- Run 3 --- ● cloudflared (stopped) ``` </details> <details> <summary><b>Fix branch — 9/9 runs emit a state-specific remediation</b></summary> Same three-case 3× harness, with the new wording. Identical output across reruns within each case — no flake. </details> ## Out of scope (filed/to-file as follow-ups) - **Auto-restart on crash.** Needs a watchdog daemon / systemd unit / `while true; cloudflared || sleep 5` shim. Real design work, separate PR. Manual `pkill cloudflared` is not preventable from inside the CLI either way; the right answer there is actionable status output, which this PR provides. - **Intermittent cloudflared exits in nightly E2E (#3494).** Different scope — CI-infra fault attribution. #3517 covers it; touches disjoint files. ## Test plan - [x] `npm run build:cli` clean - [x] `npx tsc -p tsconfig.cli.json` clean - [x] `vitest run src/lib/tunnel/services.test.ts` — 23 pass - [x] `vitest run src/lib/inventory/index.test.ts` — 31 pass - [x] `vitest run test/cli.test.ts -t doctor` — 4/4 pass (covers refactored `cloudflaredDoctorCheck`) - [x] Brev `n2d-standard-2` repro: 9/9 baseline reproduces; 9/9 fix shows the new diagnostic + remediation lines, no flake - [x] Commit SSH-signed and verified by GitHub (`reason: valid`) Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * Status command now displays inference provider and model per sandbox. * Status command includes active SSH session count when available. * **Bug Fixes** * Enhanced cloudflared health checks with refined state detection (running, stopped, stale PID). * Improved remediation hints for cloudflared diagnostic messages. * **Tests** * Added tests for expanded status output with provider, model, and session information. * Added tests for cloudflared state detection across various scenarios. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3537) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Summary
Fixes #2604. The reopen by
@wangericnvon 2026-05-11 is correct —nemoclaw statusstill prints● cloudflared (stopped)with no cause and no remediation on v0.0.41, in all three failure modes. The bare command also still omits theConnected:andInference:labeled fields the original bug requested. Both symptoms are addressed here.Root cause
src/lib/actions/sandbox/doctor.ts:331-358already distinguishes three states (no PID file →stopped; PID file with garbage →stale PID file; PID with dead/wrong process →stale PID N) and emits a matchingtunnel start/tunnel stophint.showStatus()insrc/lib/tunnel/services.tswas written before that and only hadisRunning() ? ok : "(stopped)"— every failure mode collapsed into the same un-actionable line.Inference:/Connected:labels to the per-sandboxnemoclaw <name> statusbut left barenemoclaw statusshowing only the model in parens.What this PR does
Share the cloudflared state machine. New
readCloudflaredState(pidDir)insrc/lib/tunnel/services.tsreturns a discriminated union{ kind: "running" | "stopped" | "stale-pid-file" | "stale-pid-process" }.showStatus()switches on it and emits a coloured marker + one-line remediation.cloudflaredDoctorCheck()indoctor.tsnow consumes the same function and translates each state into itsDoctorCheck, removing ~50 lines of duplicated PID-file //proc/<pid>/cmdlinelogic.Bare-status
Inference:andConnected:lines.showStatusCommandinsrc/lib/inventory/index.tsnow prints labeledInference: <provider> / <model>andConnected: yes (N session) / nounder each sandbox row. Provider/model prefer live gateway values for the default sandbox (consistent with the existing(model)rendering, [NemoClaw][Linux][CLI&UX]nemoclaw list/statusshows stale model afteropenshell inference set— live gateway state is queried but result is discarded #2369).getActiveSessionCountis wired throughbuildStatusCommandDeps, mirroring the cached SSH-process probe already used bybuildListCommandDeps.Tests (15 new). Three cases each for
showStatusfailure-mode rendering, five cases forreadCloudflaredState, seven for bare-statusInference:/Connected:rendering across live/stored/missing dep variants. All existing cli-doctor tests still pass against the refactored shared function.Behavioural diff
Before (v0.0.41 baseline, on Brev
n2d-standard-2/ Ubuntu 22.04, 9 runs × 3 failure modes):After (this branch, same env, same 9 × 3 runs):
Bare status sandbox row:
Brev reproduction — 3× per case, baseline and fix
Spun up a fresh
n2d-standard-2(Ubuntu 22.04 / Linux 6.8 GCP) — closest to@wangericnv's Ubuntu 24.04 / RTX 6000 Ada repro env. Installed v0.0.41 from the tag, faked a registry entry fortest-sandbox, ran the three PID-dir states 3× each. Same script, same registry, after installing this branch. Output captured verbatim:Baseline v0.0.41 — 9/9 runs reproduce the bug
Fix branch — 9/9 runs emit a state-specific remediation
Full bare-status output with sandbox row labels:
Out of scope (filed/to-file as follow-ups)
while true; cloudflared || sleep 5shim. Real design work — not a 1-PR change. Manualpkill cloudflared(the wangericnv repro) is by definition not preventable from inside the CLI either way; the right answer there is to make the status output actionable, which this PR does.Test plan
npm run build:clicleannpx tsc -p tsconfig.cli.jsoncleanvitest run src/lib/tunnel/services.test.ts— 23 pass (15 pre-existing + 8 new)vitest run src/lib/inventory/index.test.ts— 31 pass (24 pre-existing + 7 new)vitest run test/cli.test.ts -t doctor— 4/4 pass (covers the refactoredcloudflaredDoctorCheck)n2d-standard-2repro: 9/9 baseline runs show the bug; 9/9 fix runs show the new diagnostic + remediation lines, no flakeSigned-off-by: Prekshi Vyas prekshiv@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes
Tests