Skip to content

fix(status): explain cloudflared-stopped reason and surface Connected/Inference fields#3534

Closed
prekshivyas wants to merge 2 commits into
mainfrom
fix/2604-cloudflared-status-diagnostics
Closed

fix(status): explain cloudflared-stopped reason and surface Connected/Inference fields#3534
prekshivyas wants to merge 2 commits into
mainfrom
fix/2604-cloudflared-status-diagnostics

Conversation

@prekshivyas

@prekshivyas prekshivyas commented May 14, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #2604. The reopen by @wangericnv on 2026-05-11 is correct — nemoclaw status still prints ● cloudflared (stopped) with no cause and no remediation on v0.0.41, in all three failure modes. The bare command also still omits the Connected: and Inference: labeled fields the original bug requested. Both symptoms are addressed here.

Root cause

  • Cloudflared diagnostic. src/lib/actions/sandbox/doctor.ts:331-358 already distinguishes three states (no PID file → stopped; PID file with garbage → stale PID file; PID with dead/wrong process → stale PID N) and emits a matching tunnel start / tunnel stop hint. showStatus() in src/lib/tunnel/services.ts was written before that and only had isRunning() ? ok : "(stopped)" — every failure mode collapsed into the same un-actionable line.
  • Bare-status fields. PR fix(status): require verified gateway before healthy inference #2884 added Inference: / Connected: labels to the per-sandbox nemoclaw <name> status but left bare nemoclaw status showing only the model in parens.

What this PR does

  1. Share the cloudflared state machine. New readCloudflaredState(pidDir) in src/lib/tunnel/services.ts returns a discriminated union { kind: "running" | "stopped" | "stale-pid-file" | "stale-pid-process" }. showStatus() switches on it and emits a coloured marker + one-line remediation. cloudflaredDoctorCheck() in doctor.ts now consumes the same function and translates each state into its DoctorCheck, removing ~50 lines of duplicated PID-file / /proc/<pid>/cmdline logic.

  2. Bare-status Inference: and Connected: lines. showStatusCommand in src/lib/inventory/index.ts now prints labeled Inference: <provider> / <model> and Connected: yes (N session) / no under each sandbox row. Provider/model prefer live gateway values for the default sandbox (consistent with the existing (model) rendering, [NemoClaw][Linux][CLI&UX] nemoclaw list / status shows stale model after openshell inference set — live gateway state is queried but result is discarded #2369). getActiveSessionCount is wired through buildStatusCommandDeps, mirroring the cached SSH-process probe already used by buildListCommandDeps.

  3. Tests (15 new). Three cases each for showStatus failure-mode rendering, five cases for readCloudflaredState, seven for bare-status Inference: / Connected: rendering across live/stored/missing dep variants. All existing cli-doctor tests still pass against the refactored shared function.

Behavioural diff

Before (v0.0.41 baseline, on Brev n2d-standard-2 / Ubuntu 22.04, 9 runs × 3 failure modes):

● cloudflared  (stopped)        # all three modes, identical output

After (this branch, same env, same 9 × 3 runs):

● cloudflared  (stopped)
    start when needed with `nemoclaw tunnel start`

● cloudflared  (stale PID file)
    run `nemoclaw tunnel stop` and start it again if you need a public tunnel

● cloudflared  (stale PID 999999999)
    run `nemoclaw tunnel stop` to clean up the service state

Bare status sandbox row:

# Before
test-sandbox * (qwen2.5:7b) :18789

# After
test-sandbox * (qwen2.5:7b) :18789
  Inference: ollama-local / qwen2.5:7b
  Connected: no

Brev reproduction — 3× per case, baseline and fix

Spun up a fresh n2d-standard-2 (Ubuntu 22.04 / Linux 6.8 GCP) — closest to @wangericnv's Ubuntu 24.04 / RTX 6000 Ada repro env. Installed v0.0.41 from the tag, faked a registry entry for test-sandbox, ran the three PID-dir states 3× each. Same script, same registry, after installing this branch. Output captured verbatim:

Baseline v0.0.41 — 9/9 runs reproduce the bug
## nemoclaw status — `nemoclaw v0.0.41` on Linux 6.8.0-1054-gcp x86_64

### Case: stopped — no PID file
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)

### Case: stale-pid-file — garbage PID contents
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)

### Case: stale-pid-process — PID is dead
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)
Fix branch — 9/9 runs emit a state-specific remediation
## nemoclaw status — `nemoclaw v0.1.0` (fix branch) on Linux 6.8.0-1054-gcp x86_64

### Case: stopped — no PID file
--- Run 1 ---
  ● cloudflared  (stopped)
      start when needed with `nemoclaw tunnel start`
--- Run 2 ---
  ● cloudflared  (stopped)
      start when needed with `nemoclaw tunnel start`
--- Run 3 ---
  ● cloudflared  (stopped)
      start when needed with `nemoclaw tunnel start`

### Case: stale-pid-file — garbage PID contents
--- Run 1 ---
  ● cloudflared  (stale PID file)
      run `nemoclaw tunnel stop` and start it again if you need a public tunnel
--- Run 2 ---
  ● cloudflared  (stale PID file)
      run `nemoclaw tunnel stop` and start it again if you need a public tunnel
--- Run 3 ---
  ● cloudflared  (stale PID file)
      run `nemoclaw tunnel stop` and start it again if you need a public tunnel

### Case: stale-pid-process — PID is dead
--- Run 1 ---
  ● cloudflared  (stale PID 999999999)
      run `nemoclaw tunnel stop` to clean up the service state
--- Run 2 ---
  ● cloudflared  (stale PID 999999999)
      run `nemoclaw tunnel stop` to clean up the service state
--- Run 3 ---
  ● cloudflared  (stale PID 999999999)
      run `nemoclaw tunnel stop` to clean up the service state

Full bare-status output with sandbox row labels:

  Sandboxes:
    test-sandbox * (qwen2.5:7b) :18789
      Inference: ollama-local / qwen2.5:7b
      Connected: no

Out of scope (filed/to-file as follow-ups)

Test plan

  • npm run build:cli clean
  • npx tsc -p tsconfig.cli.json clean
  • vitest run src/lib/tunnel/services.test.ts — 23 pass (15 pre-existing + 8 new)
  • vitest run src/lib/inventory/index.test.ts — 31 pass (24 pre-existing + 7 new)
  • vitest run test/cli.test.ts -t doctor — 4/4 pass (covers the refactored cloudflaredDoctorCheck)
  • Brev n2d-standard-2 repro: 9/9 baseline runs show the bug; 9/9 fix runs show the new diagnostic + remediation lines, no flake

Signed-off-by: Prekshi Vyas prekshiv@nvidia.com

Summary by CodeRabbit

  • New Features

    • Status command now shows inference provider/model per sandbox and a conditional "Connected:" line with active session counts when available.
  • Bug Fixes

    • More precise cloudflared status detection with distinct states and tailored remediation hints for stopped, stale PID file, and stale PID process cases.
  • Tests

    • Added tests covering status output variations and cloudflared state branches.

Review Change Stack

…/Inference fields

`nemoclaw status` previously printed `● cloudflared (stopped)` in three
distinct failure modes (no PID file, garbage PID, dead/wrong-process PID)
with no cause and no remediation — exactly the symptom #2604 reported.
The doctor already distinguished the three modes; the status renderer
just never picked up the same logic.

Extract the shared check into a new `readCloudflaredState(pidDir)` in
src/lib/tunnel/services.ts that returns a discriminated union, and have
both `showStatus()` and the doctor's `cloudflaredDoctorCheck` consume it.
showStatus now emits a yellow/red marker plus a one-line remediation:

  ● cloudflared  (stopped)
      start when needed with `nemoclaw tunnel start`

  ● cloudflared  (stale PID file)
      run `nemoclaw tunnel stop` and start it again if you need a public tunnel

  ● cloudflared  (stale PID 999999999)
      run `nemoclaw tunnel stop` to clean up the service state

Bare `nemoclaw status` also surfaces the configured Inference (provider /
model) and Connected (active-session count) as labeled fields under each
sandbox row, matching what was previously only available via the per-
sandbox `nemoclaw <name> status`. `getActiveSessionCount` is wired
through `buildStatusCommandDeps`, mirroring the cached SSH-process probe
already used by `buildListCommandDeps`.

Fixes #2604

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 14, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Adds a discriminated CloudflaredState and readCloudflaredState(), refactors showStatus()/doctor to use it with targeted remediation hints, and extends nemoclaw status output with per-sandbox Inference (provider/model) and Connected (SSH session count) lines backed by a cached SSH-process probe.

Changes

Status Diagnostics and Display

Layer / File(s) Summary
CloudflaredState detection + tests
src/lib/tunnel/services.ts, src/lib/tunnel/services.test.ts
Adds exported CloudflaredState union and readCloudflaredState(pidDir) that classifies running, stopped, stale-pid-file, and stale-pid-process with platform cmdline checks; tests validate each branch.
showStatus() refactor and tests
src/lib/tunnel/services.ts, src/lib/tunnel/services.test.ts
showStatus() now uses readCloudflaredState() and emits state-specific status lines and remediation hints; cloudflared.log parsing is only done when state is running; tests updated for stopped/stale scenarios.
Doctor integration (cloudflared)
src/lib/actions/sandbox/doctor.ts
cloudflaredDoctorCheck() now maps readCloudflaredState(...).kind to DoctorCheck results with updated detail/hint strings for stopped and stale states.
SSH-session probing and caching
src/lib/status-command-deps.ts
buildStatusCommandDeps() conditionally creates OpenShell session deps, lazily caches getSshProcesses() per-status invocation, and exposes getActiveSessionCount which parses cached SSH output or returns null on failure.
Status command Inference & Connected lines
src/lib/inventory/index.ts, src/lib/inventory/index.test.ts
ShowStatusCommandDeps gains optional getActiveSessionCount; showStatusCommand() emits Inference: provider / model (preferring live gateway for default sandbox) and Connected: yes/no (with count and correct singular/plural) when probe is available; tests added for rendering and omission cases.

Sequence Diagram

sequenceDiagram
  participant User
  participant StatusCmd as showStatusCommand
  participant CloudState as readCloudflaredState
  participant SSHProbe as getActiveSessionCount (cached)
  participant Output

  User->>StatusCmd: nemoclaw status
  StatusCmd->>CloudState: readCloudflaredState(pidDir)
  CloudState-->>StatusCmd: CloudflaredState (running/stopped/stale-pid-file/stale-pid-process)
  StatusCmd->>SSHProbe: getActiveSessionCount(sandboxName)
  SSHProbe-->>StatusCmd: session count | null
  StatusCmd->>Output: emit Inference: provider / model
  alt session count available
    StatusCmd->>Output: emit Connected: yes/no (count)
  end
  StatusCmd-->>User: formatted status rows with hints
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • #2604: This PR implements fine-grained cloudflared state detection and remediation messaging referenced by the issue.

Possibly related PRs

  • NVIDIA/NemoClaw#3402: Modifies nemoclaw status flow and related ShowStatusCommandDeps behavior; likely related to status-output changes.

Suggested labels

NemoClaw CLI, fix, observability, Sandbox, v0.0.42

Suggested reviewers

  • jyaunches
  • cjagwani
  • cv

Poem

🐰 I hopped through PID files and proc line trails,
Found clouds that were sleeping and stale,
I counted sessions with a careful eye,
Now status shows inference and whether you're nigh,
Tunnel hints fixed — a rabbit's small exhale.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly reflects the main changes: explaining cloudflared-stopped reasons and adding Connected/Inference fields to status output.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/2604-cloudflared-status-diagnostics

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: deployment-services-e2e, diagnostics-e2e
Optional E2E: sandbox-operations-e2e, cloud-onboard-e2e

Dispatch hint: deployment-services-e2e,diagnostics-e2e

Auto-dispatched E2E: deployment-services-e2e, diagnostics-e2e via nightly-e2e.yaml at f1d8833b3bf697178ea6d5c68f147253f69094ea

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • deployment-services-e2e (high (~60 min timeout; live sandbox plus cloudflared tunnel)): Directly covers nemoclaw tunnel start, cloudflared public tunnel status visibility, and nemoclaw tunnel stop URL removal. The PR changes cloudflared state detection and nemoclaw status service output, so this is the most targeted runtime E2E.
  • diagnostics-e2e (medium/high (~45 min timeout; live sandbox)): Covers host-side diagnostic/status behavior including nemoclaw status model information after onboarding. The PR changes status output shape and status-command dependencies, so a live diagnostics pass should block merge.

Optional E2E

  • sandbox-operations-e2e (high (~60 min timeout; multi-sandbox lifecycle/recovery)): Useful broader confidence for sandbox list/connect/status/recovery flows after changes to inventory/status dependencies and active session detection, but the changed logic is mostly display/probing rather than core sandbox lifecycle mutation.
  • cloud-onboard-e2e (high (~45 min timeout; live cloud onboard)): Optional smoke of a fresh cloud OpenClaw environment with inference.local and status-adjacent diagnostics; lower targeting than deployment-services/diagnostics for this PR.

New E2E recommendations

  • doctor-cloudflared-state (medium): Existing E2E inventory did not show a nemoclaw doctor coverage hit. Add coverage that exercises doctor output for stopped, stale PID file, stale PID process, and running cloudflared states, verifying the shared remediation text from readCloudflaredState.
    • Suggested test: Extend diagnostics-e2e or add a focused diagnostics doctor E2E step for nemoclaw doctor cloudflared local-service state classification.
  • bare-status-connected-inference-lines (medium): Unit tests cover the new Inference: and Connected: lines, but existing E2E status checks mostly assert model presence or tunnel URL visibility. Add live E2E assertions for bare nemoclaw status showing provider/model and connected-session state when a sandbox SSH session is active.
    • Suggested test: Extend test/e2e/test-diagnostics.sh or test/e2e/test-sandbox-operations.sh with bare nemoclaw status assertions for Inference: and Connected: lines.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: deployment-services-e2e,diagnostics-e2e

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25877325009
Target ref: 665ce6301738b946a1f8b1cd7f208d1bd2b86351
Workflow ref: main
Requested jobs: deployment-services-e2e,diagnostics-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
deployment-services-e2e ✅ success
diagnostics-e2e ✅ success

…s; run tunnel start to restart'

Initial PR copied doctor's existing hint wording, which says "run
`nemoclaw tunnel stop`" for stale-pid states. In the bare `nemoclaw
status` context after `pkill cloudflared`, the user wants the tunnel
back — they don't want a "stop" command for something already not
running. Reviewer comments on #2604 (wangericnv 2026-05-11, cv 2026-
05-14) both expected the shape `no cloudflared process; restart with
...` — a cause phrase plus a single-command recovery.

Reword all three failure modes to that shape and point at
`nemoclaw tunnel start` (which already handles stale PID files —
`isRunning()` returns false, `startService()` proceeds and overwrites
the file). Apply the same wording in doctor.ts so the two diagnostic
paths stay consistent. Update services.test.ts assertions to lock in
the new phrasing; doctor JSON tests in cli.test.ts only assert
status/detail and remain unaffected.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/tunnel/services.ts`:
- Around line 134-156: readCloudflaredState can detect a live but
non-cloudflared process (returns "stale-pid-process"/"stale-pid-file") but
stopService currently only uses isAlive(pid) before signaling, which risks
killing an unrelated process; update stopService to call
readCloudflaredState(pidDir) and only send SIGTERM/SIGKILL when the result.kind
=== "running" (use the returned pid), and treat "stale-pid-process" and
"stale-pid-file" as no-op/failure (do not signal); replace any direct
isAlive(pid) checks in stopService with this state-based guard to prevent
terminating recycled PIDs.
- Around line 127-131: The predicate commandLineNamesCloudflared currently scans
the entire argv string and can match incidental tokens; change it to match the
actual executable name (comm) instead of any argv token: update call sites
(including readCloudflaredState) to pass the process comm/command-name (or parse
the `comm` column from the ps fallback) and then compare basename(comm) ===
"cloudflared"; if you must still inspect args as a fallback, keep that as a
secondary check but prefer comm for deciding running vs stale-pid-process.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 46a52743-73cb-4501-9463-8896b74b9a5f

📥 Commits

Reviewing files that changed from the base of the PR and between 665ce63 and f1d8833.

📒 Files selected for processing (3)
  • src/lib/actions/sandbox/doctor.ts
  • src/lib/tunnel/services.test.ts
  • src/lib/tunnel/services.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/tunnel/services.test.ts
  • src/lib/actions/sandbox/doctor.ts

Comment on lines +127 to +131
function commandLineNamesCloudflared(commandLine: string): boolean {
return commandLine
.split(/\0|\s+/)
.filter(Boolean)
.some((token) => basename(token) === "cloudflared");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Match the executable, not any argv token.

This predicate returns true for unrelated processes whose args happen to contain the word cloudflared (for example python app.py cloudflared or sh -lc 'echo cloudflared'), so readCloudflaredState() can incorrectly report running instead of stale-pid-process.

Suggested fix
 function commandLineNamesCloudflared(commandLine: string): boolean {
-  return commandLine
-    .split(/\0|\s+/)
-    .filter(Boolean)
-    .some((token) => basename(token) === "cloudflared");
+  const [argv0] = commandLine.split(/\0/).filter(Boolean);
+  if (!argv0) return false;
+  return basename(argv0) === "cloudflared";
 }

If you still need the ps fallback, parse comm separately and compare that column instead of scanning the whole args string.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
function commandLineNamesCloudflared(commandLine: string): boolean {
return commandLine
.split(/\0|\s+/)
.filter(Boolean)
.some((token) => basename(token) === "cloudflared");
function commandLineNamesCloudflared(commandLine: string): boolean {
const [argv0] = commandLine.split(/\0/).filter(Boolean);
if (!argv0) return false;
return basename(argv0) === "cloudflared";
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/tunnel/services.ts` around lines 127 - 131, The predicate
commandLineNamesCloudflared currently scans the entire argv string and can match
incidental tokens; change it to match the actual executable name (comm) instead
of any argv token: update call sites (including readCloudflaredState) to pass
the process comm/command-name (or parse the `comm` column from the ps fallback)
and then compare basename(comm) === "cloudflared"; if you must still inspect
args as a fallback, keep that as a secondary check but prefer comm for deciding
running vs stale-pid-process.

Comment on lines +134 to +156
export function readCloudflaredState(pidDir: string): CloudflaredState {
const pidFile = join(pidDir, "cloudflared.pid");
if (!existsSync(pidFile)) return { kind: "stopped" };
let raw: string;
try {
raw = readFileSync(pidFile, "utf-8").trim();
} catch {
return { kind: "stopped" };
}
if (raw.length === 0) return { kind: "stopped" };
const pid = Number(raw);
if (!Number.isFinite(pid) || pid <= 0) return { kind: "stale-pid-file" };
try {
process.kill(pid, 0);
} catch {
return { kind: "stale-pid-process", pid };
}
const cmdline = readProcessCommandLine(pid);
if (cmdline !== null && !commandLineNamesCloudflared(cmdline)) {
return { kind: "stale-pid-process", pid };
}
return { kind: "running", pid };
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use this stale-PID classification before signaling the process.

readCloudflaredState() can now prove that the pidfile points at a live non-cloudflared process, but stopService() still only checks isAlive(pid) before sending SIGTERM/SIGKILL. A recycled PID can therefore still terminate an unrelated process when the user runs nemoclaw stop.

Suggested direction
 function stopService(pidDir: string, name: ServiceName): void {
+  if (name === "cloudflared") {
+    const state = readCloudflaredState(pidDir);
+    if (state.kind !== "running") {
+      if (state.kind !== "stopped") {
+        removePid(pidDir, name);
+      }
+      info(`${name} was not running`);
+      return;
+    }
+  }
+
   const pid = readPid(pidDir, name);
   if (pid === null) {
     info(`${name} was not running`);
     return;
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/tunnel/services.ts` around lines 134 - 156, readCloudflaredState can
detect a live but non-cloudflared process (returns
"stale-pid-process"/"stale-pid-file") but stopService currently only uses
isAlive(pid) before signaling, which risks killing an unrelated process; update
stopService to call readCloudflaredState(pidDir) and only send SIGTERM/SIGKILL
when the result.kind === "running" (use the returned pid), and treat
"stale-pid-process" and "stale-pid-file" as no-op/failure (do not signal);
replace any direct isAlive(pid) checks in stopService with this state-based
guard to prevent terminating recycled PIDs.

@prekshivyas

Copy link
Copy Markdown
Contributor Author

Superseded by #3537 — same code, one squashed commit, this time properly SSH-signed (my local clone had commit.gpgsign=false as a local override that I missed when committing). Force-push is blocked on this branch, so re-signing means a fresh branch + new PR.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25878386121
Target ref: f1d8833b3bf697178ea6d5c68f147253f69094ea
Workflow ref: main
Requested jobs: deployment-services-e2e,diagnostics-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
deployment-services-e2e ✅ success
diagnostics-e2e ✅ success

cv pushed a commit that referenced this pull request May 14, 2026
…/Inference fields (#3537)

> Re-signed replacement for #3534. Same code, single squashed commit,
this time SSH-signed (the original branch had `commit.gpgsign=false` set
as a local override on my clone — the prior PR's commits ended up
unverified). Force-push is blocked on the original branch, so this is a
fresh branch + new PR; #3534 will be closed in favour of this one.

## Summary

Fixes #2604. Both @wangericnv (2026-05-11) and @cv (2026-05-14)
re-reported the same symptom on v0.0.38 / v0.0.41 — `nemoclaw status`
prints `● cloudflared (stopped)` in all three failure modes (no PID
file, garbage PID, dead/wrong-process PID) with no cause and no
remediation. The bare command also omits the `Connected:` / `Inference:`
labeled fields the original bug requested. Both symptoms are addressed
here.

## Root cause

- **Cloudflared diagnostic.** `src/lib/actions/sandbox/doctor.ts`
already distinguished three states and emitted matching hints.
`showStatus()` in `src/lib/tunnel/services.ts` was written before that
and only had `isRunning() ? ok : "(stopped)"` — every failure mode
collapsed into the same un-actionable line.
- **Bare-status fields.** PR #2884 added `Inference:` / `Connected:`
labels to the per-sandbox `nemoclaw <name> status` but left bare
`nemoclaw status` showing only the model in parens.

## What this PR does

1. **Share the cloudflared state machine.** New
`readCloudflaredState(pidDir)` in `src/lib/tunnel/services.ts` returns a
discriminated union `{ kind: "running" | "stopped" | "stale-pid-file" |
"stale-pid-process" }`. `showStatus()` switches on it and emits a
coloured marker + one-line remediation. `cloudflaredDoctorCheck()`
consumes the same function and translates each state into its
`DoctorCheck`, removing duplicated PID-file / `/proc/<pid>/cmdline`
logic.

2. **Remediation wording — `no cloudflared process; restart with ...`.**
All three failure modes lead with the cause and point at `nemoclaw
tunnel start`. Both reporters asked for that exact shape. `nemoclaw
tunnel start` already handles all three states because `isRunning()`
returns false and `startService` proceeds and overwrites any stale PID
file — so one command recovers in every case. The same wording is used
in `doctor.ts` for consistency.

3. **Bare-status `Inference:` and `Connected:` lines.**
`showStatusCommand` in `src/lib/inventory/index.ts` now prints labeled
`Inference: <provider> / <model>` and `Connected: yes (N session) / no`
under each sandbox row. Provider/model prefer live gateway values for
the default sandbox (consistent with the existing `(model)` rendering,
#2369). `getActiveSessionCount` is wired through
`buildStatusCommandDeps`, mirroring the cached SSH-process probe already
used by `buildListCommandDeps`.

4. **Tests** (15 new). Three cases each for `showStatus` failure-mode
rendering, five cases for `readCloudflaredState`, seven for bare-status
`Inference:` / `Connected:` rendering across live/stored/missing dep
variants. All existing cli-doctor tests still pass against the
refactored shared function.

## Behavioural diff

Before (v0.0.41 baseline):

```
● cloudflared  (stopped)        # identical output for all three failure modes
```

After:

```
● cloudflared  (stopped)
    no cloudflared process; run `nemoclaw tunnel start` to start it

● cloudflared  (stale PID file)
    no cloudflared process (stored PID is invalid); run `nemoclaw tunnel start` to restart it

● cloudflared  (stale PID 999999999)
    no cloudflared process (PID 999999999 is dead or not cloudflared); run `nemoclaw tunnel start` to restart it
```

Bare status sandbox row:

```
# Before
test-sandbox * (qwen2.5:7b) :18789

# After
test-sandbox * (qwen2.5:7b) :18789
  Inference: ollama-local / qwen2.5:7b
  Connected: no
```

## Brev reproduction — 3× per case, baseline and fix

(Run on the prior PR #3534 branch, same code as here.) Fresh
`n2d-standard-2` (Ubuntu 22.04 / Linux 6.8 GCP), v0.0.41 from tag, faked
registry, three PID-dir states 3× each. Re-installed from this branch.

<details>
<summary><b>Baseline v0.0.41 — 9/9 runs reproduce the bug</b></summary>

```
### Case: stopped — no PID file
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)

### Case: stale-pid-file — garbage PID contents
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)

### Case: stale-pid-process — PID is dead
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)
```
</details>

<details>
<summary><b>Fix branch — 9/9 runs emit a state-specific
remediation</b></summary>

Same three-case 3× harness, with the new wording. Identical output
across reruns within each case — no flake.
</details>

## Out of scope (filed/to-file as follow-ups)

- **Auto-restart on crash.** Needs a watchdog daemon / systemd unit /
`while true; cloudflared || sleep 5` shim. Real design work, separate
PR. Manual `pkill cloudflared` is not preventable from inside the CLI
either way; the right answer there is actionable status output, which
this PR provides.
- **Intermittent cloudflared exits in nightly E2E (#3494).** Different
scope — CI-infra fault attribution. #3517 covers it; touches disjoint
files.

## Test plan

- [x] `npm run build:cli` clean
- [x] `npx tsc -p tsconfig.cli.json` clean
- [x] `vitest run src/lib/tunnel/services.test.ts` — 23 pass
- [x] `vitest run src/lib/inventory/index.test.ts` — 31 pass
- [x] `vitest run test/cli.test.ts -t doctor` — 4/4 pass (covers
refactored `cloudflaredDoctorCheck`)
- [x] Brev `n2d-standard-2` repro: 9/9 baseline reproduces; 9/9 fix
shows the new diagnostic + remediation lines, no flake
- [x] Commit SSH-signed and verified by GitHub (`reason: valid`)

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* Status command now displays inference provider and model per sandbox.
  * Status command includes active SSH session count when available.

* **Bug Fixes**
* Enhanced cloudflared health checks with refined state detection
(running, stopped, stale PID).
  * Improved remediation hints for cloudflared diagnostic messages.

* **Tests**
* Added tests for expanded status output with provider, model, and
session information.
* Added tests for cloudflared state detection across various scenarios.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3537)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[NemoClaw][DGX Spark][Ubuntu 24.04][CLI] nemoclaw status omits Connected/Inference fields and shows cloudflared stopped with no context

2 participants