fix(inference): restart the vLLM container after host reboot (#4886) by latenighthackathon · Pull Request #4904 · NVIDIA/NemoClaw

latenighthackathon · 2026-06-07T05:50:10Z

Summary

The long-lived vLLM inference container is started with docker run -d and no restart policy, so after a host reboot or a Docker daemon restart the inference container does not come back up. nemoclaw inference get then fails and the only recovery path is re-running nemoclaw onboard --fresh --gpu. This adds --restart unless-stopped to the vLLM run command so the container is brought back automatically, covering the DGX Spark, DGX Station, and generic Linux profiles in one place. It matches the restart-policy handling already used for the GPU-patched gateway container in docker-gpu-patch.ts.

Related Issue

Closes #4886

Changes

- Extract the vLLM `docker run` construction into an exported `buildVllmRunCommand()` helper in `src/lib/inference/vllm.ts`.
- Add `--restart unless-stopped` to that command so the container survives a host reboot or Docker daemon restart.
- Add unit tests asserting that the restart policy is present and that the profile run flags and image are preserved.
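
For readers without the diff open, the helper described above could look roughly like the sketch below. This is an illustration, not the code merged in this PR. The `VllmProfile` shape, the argv-array return type, the flag ordering, and the `vllm serve` string are assumptions. Only the `buildVllmRunCommand(profile, model, runFlags)` signature, the `nemoclaw-vllm` container name, the `/bin/bash` entrypoint, and the `--restart unless-stopped` flag come from the PR text.

```typescript
// Hypothetical profile shape; the real type in src/lib/inference/vllm.ts may differ.
interface VllmProfile {
  image: string; // container image selected for the hardware profile
  port: number; // port published on the host and used by the server
}

const CONTAINER_NAME = "nemoclaw-vllm";

export function buildVllmRunCommand(
  profile: VllmProfile,
  model: string,
  runFlags: string[] = [],
): string[] {
  // Assumed serve command; the PR description does not show the real one.
  const serveCommand = `vllm serve ${model} --port ${profile.port}`;
  return [
    "docker", "run", "-d",
    "--name", CONTAINER_NAME,
    // The fix: let Docker bring the container back after a reboot or daemon restart.
    "--restart", "unless-stopped",
    "-p", `${profile.port}:${profile.port}`,
    // Profile-specific flags (for example GPU options) pass through unchanged.
    ...runFlags,
    "--entrypoint", "/bin/bash",
    profile.image,
    "-c", serveCommand,
  ];
}
```

Building the command in a pure function, separate from the code that executes it, is what makes the restart policy assertable in a unit test without a Docker daemon.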

Type of Change

Bug fix (non-breaking change which fixes an issue)

Verification

New and existing unit tests pass locally.

Ran: npx vitest run src/lib/inference/vllm.test.ts (9 passed), npx tsc -p tsconfig.src.json, npm run typecheck:cli, and npx @biomejs/biome lint src/lib/inference/vllm.ts (all clean).
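
Beyond the unit tests, the policy on a running container can be checked with `docker inspect`. The sketch below is an assumption-laden illustration and not part of this PR: `restartPolicyInspectArgs` and `comesBackAfterDaemonRestart` are hypothetical helper names. The first builds the arguments for `docker inspect`, and the second interprets the policy name that command prints.

```typescript
// Arguments for `docker inspect` that print only the restart policy name,
// e.g. "no", "on-failure", "always", or "unless-stopped".
function restartPolicyInspectArgs(container: string): string[] {
  return ["inspect", "--format", "{{.HostConfig.RestartPolicy.Name}}", container];
}

// True when the policy restarts the container after a Docker daemon restart
// or host reboot. "on-failure" only reacts to non-zero exits, and
// "unless-stopped" skips containers that were stopped before the daemon went down.
function comesBackAfterDaemonRestart(policyName: string): boolean {
  const name = policyName.trim();
  return name === "always" || name === "unless-stopped";
}

// Usage: pass the args to a process runner of your choice, then check its stdout.
const inspectArgs = restartPolicyInspectArgs("nemoclaw-vllm");
console.log(["docker", ...inspectArgs].join(" "));
```

A container created before this fix would be expected to report `no` here, which is consistent with the recovery path in the summary: it has to be recreated to pick up the new policy.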

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

Summary by CodeRabbit

Release Notes

Tests
- Added test coverage for vLLM container command generation and validation.
Refactor
- Enhanced vLLM container resilience with an automatic restart policy (`--restart unless-stopped`), so the container comes back after a host reboot or Docker daemon restart.

…4886)

The long-lived nemoclaw-vllm container is started with `docker run -d` and no restart policy, so after a host reboot or a Docker daemon restart the inference container does not come back. `nemoclaw inference get` then fails and the only recovery is re-running `nemoclaw onboard --fresh --gpu`.

Add `--restart unless-stopped` to the vLLM `docker run` command so the container is brought back up automatically, covering the Spark, Station, and generic Linux profiles in one place. This matches the restart-policy handling already used for the GPU-patched gateway container in docker-gpu-patch.ts.

The run-command construction is extracted into an exported buildVllmRunCommand() helper so the policy is unit-testable.

Closes NVIDIA#4886

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

coderabbitai · 2026-06-07T05:50:22Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ec333df6-f6a2-4403-a454-654658ffad90

📥 Commits

Reviewing files that changed from the base of the PR and between d7aa5e0 and 5dfde7d.

📒 Files selected for processing (2)

src/lib/inference/vllm.test.ts
src/lib/inference/vllm.ts

📝 Walkthrough

Extracted vLLM docker run command construction into a testable helper function buildVllmRunCommand that includes the critical --restart unless-stopped policy. Updated startContainer to use this helper and added comprehensive test coverage validating command structure, flags, image selection, and entrypoint configuration.

Changes

vLLM restart policy extraction

Layer / File(s)	Summary
vLLM run command helper and integration `src/lib/inference/vllm.ts`	New exported `buildVllmRunCommand(profile, model, runFlags)` centralizes docker run command construction with `--restart unless-stopped`, port binding, entrypoint, and serve-command wiring. `startContainer` refactored to call the helper instead of assembling the command inline.
vLLM run command tests `src/lib/inference/vllm.test.ts`	Import `buildVllmRunCommand` and add test suite validating the generated command includes `--restart unless-stopped`, correct container name and port, preserves custom run flags, includes the selected vLLM image, and sets `/bin/bash` entrypoint.
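
The test assertions summarized in the table come down to checking that a flag and its value appear as adjacent arguments. A minimal, runner-independent sketch of that check follows. `hasFlagPair` and `sampleArgv` are hypothetical names, and the sample command is illustrative rather than the real output of `buildVllmRunCommand`. It also assumes the command is available as an argument array. The actual suite in `src/lib/inference/vllm.test.ts` runs under Vitest and may be structured differently.

```typescript
// True when `flag` is immediately followed by `value` somewhere in argv.
function hasFlagPair(argv: readonly string[], flag: string, value: string): boolean {
  return argv.some((arg, i) => arg === flag && argv[i + 1] === value);
}

// Illustrative command only; image name and custom flags are made up.
const sampleArgv: readonly string[] = [
  "docker", "run", "-d",
  "--name", "nemoclaw-vllm",
  "--restart", "unless-stopped",
  "--gpus", "all",
  "--entrypoint", "/bin/bash",
  "example/vllm:latest",
];

// Each entry pairs a label with the result of one check.
const checks: Array<[string, boolean]> = [
  ["restart policy", hasFlagPair(sampleArgv, "--restart", "unless-stopped")],
  ["container name", hasFlagPair(sampleArgv, "--name", "nemoclaw-vllm")],
  ["custom run flags preserved", hasFlagPair(sampleArgv, "--gpus", "all")],
  ["entrypoint", hasFlagPair(sampleArgv, "--entrypoint", "/bin/bash")],
  ["image present", sampleArgv.includes("example/vllm:latest")],
];

const failed = checks.filter(([, ok]) => !ok).map(([label]) => label);
if (failed.length > 0) {
  throw new Error(`Missing from docker run command: ${failed.join(", ")}`);
}
```

Checking adjacent pairs rather than a substring of the joined command avoids false positives, such as a `--restart` flag followed by a different policy.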

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

area: inference, bug-fix, Docker, Provider: vLLM, v0.0.60

Poem

A rabbit hops through Docker's logs,
And finds the container-stopping cogs—
restart unless-stopped, now that's the call!
After reboot, vLLM won't fall,
The inference flows, no gaps at all! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: adding the --restart unless-stopped policy to ensure vLLM container restarts after host reboot, which is the core fix.
Linked Issues check	✅ Passed	The code changes fully address the linked issue `#4886` by implementing --restart unless-stopped for vLLM container persistence across reboots, extracting docker run construction, and including comprehensive unit tests.
Out of Scope Changes check	✅ Passed	All changes are strictly scoped to the stated objective: extracting buildVllmRunCommand helper, adding restart policy, and testing. No unrelated modifications detected.


## Summary

- Add the v0.0.61 release notes from the GitHub dev announcement.
- Document managed vLLM recovery after host reboot and Slack denied-mention feedback.
- Refresh generated `nemoclaw-user-*` skills from the source docs.

## Source summary

- #4983 -> `docs/about/release-notes.mdx`: Added the v0.0.61 release summary from the dev announcement and linked behavior groups to deeper docs.
- #4904 -> `docs/inference/use-local-inference.mdx`: Documented that managed vLLM restarts the `nemoclaw-vllm` container after host reboot during recovery.
- #4933 -> `docs/manage-sandboxes/messaging-channels.mdx`: Documented Slack sender feedback for denied channel `@mention` events.
- #4879, #4915, #4935, #4759, #4164, #4888, #4897, #4944, #4959 -> `.agents/skills/`: Refreshed generated user skills from the current source docs for release prep.

## Verification

- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx`
- `npm run docs` (passed outside the tool sandbox after `tsx` IPC pipe creation was blocked in the sandbox)
- `npm run build:cli` (refreshed local `dist/` for the pre-push TypeScript hook)
- Commit and pre-push hooks passed, including docs-to-skills verification, markdownlint, gitleaks, skills YAML tests, and CLI TypeScript.

## Summary by CodeRabbit

* **Documentation**
  * Updated sandbox security documentation with file descriptor limits.
  * Changed default inference model for DGX Station profile.
  * Enhanced agent policy and backup/restore documentation.
  * Improved command reference examples with clearer formatting.
  * Clarified Slack messaging denial notice behavior.
  * Added automatic vLLM container recovery during host reboot.
  * Updated release notes for v0.0.61.

cv approved these changes Jun 7, 2026

cv merged commit 8fec8fa into NVIDIA:main Jun 7, 2026
38 checks passed

cv added the v0.0.61 Release target label Jun 7, 2026

latenighthackathon deleted the fix/4886-vllm-restart-policy branch June 7, 2026 16:30

miyoungc mentioned this pull request Jun 8, 2026

docs: refresh v0.0.61 release docs #4992

Merged

wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026

cv merged 1 commit into NVIDIA:main from latenighthackathon:fix/4886-vllm-restart-policy

3 participants
