fix(inference): restart the vLLM container after host reboot (#4886) #4904

Merged: cv merged 1 commit into NVIDIA:main from latenighthackathon:fix/4886-vllm-restart-policy on Jun 7, 2026

latenighthackathon commented Jun 7, 2026

Contributor

Summary

The long-lived vLLM inference container is started with `docker run -d` and no restart policy, so after a host reboot or a Docker daemon restart the inference container does not come back up. `nemoclaw inference get` then fails, and the only recovery path is re-running `nemoclaw onboard --fresh --gpu`. This adds `--restart unless-stopped` to the vLLM run command so the container is brought back automatically, covering the DGX Spark, DGX Station, and generic Linux profiles in one place. It matches the restart-policy handling already used for the GPU-patched gateway container in `docker-gpu-patch.ts`.

Related Issue

Closes #4886

Changes

  • Extract the vLLM `docker run` construction into an exported `buildVllmRunCommand()` helper in `src/lib/inference/vllm.ts`.
  • Add `--restart unless-stopped` to that command so the container survives a host reboot or Docker daemon restart.
  • Add unit tests asserting the restart policy is present and that profile run flags and image are preserved.
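
The diff itself is not reproduced in this conversation, so the following is only a rough sketch of what such a helper could look like. The `(profile, model, runFlags)` parameters, the `nemoclaw-vllm` container name, and the `/bin/bash` entrypoint are mentioned elsewhere in this PR. The `VllmProfileSketch` type, the argv-array return shape, port 8000, and the `vllm serve` invocation are assumptions made for illustration, not the actual code in `src/lib/inference/vllm.ts`.

```typescript
// Hypothetical profile shape; the real type in vllm.ts is not shown in this PR.
interface VllmProfileSketch {
  image: string; // vLLM image selected for the profile
}

const CONTAINER_NAME = "nemoclaw-vllm"; // name taken from the commit message
const ASSUMED_PORT = 8000; // assumption: the actual port is not stated in this PR

// Sketch only: returns the docker argv as an array (the real helper may return a string).
function buildVllmRunCommandSketch(
  profile: VllmProfileSketch,
  model: string,
  runFlags: string[],
): string[] {
  return [
    "docker", "run", "-d",
    "--name", CONTAINER_NAME,
    // The fix: Docker restarts the container after a host reboot or daemon
    // restart, unless it was explicitly stopped beforehand.
    "--restart", "unless-stopped",
    "-p", `${ASSUMED_PORT}:${ASSUMED_PORT}`,
    ...runFlags, // profile-specific flags are passed through unchanged
    "--entrypoint", "/bin/bash",
    profile.image,
    "-c", `vllm serve ${model} --port ${ASSUMED_PORT}`,
  ];
}
```

Keeping the construction in a pure function means the restart policy can be asserted on without starting a container, which is the testability argument made in the commit message.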

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Verification

  • New and existing unit tests pass locally.

Ran: `npx vitest run src/lib/inference/vllm.test.ts` (9 passed), `npx tsc -p tsconfig.src.json`, `npm run typecheck:cli`, and `npx @biomejs/biome lint src/lib/inference/vllm.ts` (all clean).

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

Summary by CodeRabbit

Release Notes

  • Tests

    • Added test coverage for vLLM container command generation and validation.
  • Refactor

    • Enhanced vLLM container resilience with an automatic restart policy (--restart unless-stopped) that brings the container back after a host reboot or Docker daemon restart.

…4886)

The long-lived nemoclaw-vllm container is started with `docker run -d`
and no restart policy, so after a host reboot or a Docker daemon restart
the inference container does not come back. `nemoclaw inference get` then
fails and the only recovery is re-running `nemoclaw onboard --fresh --gpu`.

Add `--restart unless-stopped` to the vLLM `docker run` command so the
container is brought back up automatically, covering the Spark, Station,
and generic Linux profiles in one place. This matches the restart-policy
handling already used for the GPU-patched gateway container in
docker-gpu-patch.ts. The run-command construction is extracted into an
exported buildVllmRunCommand() helper so the policy is unit-testable.

Closes NVIDIA#4886

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

coderabbitai Bot commented Jun 7, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ec333df6-f6a2-4403-a454-654658ffad90

📥 Commits

Reviewing files that changed from the base of the PR and between d7aa5e0 and 5dfde7d.

📒 Files selected for processing (2)
  • src/lib/inference/vllm.test.ts
  • src/lib/inference/vllm.ts

Walkthrough

Extracted vLLM docker run command construction into a testable helper function buildVllmRunCommand that includes the critical --restart unless-stopped policy. Updated startContainer to use this helper and added comprehensive test coverage validating command structure, flags, image selection, and entrypoint configuration.

Changes

vLLM restart policy extraction

| Layer / File(s) | Summary |
|---|---|
| vLLM run command helper and integration (`src/lib/inference/vllm.ts`) | New exported `buildVllmRunCommand(profile, model, runFlags)` centralizes `docker run` command construction with `--restart unless-stopped`, port binding, entrypoint, and serve-command wiring. `startContainer` refactored to call the helper instead of assembling the command inline. |
| vLLM run command tests (`src/lib/inference/vllm.test.ts`) | Import `buildVllmRunCommand` and add test suite validating the generated command includes `--restart unless-stopped`, correct container name and port, preserves custom run flags, includes the selected vLLM image, and sets `/bin/bash` entrypoint. |
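
The test file is not shown in this conversation. As a hedged illustration of the checks listed above, the sketch below validates a `docker run` argv array without depending on any test framework. The names `ExpectedRunCommand`, `hasFlagValue`, and `findRunCommandProblems` are hypothetical, and the sketch assumes the command is available as an array of arguments, which may not match the real helper's return type. The port check is omitted because the port value is not stated in this PR.

```typescript
// Hypothetical expectation shape used only by this sketch.
interface ExpectedRunCommand {
  containerName: string;
  image: string;
  runFlags: string[];
}

// True when `flag` is immediately followed by `value` in the argv array.
function hasFlagValue(argv: string[], flag: string, value: string): boolean {
  return argv.some((arg, i) => arg === flag && argv[i + 1] === value);
}

// Returns a list of human-readable problems; an empty list means all checks passed.
function findRunCommandProblems(
  argv: string[],
  expected: ExpectedRunCommand,
): string[] {
  const problems: string[] = [];
  if (!hasFlagValue(argv, "--restart", "unless-stopped")) {
    problems.push("missing --restart unless-stopped");
  }
  if (!hasFlagValue(argv, "--name", expected.containerName)) {
    problems.push(`missing --name ${expected.containerName}`);
  }
  if (!hasFlagValue(argv, "--entrypoint", "/bin/bash")) {
    problems.push("missing --entrypoint /bin/bash");
  }
  if (!argv.includes(expected.image)) {
    problems.push(`missing image ${expected.image}`);
  }
  for (const flag of expected.runFlags) {
    if (!argv.includes(flag)) problems.push(`run flag dropped: ${flag}`);
  }
  return problems;
}
```

In a vitest suite, a check like this could be wrapped as `expect(findRunCommandProblems(cmd, expected)).toEqual([])`, so a regression that drops the restart policy fails with a readable message.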

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

area: inference, bug-fix, Docker, Provider: vLLM, v0.0.60

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding the `--restart unless-stopped` policy to ensure vLLM container restarts after host reboot, which is the core fix. |
| Linked Issues check | ✅ Passed | The code changes fully address the linked issue #4886 by implementing `--restart unless-stopped` for vLLM container persistence across reboots, extracting docker run construction, and including comprehensive unit tests. |
| Out of Scope Changes check | ✅ Passed | All changes are strictly scoped to the stated objective: extracting `buildVllmRunCommand` helper, adding restart policy, and testing. No unrelated modifications detected. |

cv merged commit 8fec8fa into NVIDIA:main on Jun 7, 2026 (38 checks passed)
cv added the v0.0.61 (Release target) label on Jun 7, 2026
latenighthackathon deleted the fix/4886-vllm-restart-policy branch on June 7, 2026 16:30
miyoungc added a commit that referenced this pull request Jun 8, 2026
## Summary
- Add the v0.0.61 release notes from the GitHub dev announcement.
- Document managed vLLM recovery after host reboot and Slack
denied-mention feedback.
- Refresh generated `nemoclaw-user-*` skills from the source docs.

## Source summary
- #4983 -> `docs/about/release-notes.mdx`: Added the v0.0.61 release
summary from the dev announcement and linked behavior groups to deeper
docs.
- #4904 -> `docs/inference/use-local-inference.mdx`: Documented that
managed vLLM restarts the `nemoclaw-vllm` container after host reboot
during recovery.
- #4933 -> `docs/manage-sandboxes/messaging-channels.mdx`: Documented
Slack sender feedback for denied channel `@mention` events.
- #4879, #4915, #4935, #4759, #4164, #4888, #4897, #4944, #4959 ->
`.agents/skills/`: Refreshed generated user skills from the current
source docs for release prep.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `npm run docs` (passed outside the tool sandbox after `tsx` IPC pipe
creation was blocked in the sandbox)
- `npm run build:cli` (refreshed local `dist/` for the pre-push
TypeScript hook)
- Commit and pre-push hooks passed, including docs-to-skills
verification, markdownlint, gitleaks, skills YAML tests, and CLI
TypeScript.

## Summary by CodeRabbit

* **Documentation**
  * Updated sandbox security documentation with file descriptor limits.
  * Changed default inference model for DGX Station profile.
  * Enhanced agent policy and backup/restore documentation.
  * Improved command reference examples with clearer formatting.
  * Clarified Slack messaging denial notice behavior.
  * Documented automatic vLLM container recovery after host reboot.
  * Updated release notes for v0.0.61.

wscurran added the bug-fix (PR fixes a bug or regression) label on Jun 8, 2026

Labels

bug-fix (PR fixes a bug or regression)
v0.0.61 (Release target)

Development

Successfully merging this pull request may close these issues.

Inference is gone in DGX spark after reboot or update

3 participants