fix(inference): restart the vLLM container after host reboot (#4886) #4904
…4886) The long-lived nemoclaw-vllm container is started with `docker run -d` and no restart policy, so after a host reboot or a Docker daemon restart the inference container does not come back. `nemoclaw inference get` then fails and the only recovery is re-running `nemoclaw onboard --fresh --gpu`. Add `--restart unless-stopped` to the vLLM `docker run` command so the container is brought back up automatically, covering the Spark, Station, and generic Linux profiles in one place. This matches the restart-policy handling already used for the GPU-patched gateway container in docker-gpu-patch.ts. The run-command construction is extracted into an exported buildVllmRunCommand() helper so the policy is unit-testable. Closes NVIDIA#4886 Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
Walkthrough
Extracted vLLM `docker run` command construction into a testable helper function.
Changes: vLLM restart policy extraction
## Summary
- Add the v0.0.61 release notes from the GitHub dev announcement.
- Document managed vLLM recovery after host reboot and Slack denied-mention feedback.
- Refresh generated `nemoclaw-user-*` skills from the source docs.

## Source summary
- #4983 -> `docs/about/release-notes.mdx`: Added the v0.0.61 release summary from the dev announcement and linked behavior groups to deeper docs.
- #4904 -> `docs/inference/use-local-inference.mdx`: Documented that managed vLLM restarts the `nemoclaw-vllm` container after host reboot during recovery.
- #4933 -> `docs/manage-sandboxes/messaging-channels.mdx`: Documented Slack sender feedback for denied channel `@mention` events.
- #4879, #4915, #4935, #4759, #4164, #4888, #4897, #4944, #4959 -> `.agents/skills/`: Refreshed generated user skills from the current source docs for release prep.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx`
- `npm run docs` (passed outside the tool sandbox after `tsx` IPC pipe creation was blocked in the sandbox)
- `npm run build:cli` (refreshed local `dist/` for the pre-push TypeScript hook)
- Commit and pre-push hooks passed, including docs-to-skills verification, markdownlint, gitleaks, skills YAML tests, and CLI TypeScript.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
* **Documentation**
  * Updated sandbox security documentation with file descriptor limits.
  * Changed default inference model for DGX Station profile.
  * Enhanced agent policy and backup/restore documentation.
  * Improved command reference examples with clearer formatting.
  * Clarified Slack messaging denial notice behavior.
  * Added automatic vLLM container recovery during host reboot.
  * Updated release notes for v0.0.61.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
The long-lived vLLM inference container is started with `docker run -d` and no restart policy, so after a host reboot or a Docker daemon restart the inference container does not come back up. `nemoclaw inference get` then fails and the only recovery path is re-running `nemoclaw onboard --fresh --gpu`. This adds `--restart unless-stopped` to the vLLM run command so the container is brought back automatically, covering the DGX Spark, DGX Station, and generic Linux profiles in one place. It matches the restart-policy handling already used for the GPU-patched gateway container in `docker-gpu-patch.ts`.
Related Issue
Closes #4886
Changes
- Extract the `docker run` construction into an exported `buildVllmRunCommand()` helper in `src/lib/inference/vllm.ts`.
- Add `--restart unless-stopped` to that command so the container survives a host reboot or Docker daemon restart.
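The shape of such a helper can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `src/lib/inference/vllm.ts`: the function name `buildVllmRunCommandSketch`, the `VllmRunOptionsSketch` type, the argument order, and the image value in the usage note are all hypothetical. Only `docker run -d`, `--restart unless-stopped`, and the `nemoclaw-vllm` container name come from the PR text.

```typescript
// Hypothetical options shape; the real helper's parameters are not shown in this PR.
interface VllmRunOptionsSketch {
  image: string; // container image to run (placeholder value in the usage note)
  containerName?: string; // defaults to the name used in the PR text
  extraArgs?: string[]; // any profile-specific docker flags (assumed)
}

// Build the argv for `docker run` as an array so a test can inspect it
// without spawning Docker.
function buildVllmRunCommandSketch(opts: VllmRunOptionsSketch): string[] {
  const name = opts.containerName ?? "nemoclaw-vllm";
  return [
    "docker",
    "run",
    "-d",
    // Restart after a host reboot or daemon restart, unless the container
    // was explicitly stopped.
    "--restart",
    "unless-stopped",
    "--name",
    name,
    ...(opts.extraArgs ?? []),
    opts.image,
  ];
}
```

Returning an argument array rather than a shell string lets a unit test assert on the `--restart` flag and its value directly, for example with `buildVllmRunCommandSketch({ image: "example/vllm:latest" })`.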
Verification
Ran:
`npx vitest run src/lib/inference/vllm.test.ts` (9 passed), `npx tsc -p tsconfig.src.json`, `npm run typecheck:cli`, and `npx @biomejs/biome lint src/lib/inference/vllm.ts` (all clean).
Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
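Beyond the unit tests listed under Verification, the policy on an already-running container could also be checked from the JSON that `docker inspect` prints. The sketch below is not part of this PR: `restartPolicyOf` and `InspectEntrySketch` are hypothetical names, and the sample JSON is hand-written for illustration rather than captured from a real host. It assumes the usual inspect layout, where the policy name lives at `HostConfig.RestartPolicy.Name`.

```typescript
// Only the fields this sketch reads; real inspect output has many more.
interface InspectEntrySketch {
  Name?: string;
  HostConfig?: { RestartPolicy?: { Name?: string; MaximumRetryCount?: number } };
}

// Return the restart policy name from `docker inspect <container>` output,
// which is a JSON array with one entry per inspected object.
function restartPolicyOf(inspectJson: string): string {
  const parsed: unknown = JSON.parse(inspectJson);
  const entries = Array.isArray(parsed) ? (parsed as InspectEntrySketch[]) : [];
  const first = entries[0];
  if (!first) throw new Error("no container found in inspect output");
  // Assumption: a missing or empty name is treated the same as "no" (no policy).
  return first.HostConfig?.RestartPolicy?.Name || "no";
}

// Hand-written sample, shaped like inspect output for the container in this PR.
const sampleInspect = JSON.stringify([
  {
    Name: "/nemoclaw-vllm",
    HostConfig: { RestartPolicy: { Name: "unless-stopped", MaximumRetryCount: 0 } },
  },
]);

console.log(restartPolicyOf(sampleInspect)); // prints "unless-stopped"
```

A container created before this change would be expected to report no policy here, which is the state that left the inference container down after a reboot.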