fix(inference): improve managed vLLM download and launch progress#4676
Conversation
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughReplaces log-stream readiness detection with polling of the vLLM OpenAI-compatible /v1/models endpoint, simplifies VllmProfile to timeout-based fields, adds heartbeat messages during download and launch, requires curl for readiness checks, and prints container log tails on launch failures. ChangesvLLM Readiness Mechanism Replacement
Sequence Diagram(s)sequenceDiagram
participant Install as installVllm
participant Poll as waitForVllmReady
participant Endpoint as /v1/models Endpoint
participant Container as vLLM Container
participant Logs as Container Logs
Install->>Poll: start polling with timeout
Poll->>Container: docker inspect (running?)
Container-->>Poll: running or not
Poll->>Endpoint: curl /v1/models
Endpoint-->>Poll: success or network error
alt ready
Poll-->>Install: {ok: true}
else timeout reached
Poll->>Logs: tail container logs
Logs-->>Poll: recent log lines
Poll-->>Install: {ok: false} + log tail
else container stopped
Poll-->>Install: {ok: false}
else heartbeat interval
Poll-->>Poll: emit "API not ready" message
Poll->>Endpoint: retry after interval
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: None Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 4 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/inference/vllm.ts (1)
243-325:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftHeartbeat-only
hf downloadcan hang install forever.Unlike
pullImage(), this external download path has no watchdog or timeout at all. Ifdocker run hf download ...wedges without exiting, the Promise never settles and the new heartbeat will print indefinitely. Please add a no-output stall cutoff and tear down the one-shot download container when it trips.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/inference/vllm.ts` around lines 243 - 325, Add a stall timeout that force-terminates the docker download if there’s no output for too long: introduce a STALL_CUTOFF_MS (e.g. a multiple of MODEL_DOWNLOAD_HEARTBEAT_MS or a configured constant), and in the existing heartbeat callback check if now - lastOutputAt >= STALL_CUTOFF_MS; if so, call proc.kill() (or proc.kill('SIGKILL') if needed), emit a descriptive message, and call done({ ok: false, reason: `hf download stalled (no output for ${STALL_CUTOFF_MS}ms)` }). Ensure you only kill if proc exists, rely on the existing done() guard to avoid double-resolve, and keep clearing the heartbeat interval as done() already does; reference symbols: dockerSpawn, proc, heartbeat, lastOutputAt, MODEL_DOWNLOAD_HEARTBEAT_MS, done.
🧹 Nitpick comments (1)
test/detect-vllm-profile.test.ts (1)
75-75: 💤 Low valueConsider adding
type: "nvidia"for consistency.Other tests that retrieve the Spark profile explicitly include
type: "nvidia"alongsidespark: true(lines 24, 47, 52). For consistency and clarity, consider:- const spark = detectVllmProfile({ spark: true }); + const spark = detectVllmProfile({ spark: true, type: "nvidia" });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/detect-vllm-profile.test.ts` at line 75, The test calling detectVllmProfile currently uses detectVllmProfile({ spark: true }) which is inconsistent with other tests; update the call to pass the explicit GPU type by using detectVllmProfile({ spark: true, type: "nvidia" }) so the Spark profile assertion matches the other cases and clarifies the intended GPU backend (refer to detectVllmProfile and the spark variable in this test).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/inference/vllm.ts`:
- Around line 524-525: The hard-coded ETA message emitted by emit("Launch can
take 5-20 minutes; this is normal") under vllm startup should be derived from
the profile's loadTimeoutSec (or made open-ended); update the code that calls
emit (the emit invocations in this vllm launch flow) to compute a minutes value
from profile.loadTimeoutSec (e.g., Math.ceil(profile.loadTimeoutSec / 60)) and
emit a message that reflects that value (or emit a generic "Launch can take
several minutes to up to X minutes" or "Launch can take several minutes; this
can take longer for large models"), ensuring you reference
profile.loadTimeoutSec and the existing emit call/site when making the change.
---
Outside diff comments:
In `@src/lib/inference/vllm.ts`:
- Around line 243-325: Add a stall timeout that force-terminates the docker
download if there’s no output for too long: introduce a STALL_CUTOFF_MS (e.g. a
multiple of MODEL_DOWNLOAD_HEARTBEAT_MS or a configured constant), and in the
existing heartbeat callback check if now - lastOutputAt >= STALL_CUTOFF_MS; if
so, call proc.kill() (or proc.kill('SIGKILL') if needed), emit a descriptive
message, and call done({ ok: false, reason: `hf download stalled (no output for
${STALL_CUTOFF_MS}ms)` }). Ensure you only kill if proc exists, rely on the
existing done() guard to avoid double-resolve, and keep clearing the heartbeat
interval as done() already does; reference symbols: dockerSpawn, proc,
heartbeat, lastOutputAt, MODEL_DOWNLOAD_HEARTBEAT_MS, done.
---
Nitpick comments:
In `@test/detect-vllm-profile.test.ts`:
- Line 75: The test calling detectVllmProfile currently uses detectVllmProfile({
spark: true }) which is inconsistent with other tests; update the call to pass
the explicit GPU type by using detectVllmProfile({ spark: true, type: "nvidia"
}) so the Spark profile assertion matches the other cases and clarifies the
intended GPU backend (refer to detectVllmProfile and the spark variable in this
test).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 07d2389c-03a9-4970-aa4b-b949ffa6beed
📒 Files selected for processing (2)
src/lib/inference/vllm.tstest/detect-vllm-profile.test.ts
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
## Summary - Add the missing `v0.0.57` release-notes section with links to the detailed docs pages for command, inference, onboarding, messaging, status, installer, and policy changes. - Remove public references to docs-skip terms from source docs and regenerate the NemoClaw user skills from the current Fern MDX docs. - Carry forward generated references for the per-agent documentation split, including Hermes-specific reference files. ## Source summary - #4615 and #4653 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover host-side `sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON` secondary-agent baking. - #4163, #4204, #4611, #4619, and #4676 -> `docs/about/release-notes.mdx`, `docs/inference/use-local-inference.mdx`: Release notes now cover managed vLLM progress/readiness, DGX Spark model default changes, local Ollama streaming usage, and inference route divergence warnings. - #4267, #4601, #4609, #4642, #4645, and #4661 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover UFW auto-remediation, local-inference reachability gates, gateway reuse/binding, cancel rollback, and policy selection persistence. - #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and Slack placeholder normalization. - #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover status failure layers, paused-container hints, Docker-driver doctor behavior, and non-destructive stale-registry recovery. - #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/network-policy/integration-policy-examples.mdx`: Release notes now cover installer tag pinning, PyPI `uv` policy access, and observable Jira validation. - #4632 -> `.agents/skills/`: Regenerated user skills from the current per-agent docs source, including newly generated Hermes reference files. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" docs --glob "*.mdx"` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" .agents/skills --glob "*.md"` - `npm run docs` - `npm run build:cli` - Commit hooks: markdownlint, docs-to-skills verification, gitleaks, skills YAML, commitlint <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Restructured documentation to clearly distinguish OpenClaw and Hermes agent variants throughout user guides. * Enhanced security, credential storage, and deployment guidance with clearer setup flows. * Added Hermes plugin installation and ecosystem documentation. * Improved workspace, messaging, and policy management references with variant-specific command examples. * Refined troubleshooting and CLI reference sections for clarity. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Improves managed local vLLM setup progress visibility and readiness handling. Model downloads now stream native Hugging Face progress output, and vLLM launch waits quietly on the
/v1/modelsreadiness endpoint with bounded probes and failure log tails.Changes
hf downloadoutput directly with Docker TTY enabled for live progress./v1/modelsreadiness polling.curlprereq and bounded curl timeouts for readiness checks.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Signed-off-by: zyang-dev 267119621+zyang-dev@users.noreply.github.com
Summary by CodeRabbit
Bug Fixes
Improvements
Tests