fix(docker): replace docker pull hard timeout with progress watchdog for vllm#4163
Conversation
…for vllm Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a docker pull progress watchdog (parsing, dedupe keys, timeouts), tests the stall/max timeout semantics and output tailing, and integrates the watchdog into vLLM image pulls with an increased 12-hour safety budget. ChangesDocker Pull Watchdog
Sequence Diagram(s)sequenceDiagram
participant User
participant Code as dockerPullWithProgressWatchdog
participant Spawn as spawnImpl
participant Docker as "docker pull" child
User->>Code: dockerPullWithProgressWatchdog(imageRef, opts)
Code->>Spawn: spawn("docker", ["pull", imageRef], {cwd, env, stdio})
Spawn-->>Docker: start process
loop stream lines
Docker-->>Code: stderr/stdout lines
Code->>Code: dockerPullProgressSignature(line) -> update lastProgress when new
end
alt stall or max timeout exceeded
Code->>Docker: kill(SIGTERM) then schedule SIGKILL
Docker-->>Code: close(status 124)
Code-->>User: resolve(timeoutKind: "stall"|"max", status: 124, output)
else process exits normally
Docker-->>Code: close(exit status)
Code-->>User: resolve(status, output, timeoutKind: null)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 1 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/adapters/docker/pull.ts`:
- Around line 239-243: positiveMs currently floors positive fractional inputs so
values in (0,1) become 0; update positiveMs to treat any positive value less
than 1 as 1 (i.e., clamp sub-millisecond positives to at least 1) while keeping
fallbackMs for non-positive/invalid inputs — change the logic in function
positiveMs to return 1 for 0<value<1 and Math.floor(value) for value>=1,
otherwise fallbackMs.
- Around line 79-83: The BuildKit progress detection currently keys on the whole
normalized line (using normalized.match and returning `buildkit:${normalized}`),
which lets ETA/rate changes mask stalls; update the logic to key the progress
signature only on the captured transferred and total byte counters from the
regex (the capture groups produced by normalized.match into buildKitPull)
instead of the entire normalized string—e.g., return a stable signature like
`buildkit:${buildKitPull[1]}/${buildKitPull[2]}` (or otherwise normalize those
two groups) so progress is driven by bytes transferred/total rather than
ETA/rate text.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ee763d91-12f0-4720-a279-7a7f686e702f
📒 Files selected for processing (3)
src/lib/adapters/docker/pull.test.tssrc/lib/adapters/docker/pull.tssrc/lib/inference/vllm.ts
…d clamped positive sub-millisecond timeouts to 1ms Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
cv
left a comment
There was a problem hiding this comment.
Thanks for the updates. I’m still requesting changes before this merges because a few pieces of the prior feedback are not fully resolved:
-
src/lib/adapters/docker/pull.ts: BuildKit-style pull progress is no longer recognized at all. The earlier concern was that BuildKit progress should be keyed on stable byte counters rather than ETA/rate text; the current implementation removes that path, so BuildKit-only progress lines would returnnulland the watchdog could stall-kill an actively advancing pull. Please either restore BuildKit progress detection using only normalized transferred/total counters (and add/keep a regression test), or explicitly document that BuildKit-style output is unsupported and make sure no caller/test/review claim depends on it. -
src/lib/adapters/docker/pull.ts: output/progress retention is still unbounded.linesretains every flushed output line andseenProgressretains every distinct progress signature for pulls that can run for the 12-hour vLLM budget. Please bound diagnostic output to a tail and bound/compact progress tracking (for example latest signature per layer/phase or a bounded LRU) so noisy external Docker output cannot grow NemoClaw memory indefinitely. -
src/lib/inference/vllm.ts: the production vLLM wiring still lacks focused coverage. The Docker helper has unit tests, but thepullImage()/installVllm()path now awaitsdockerPullWithProgressWatchdog()and maps watchdog results to user-visible errors. Please add a small unit test with a mocked watchdog that verifies the vLLM caller passes the expected timeout/logging options and mapsstall,max, and non-timeout failures correctly.
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
…nto fix/docker-pull-stall-timeout
|
@cv Addressed the review comment. Please take a look. |
## Summary - Add the missing `v0.0.57` release-notes section with links to the detailed docs pages for command, inference, onboarding, messaging, status, installer, and policy changes. - Remove public references to docs-skip terms from source docs and regenerate the NemoClaw user skills from the current Fern MDX docs. - Carry forward generated references for the per-agent documentation split, including Hermes-specific reference files. ## Source summary - #4615 and #4653 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover host-side `sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON` secondary-agent baking. - #4163, #4204, #4611, #4619, and #4676 -> `docs/about/release-notes.mdx`, `docs/inference/use-local-inference.mdx`: Release notes now cover managed vLLM progress/readiness, DGX Spark model default changes, local Ollama streaming usage, and inference route divergence warnings. - #4267, #4601, #4609, #4642, #4645, and #4661 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover UFW auto-remediation, local-inference reachability gates, gateway reuse/binding, cancel rollback, and policy selection persistence. - #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and Slack placeholder normalization. - #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover status failure layers, paused-container hints, Docker-driver doctor behavior, and non-destructive stale-registry recovery. - #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/network-policy/integration-policy-examples.mdx`: Release notes now cover installer tag pinning, PyPI `uv` policy access, and observable Jira validation. - #4632 -> `.agents/skills/`: Regenerated user skills from the current per-agent docs source, including newly generated Hermes reference files. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" docs --glob "*.mdx"` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" .agents/skills --glob "*.md"` - `npm run docs` - `npm run build:cli` - Commit hooks: markdownlint, docs-to-skills verification, gitleaks, skills YAML, commitlint <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Restructured documentation to clearly distinguish OpenClaw and Hermes agent variants throughout user guides. * Enhanced security, credential storage, and deployment guidance with clearer setup flows. * Added Hermes plugin installation and ecosystem documentation. * Improved workspace, messaging, and policy management references with variant-specific command examples. * Refined troubleshooting and CLI reference sections for clarity. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
This PR replaces the hard wall-clock timeout used for the vLLM Docker image pull with a progress-aware watchdog. Slow but active pulls can continue, while genuinely stalled pulls are terminated after an idle period and still have a long maximum safety budget.
Changes
docker pull <image>argv.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: zyang-dev 267119621+zyang-dev@users.noreply.github.com
Summary by CodeRabbit
New Features
Tests