Skip to content

fix(docker): replace docker pull hard timeout with progress watchdog for vllm#4163

Merged
cv merged 7 commits into
mainfrom
fix/docker-pull-stall-timeout
Jun 1, 2026
Merged

fix(docker): replace docker pull hard timeout with progress watchdog for vllm#4163
cv merged 7 commits into
mainfrom
fix/docker-pull-stall-timeout

Conversation

@zyang-dev

@zyang-dev zyang-dev commented May 25, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR replaces the hard wall-clock timeout used for the vLLM Docker image pull with a progress-aware watchdog. Slow but active pulls can continue, while genuinely stalled pulls are terminated after an idle period and still have a long maximum safety budget.

Changes

  • Add a Docker pull watchdog helper that tracks forward progress from Docker pull output.
  • Detect stalled pulls based on lack of new progress, instead of total elapsed time alone.
  • Keep a maximum pull budget as a safety net for pathological cases.
  • Wire the watchdog only into the vLLM image pull path.
  • Add tests for successful pulls, stall timeout, max timeout, spawn errors, progress parsing, and plain docker pull <image> argv.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: zyang-dev 267119621+zyang-dev@users.noreply.github.com

Summary by CodeRabbit

  • New Features

    • Added a Docker pull progress watchdog that classifies progress, detects stalls vs. overall time limits, attempts graceful termination on timeout, and captures recent pull output.
    • Integrated the watchdog into model-image pulls and increased the max safety budget for large downloads.
  • Tests

    • Added tests covering progress classification, stall and max-time behavior, termination signaling, output bounding, and spawn-error handling.

Review Change Stack

…for vllm

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@zyang-dev zyang-dev self-assigned this May 25, 2026
@coderabbitai

coderabbitai Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 275907ee-fc4c-45fc-92aa-7f4f5ae14bc9

📥 Commits

Reviewing files that changed from the base of the PR and between 3857138 and 6819175.

📒 Files selected for processing (3)
  • src/lib/adapters/docker/pull.test.ts
  • src/lib/adapters/docker/pull.ts
  • src/lib/inference/vllm.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/adapters/docker/pull.test.ts

📝 Walkthrough

Walkthrough

Adds a docker pull progress watchdog (parsing, dedupe keys, timeouts), tests the stall/max timeout semantics and output tailing, and integrates the watchdog into vLLM image pulls with an increased 12-hour safety budget.

Changes

Docker Pull Watchdog

Layer / File(s) Summary
Watchdog types, constants, and progress signature parsing
src/lib/adapters/docker/pull.ts
Exports DEFAULT_DOCKER_PULL_STALL_TIMEOUT_MS, DEFAULT_DOCKER_PULL_MAX_TIMEOUT_MS, interfaces (DockerPullWatchdogOptions, DockerPullWatchdogResult, DockerPullReadable, DockerPullChildProcess), and implements dockerPullProgressSignature() / dockerPullProgressKey() to normalize docker pull output into stable signatures.
Watchdog implementation with timeout management
src/lib/adapters/docker/pull.ts
Adds dockerPullWithProgressWatchdog() that spawns docker pull (injectable spawnImpl), buffers trimmed non-empty stdout/stderr lines (tail-capped), updates last-progress only on new deduped signatures, enforces stall and max timeouts, sends SIGTERM then scheduled SIGKILL, and resolves DockerPullWatchdogResult with status, signal, timeoutKind, and captured output. Includes helpers to normalize env and clamp/format timeouts.
Watchdog test suite
src/lib/adapters/docker/pull.test.ts
Vitest tests with FakeReadable/FakeChild: verify dockerPullProgressSignature covers multiple docker formats, confirm spawn args and stdio when suppressOutput is true, assert non-termination when progress advances, detect stall timeout and max-budget timeout (SIGTERM + status 124), validate 1ms clamping, ensure output is tail-capped to last 200 lines, and check spawn-error handling.
vLLM image pull integration
src/lib/inference/vllm.ts
Switches to ESM imports, exports VllmProfile, refactors pullImage to export async and to call dockerPullWithProgressWatchdog with maxTimeoutMs from profile.pullTimeoutSec, increases Spark pullTimeoutSec to 12 * 60 * 60, updates stream wiring to optional-chained .?.on(...), and awaits pullImage in installVllm.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Code as dockerPullWithProgressWatchdog
  participant Spawn as spawnImpl
  participant Docker as "docker pull" child
  User->>Code: dockerPullWithProgressWatchdog(imageRef, opts)
  Code->>Spawn: spawn("docker", ["pull", imageRef], {cwd, env, stdio})
  Spawn-->>Docker: start process
  loop stream lines
    Docker-->>Code: stderr/stdout lines
    Code->>Code: dockerPullProgressSignature(line) -> update lastProgress when new
  end
  alt stall or max timeout exceeded
    Code->>Docker: kill(SIGTERM) then schedule SIGKILL
    Docker-->>Code: close(status 124)
    Code-->>User: resolve(timeoutKind: "stall"|"max", status: 124, output)
  else process exits normally
    Docker-->>Code: close(exit status)
    Code-->>User: resolve(status, output, timeoutKind: null)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Docker, fix

Suggested reviewers

  • cv
  • ericksoa

"I'm a rabbit on a nightly jog,
Watching docker layers in the fog,
I mark new progress, tail the logs,
If stalls appear I wave my flags,
Timeouts sent with gentle hops and clogs."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(docker): replace docker pull hard timeout with progress watchdog for vllm' accurately and specifically describes the main change: replacing a hard timeout with a progress-aware watchdog for Docker pulls in the vLLM integration.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/docker-pull-stall-timeout

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@zyang-dev zyang-dev added the v0.0.51 Release target label May 25, 2026
@github-actions

github-actions Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: full-e2e, gpu-e2e
Optional E2E: onboard-inference-smoke-e2e, inference-routing-e2e

Dispatch hint: full-e2e,gpu-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • full-e2e (high; live Docker/OpenShell sandbox plus NVIDIA Endpoints inference): Required baseline because src/lib/inference/vllm.ts is loaded through onboard and the change affects installer/onboarding control flow. This verifies install.sh, onboard, sandbox creation, and live inference still work for the primary user journey.
  • gpu-e2e (high; requires NVKS GPU runner and model pull): Closest existing GPU/local-inference E2E. It does not exercise managed vLLM directly, but it validates that local GPU onboarding, Docker GPU runtime setup, sandbox inference wiring, and real assistant flow still work after changes in local inference code.

Optional E2E

  • onboard-inference-smoke-e2e (low; hermetic mocked route, no live sandbox): Low-cost regression guard that imports onboard and verifies onboarding fails closed when an inference route is configured but runtime-broken. Useful adjacent coverage for inference/onboard module-loading changes, but it does not cover vLLM image pull behavior.
  • inference-routing-e2e (medium; Docker/OpenShell plus selective live provider checks): Adjacent confidence for provider routing through inference.local and credential isolation. Optional because this PR specifically changes managed vLLM image pull/install rather than generic provider routing.

New E2E recommendations

  • managed vLLM install / docker pull watchdog (high): Existing E2E coverage appears to exercise Ollama GPU and cloud inference, but not NEMOCLAW_PROVIDER=install-vllm with the managed vLLM pull/download/start/readiness path changed here. A regression in watchdog progress parsing, stall timeout, max timeout, or async pullImage would not be caught by current required jobs.
    • Suggested test: Add a GPU-runner E2E for NEMOCLAW_EXPERIMENTAL=1 NEMOCLAW_PROVIDER=install-vllm that runs install/onboard, pulls or uses a cached vLLM image, waits for /v1/models, registers vllm-local, and proves a sandbox inference request succeeds.
  • docker pull watchdog diagnostics (medium): Unit tests cover parsing and timer behavior with fake child processes, but no E2E validates real Docker CLI pull output, bounded diagnostic tails, and failure messaging on an actual runner.
    • Suggested test: Add a PR-safe script E2E using a small public image and a controlled fake/stalled docker wrapper to assert dockerPullWithProgressWatchdog behavior from the built CLI path without requiring the full vLLM image.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: full-e2e,gpu-e2e

@github-actions

github-actions Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw
Optional scenario E2E: gpu-repo-local-ollama-openclaw

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required scenario E2E

  • ubuntu-repo-cloud-openclaw: Smallest standard scenario that exercises repo-checkout onboarding with Docker running and cloud inference. The Docker adapter barrel exports the changed pull helper, and this scenario catches import/runtime regressions in Docker-backed onboarding and sandbox startup; no dispatchable ROUTES scenario directly exercises managed vLLM install.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional scenario E2E

  • gpu-repo-local-ollama-openclaw: Adjacent GPU/Docker local-inference path and closest available special-runner coverage for containerized local inference changes, though it uses Ollama rather than the changed vLLM install path.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw

Relevant changed files

  • src/lib/adapters/docker/pull.ts
  • src/lib/inference/vllm.ts

@github-actions

github-actions Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 1 still applies, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Add vLLM caller coverage for watchdog wiring (src/lib/inference/vllm.ts:230): The new Docker pull watchdog helper has direct unit coverage, but the production vLLM install path changed from a synchronous shell timeout to an awaited watchdog call without caller-level tests. That leaves the caller contract unverified: passing the profile image, mapping pullTimeoutSec to maxTimeoutMs, forwarding progress logging, awaiting the pull before later install steps, and converting stall/max failures into user-facing reasons.
    • Recommendation: Add a focused unit test around pullImage() or installVllm() with dockerPullWithProgressWatchdog mocked. Cover at least success, stall timeout, max timeout, generic non-zero failure, and aborting later install steps after pull failure. Also identify targeted runtime/sandbox validation for real Docker pull progress output and signal behavior.
    • Evidence: Repository search found watchdog tests only in src/lib/adapters/docker/pull.test.ts. The changed caller now does await dockerPullWithProgressWatchdog(profile.image, { maxTimeoutMs: profile.pullTimeoutSec * 1000, logLine: emit }) and maps timeoutKind values in src/lib/inference/vllm.ts:230-244, but no test exercises pullImage() or installVllm() wiring.

🌱 Nice ideas

  • None.
Since last review details

Current findings:

  • Add vLLM caller coverage for watchdog wiring (src/lib/inference/vllm.ts:230): The new Docker pull watchdog helper has direct unit coverage, but the production vLLM install path changed from a synchronous shell timeout to an awaited watchdog call without caller-level tests. That leaves the caller contract unverified: passing the profile image, mapping pullTimeoutSec to maxTimeoutMs, forwarding progress logging, awaiting the pull before later install steps, and converting stall/max failures into user-facing reasons.
    • Recommendation: Add a focused unit test around pullImage() or installVllm() with dockerPullWithProgressWatchdog mocked. Cover at least success, stall timeout, max timeout, generic non-zero failure, and aborting later install steps after pull failure. Also identify targeted runtime/sandbox validation for real Docker pull progress output and signal behavior.
    • Evidence: Repository search found watchdog tests only in src/lib/adapters/docker/pull.test.ts. The changed caller now does await dockerPullWithProgressWatchdog(profile.image, { maxTimeoutMs: profile.pullTimeoutSec * 1000, logLine: emit }) and maps timeoutKind values in src/lib/inference/vllm.ts:230-244, but no test exercises pullImage() or installVllm() wiring.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/adapters/docker/pull.ts`:
- Around line 239-243: positiveMs currently floors positive fractional inputs so
values in (0,1) become 0; update positiveMs to treat any positive value less
than 1 as 1 (i.e., clamp sub-millisecond positives to at least 1) while keeping
fallbackMs for non-positive/invalid inputs — change the logic in function
positiveMs to return 1 for 0<value<1 and Math.floor(value) for value>=1,
otherwise fallbackMs.
- Around line 79-83: The BuildKit progress detection currently keys on the whole
normalized line (using normalized.match and returning `buildkit:${normalized}`),
which lets ETA/rate changes mask stalls; update the logic to key the progress
signature only on the captured transferred and total byte counters from the
regex (the capture groups produced by normalized.match into buildKitPull)
instead of the entire normalized string—e.g., return a stable signature like
`buildkit:${buildKitPull[1]}/${buildKitPull[2]}` (or otherwise normalize those
two groups) so progress is driven by bytes transferred/total rather than
ETA/rate text.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ee763d91-12f0-4720-a279-7a7f686e702f

📥 Commits

Reviewing files that changed from the base of the PR and between 8be9986 and 33ce647.

📒 Files selected for processing (3)
  • src/lib/adapters/docker/pull.test.ts
  • src/lib/adapters/docker/pull.ts
  • src/lib/inference/vllm.ts

Comment thread src/lib/adapters/docker/pull.ts Outdated
Comment thread src/lib/adapters/docker/pull.ts
zyang-dev and others added 2 commits May 24, 2026 23:26
…d clamped positive sub-millisecond timeouts to 1ms

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@zyang-dev zyang-dev added the provider: vllm vLLM local or hosted provider behavior label May 26, 2026
@cv cv added v0.0.52 Release target v0.0.53 Release target and removed v0.0.51 Release target v0.0.52 Release target labels May 26, 2026

@cv cv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. I’m still requesting changes before this merges because a few pieces of the prior feedback are not fully resolved:

  1. src/lib/adapters/docker/pull.ts: BuildKit-style pull progress is no longer recognized at all. The earlier concern was that BuildKit progress should be keyed on stable byte counters rather than ETA/rate text; the current implementation removes that path, so BuildKit-only progress lines would return null and the watchdog could stall-kill an actively advancing pull. Please either restore BuildKit progress detection using only normalized transferred/total counters (and add/keep a regression test), or explicitly document that BuildKit-style output is unsupported and make sure no caller/test/review claim depends on it.

  2. src/lib/adapters/docker/pull.ts: output/progress retention is still unbounded. lines retains every flushed output line and seenProgress retains every distinct progress signature for pulls that can run for the 12-hour vLLM budget. Please bound diagnostic output to a tail and bound/compact progress tracking (for example latest signature per layer/phase or a bounded LRU) so noisy external Docker output cannot grow NemoClaw memory indefinitely.

  3. src/lib/inference/vllm.ts: the production vLLM wiring still lacks focused coverage. The Docker helper has unit tests, but the pullImage() / installVllm() path now awaits dockerPullWithProgressWatchdog() and maps watchdog results to user-visible errors. Please add a small unit test with a mocked watchdog that verifies the vLLM caller passes the expected timeout/logging options and maps stall, max, and non-timeout failures correctly.

zyang-dev added 2 commits May 27, 2026 09:58
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@zyang-dev

Copy link
Copy Markdown
Contributor Author

@cv Addressed the review comment. Please take a look.

@ericksoa ericksoa added v0.0.55 and removed v0.0.53 Release target labels May 27, 2026
@jyaunches jyaunches added R2 v0.0.56 Release target and removed v0.0.55 labels May 29, 2026
@cv cv added v0.0.57 Release target and removed v0.0.56 Release target labels Jun 1, 2026
@cv cv enabled auto-merge (squash) June 1, 2026 22:18
@cv cv merged commit 5e60860 into main Jun 1, 2026
27 checks passed
@cv cv deleted the fix/docker-pull-stall-timeout branch June 1, 2026 22:22
cv pushed a commit that referenced this pull request Jun 3, 2026
## Summary
- Add the missing `v0.0.57` release-notes section with links to the
detailed docs pages for command, inference, onboarding, messaging,
status, installer, and policy changes.
- Remove public references to docs-skip terms from source docs and
regenerate the NemoClaw user skills from the current Fern MDX docs.
- Carry forward generated references for the per-agent documentation
split, including Hermes-specific reference files.

## Source summary
- #4615 and #4653 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover host-side
`sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON`
secondary-agent baking.
- #4163, #4204, #4611, #4619, and #4676 ->
`docs/about/release-notes.mdx`,
`docs/inference/use-local-inference.mdx`: Release notes now cover
managed vLLM progress/readiness, DGX Spark model default changes, local
Ollama streaming usage, and inference route divergence warnings.
- #4267, #4601, #4609, #4642, #4645, and #4661 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release
notes now cover UFW auto-remediation, local-inference reachability
gates, gateway reuse/binding, cancel rollback, and policy selection
persistence.
- #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover
Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and
Slack placeholder normalization.
- #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover status failure
layers, paused-container hints, Docker-driver doctor behavior, and
non-destructive stale-registry recovery.
- #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/lifecycle.mdx`,
`docs/network-policy/integration-policy-examples.mdx`: Release notes now
cover installer tag pinning, PyPI `uv` policy access, and observable
Jira validation.
- #4632 -> `.agents/skills/`: Regenerated user skills from the current
per-agent docs source, including newly generated Hermes reference files.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" docs --glob "*.mdx"`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" .agents/skills --glob "*.md"`
- `npm run docs`
- `npm run build:cli`
- Commit hooks: markdownlint, docs-to-skills verification, gitleaks,
skills YAML, commitlint

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Restructured documentation to clearly distinguish OpenClaw and Hermes
agent variants throughout user guides.
* Enhanced security, credential storage, and deployment guidance with
clearer setup flows.
  * Added Hermes plugin installation and ecosystem documentation.
* Improved workspace, messaging, and policy management references with
variant-specific command examples.
  * Refined troubleshooting and CLI reference sections for clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression provider: vllm vLLM local or hosted provider behavior v0.0.57 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants