fix(docker): add HEALTHCHECK instruction to Dockerfile#1485
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds HEALTHCHECKs: runtime ChangesDocker HEALTHCHECKs
Sequence Diagram(s)sequenceDiagram
participant DockerEngine as Docker Engine
participant Container as App Container
participant HealthProbe as HEALTHCHECK Probe
DockerEngine->>Container: start container
DockerEngine-)HealthProbe: schedule probe (interval, start-period, retries)
HealthProbe->>Container: execute probe (HTTP curl or node command)
alt probe success (2xx / exit 0)
HealthProbe-->>DockerEngine: report healthy
else probe failure (non-2xx / non-zero exit)
HealthProbe-->>DockerEngine: report unhealthy
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
Dockerfile (1)
177-178: Consider adding a startup grace period to reduce cold-start false negatives.A short
--start-periodtypically improves reliability when services take time to bootstrap.Proposed tweak
-HEALTHCHECK --interval=30s --timeout=5s --retries=3 \ - CMD curl -sf http://127.0.0.1:18789/health || exit 1 +HEALTHCHECK --interval=30s --timeout=5s --start-period=45s --retries=3 \ + CMD curl -fsS http://127.0.0.1:18789/health || exit 1🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 177 - 178, The Dockerfile's HEALTHCHECK currently lacks a startup grace period, causing cold-starts to be reported as failures; update the HEALTHCHECK directive that uses "HEALTHCHECK --interval=30s --timeout=5s --retries=3 CMD curl -sf http://127.0.0.1:18789/health || exit 1" to include a suitable --start-period (for example --start-period=60s or another value appropriate for your service) so Docker will wait before beginning health probes, keeping the rest of the flags intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Dockerfile`:
- Around line 174-179: The Dockerfile change added a HEALTHCHECK block
(HEALTHCHECK --interval=30s --timeout=5s --retries=3 \ CMD curl -sf
http://127.0.0.1:18789/health || exit 1) but the same directive must also be
added to Dockerfile.base; open Dockerfile.base and add an identical HEALTHCHECK
directive (same interval, timeout, retries and CMD curl -sf
http://127.0.0.1:18789/health || exit 1) at the analogous location where health
checks are defined in Dockerfile (near the end or after the service start/EXPOSE
steps) so both images include the same container health probe.
---
Nitpick comments:
In `@Dockerfile`:
- Around line 177-178: The Dockerfile's HEALTHCHECK currently lacks a startup
grace period, causing cold-starts to be reported as failures; update the
HEALTHCHECK directive that uses "HEALTHCHECK --interval=30s --timeout=5s
--retries=3 CMD curl -sf http://127.0.0.1:18789/health || exit 1" to include a
suitable --start-period (for example --start-period=60s or another value
appropriate for your service) so Docker will wait before beginning health
probes, keeping the rest of the flags intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
There was a problem hiding this comment.
🧹 Nitpick comments (1)
Dockerfile.base (1)
128-129: Derived test image inherits insufficient base HEALTHCHECKThe base HEALTHCHECK (lines 128-129) only validates that
nodecan execute, not that the actual service is running. The verification confirms thattest/Dockerfile.sandbox(line 8 of provided context) inherits from${BASE_IMAGE}but does not define its own HEALTHCHECK override, leaving it vulnerable to false-positive health reports.To mitigate this, either require all child images to define a service-level HEALTHCHECK probe, or parameterize the base command so inheritors can inject a meaningful healthcheck (e.g., a startup script or port check).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile.base` around lines 128 - 129, The base HEALTHCHECK is too superficial (only running node) and causes child images like test/Dockerfile.sandbox to inherit a false-positive probe; add a build-time argument (e.g., ARG HEALTHCHECK_CMD) and use it in the HEALTHCHECK instruction so inheritors can supply a meaningful check, or alternatively remove the base HEALTHCHECK and document that every child image must define its own HEALTHCHECK; update the HEALTHCHECK line to reference HEALTHCHECK_CMD and ensure test/Dockerfile.sandbox sets HEALTHCHECK_CMD (or defines its own HEALTHCHECK) so the probe actually validates the service.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@Dockerfile.base`:
- Around line 128-129: The base HEALTHCHECK is too superficial (only running
node) and causes child images like test/Dockerfile.sandbox to inherit a
false-positive probe; add a build-time argument (e.g., ARG HEALTHCHECK_CMD) and
use it in the HEALTHCHECK instruction so inheritors can supply a meaningful
check, or alternatively remove the base HEALTHCHECK and document that every
child image must define its own HEALTHCHECK; update the HEALTHCHECK line to
reference HEALTHCHECK_CMD and ensure test/Dockerfile.sandbox sets
HEALTHCHECK_CMD (or defines its own HEALTHCHECK) so the probe actually validates
the service.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ee49aa74-0422-41e5-815b-0ce53bf2d577
📒 Files selected for processing (2)
DockerfileDockerfile.base
🚧 Files skipped from review as they are similar to previous changes (1)
- Dockerfile
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/Dockerfile.sandbox (1)
89-92: Remove the redundant|| exit 1suffix.The
node -e "process.exit(0)"command already returns non-zero on failure, which Docker HEALTHCHECK treats as unhealthy. The|| exit 1adds no functional value.♻️ Suggested simplification
# Test image: no long-running service, so just verify the runtime works. # Overrides any inherited HEALTHCHECK to avoid false-positive probes. HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \ - CMD node -e "process.exit(0)" || exit 1 + CMD node -e "process.exit(0)"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/Dockerfile.sandbox` around lines 89 - 92, The HEALTHCHECK line includes a redundant shell suffix "|| exit 1" after the command `node -e "process.exit(0)"`; remove the `|| exit 1` so the HEALTHCHECK simply runs `node -e "process.exit(0)"` (Docker already treats a non-zero exit from that command as unhealthy). Update the HEALTHCHECK invocation (the line containing HEALTHCHECK --interval=30s ... CMD node -e "process.exit(0)" || exit 1) to drop the `|| exit 1` fragment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@test/Dockerfile.sandbox`:
- Around line 89-92: The HEALTHCHECK line includes a redundant shell suffix "||
exit 1" after the command `node -e "process.exit(0)"`; remove the `|| exit 1` so
the HEALTHCHECK simply runs `node -e "process.exit(0)"` (Docker already treats a
non-zero exit from that command as unhealthy). Update the HEALTHCHECK invocation
(the line containing HEALTHCHECK --interval=30s ... CMD node -e
"process.exit(0)" || exit 1) to drop the `|| exit 1` fragment.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 76cfeeb0-cab5-4ba0-b302-e04a4fbc653f
📒 Files selected for processing (2)
Dockerfile.basetest/Dockerfile.sandbox
✅ Files skipped from review due to trivial changes (1)
- Dockerfile.base
|
✨ Thanks for submitting this fix, which proposes a way to add a HEALTHCHECK instruction to the Dockerfile. This enables better container health monitoring and improves reliability in standalone deployments. Possibly related open issues: |
dknos
left a comment
There was a problem hiding this comment.
Addressed: removed the redundant || exit 1 suffix from the HEALTHCHECK in test/Dockerfile.sandbox — node -e "process.exit(0)" already returns non-zero on failure, so the extra suffix was dead code. The base image (Dockerfile.base) and main Dockerfile HEALTHCHECK instructions were added in earlier commits on this branch and remain unchanged.
0e2fb12 to
a914e84
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
Dockerfile.base (1)
131-138: Well-documented baseline health check — minor simplification possible.The approach is sound: a minimal runtime validation for the base image, with clear guidance that child images override with service-specific probes. The parameters are reasonable.
The
|| exit 1suffix is redundant—ifnodefails to execute or exits non-zero, Docker already treats it as unhealthy. You can simplify to just the command.🔧 Optional simplification
HEALTHCHECK --interval=30s --timeout=5s --start-period=45s --retries=3 \ - CMD node -e "process.exit(0)" || exit 1 + CMD node -e "process.exit(0)"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile.base` around lines 131 - 138, Remove the redundant shell fallback at the end of the HEALTHCHECK; update the instruction that currently uses 'HEALTHCHECK ... CMD node -e "process.exit(0)" || exit 1' to simply run the node command directly (i.e., keep the HEALTHCHECK options and use CMD node -e "process.exit(0)" without the '|| exit 1' suffix) so Docker can interpret the exit status of Node itself.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@Dockerfile.base`:
- Around line 131-138: Remove the redundant shell fallback at the end of the
HEALTHCHECK; update the instruction that currently uses 'HEALTHCHECK ... CMD
node -e "process.exit(0)" || exit 1' to simply run the node command directly
(i.e., keep the HEALTHCHECK options and use CMD node -e "process.exit(0)"
without the '|| exit 1' suffix) so Docker can interpret the exit status of Node
itself.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 749adf45-8084-41dc-8866-917923489006
📒 Files selected for processing (3)
DockerfileDockerfile.basetest/Dockerfile.sandbox
✅ Files skipped from review due to trivial changes (1)
- test/Dockerfile.sandbox
🚧 Files skipped from review as they are similar to previous changes (1)
- Dockerfile
c8406c8 to
fc60ec8
Compare
|
We appreciate the contribution — a |
fc60ec8 to
c6ff2a8
Compare
|
Rebased onto current |
Enables Docker and orchestrators to detect unhealthy containers and trigger automatic restarts in standalone deployments. - Production Dockerfile probes the gateway /health endpoint on :18789. - Dockerfile.base ships a baseline node-runtime probe; child images that expose a service MUST override it with a service-specific check. - test/Dockerfile.sandbox overrides the inherited probe so its short-lived runtime does not produce false-positive failures. Fixes NVIDIA#1430 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: dknos <rneebo@gmail.com>
c6ff2a8 to
a6339f1
Compare
…check-1430 Signed-off-by: Aaron Erickson <aerickson@nvidia.com> # Conflicts: # Dockerfile # Dockerfile.base
ericksoa
left a comment
There was a problem hiding this comment.
Approved after resolving the current-main conflicts and validating the Dockerfile healthcheck change on the repaired head e79f965. Required PR checks are green; the full nightly is still running with high-signal Dockerfile/startup jobs passing, and the one observed failure is in onboard-negative-paths-e2e around the existing status command/test contract rather than the Dockerfile healthcheck path.
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 26185546689
|
## Summary Refreshes NemoClaw release notes for v0.0.47 and v0.0.48, then regenerates the corresponding user-skill references so agent-facing docs match the source pages. Preview: https://nvidia-preview-docs-release-notes-47-48.docs.buildwithfern.com/nemoclaw/about/release-notes ## Changes - Adds explicit v0.0.47 and v0.0.48 sections to `docs/about/release-notes.mdx`. - Documents follow-up WSL Ollama, sandbox image, share mount, and troubleshooting updates from recent release changes. - Regenerates `nemoclaw-user-*` skill references from the Fern MDX source docs. ## Source Summary - #4003 -> `docs/about/release-notes.mdx`: Notes the messaging manifest registry work as part of v0.0.48 release coverage. - #3984 -> `docs/about/release-notes.mdx`: Captures Hermes messaging policy scoping in the v0.0.48 release notes. - #3963 -> `docs/about/release-notes.mdx`: Captures DGX Spark Hermes GPU recreation startup recovery in the v0.0.48 release notes. - #3961 -> `docs/about/release-notes.mdx`: Captures Discord loopback proxy routing in the v0.0.48 release notes. - #3940 -> `docs/about/release-notes.mdx`: Captures installer prompt clarification and express-install behavior in the v0.0.48 release notes. - #3946 -> `docs/about/release-notes.mdx`: Carries forward the Homebrew preinstall clarification in release coverage. - #3937 -> `docs/about/release-notes.mdx`: Carries forward the dashboard URL command and post-install next steps coverage. - #3921 -> `docs/about/release-notes.mdx`: Carries forward managed vLLM default behavior for DGX Spark and DGX Station. - #3931 -> `docs/about/release-notes.mdx`, `docs/reference/architecture.mdx`: Documents the sandbox `python` to `python3` compatibility symlink. - #1485 -> `docs/about/release-notes.mdx`, `docs/reference/architecture.mdx`: Documents the sandbox image Docker health check. - #3784 -> `docs/about/release-notes.mdx`: Captures VM-driver snapshot health-check reliability in release notes. - #3917 -> `docs/about/release-notes.mdx`: Captures package-based workspace template resolution in release notes. - #3170 -> `docs/about/release-notes.mdx`: Captures installer checksum compatibility from preferring `sha256sum`. - #3898 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for messaging provider scenario validation. - #3897 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for baseline onboarding scenario validation. - #3834 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for PR review advisor automation. - #3838 -> `docs/about/release-notes.mdx`: Adds v0.0.47 release coverage for CLI display registry refactoring. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) `make docs` was attempted but could not complete because `npx fern-api` failed with `403 Forbidden` from `https://registry.npmjs.org/fern-api` in this environment. Pre-commit and pre-push hooks passed after refreshing the local CLI build output with `npm run build:cli`; no build artifacts were committed. --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Added WSL onboarding notes for Windows-host Ollama detection, restart guidance, and PowerShell checks. * Clarified express-install behavior (non-interactive, sudo prompts) and default sandbox policy selection. * Added Windows preparation guidance when installer tooling is missing (winget/App Installer or Docker Desktop). * Expanded sandbox docs with Docker health checks, Homebrew/python compatibility helpers, share-mount path validation, Discord troubleshooting, and new v0.0.48/v0.0.47 release notes. * **Chores** * Improved docs preview workflow error handling. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4007?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
HEALTHCHECKinstruction toDockerfilefor container health monitoringTest plan
HEALTHCHECKrunsdocker inspectFixes #1430
🤖 Generated with Claude Code
Summary by CodeRabbit
Signed-off-by: dknos rneebo@gmail.com