feat(ci): add non-root sandbox smoke test as PR gate#2711
Conversation
Add a lightweight E2E job to pr-self-hosted.yaml that validates the sandbox container starts successfully under --security-opt=no-new-privileges (the constraint applied by OpenShell on Brev Launchable and DGX Spark). The test: 1. Starts the production image as the sandbox user with no-new-privileges 2. Polls /health until the gateway responds (HTTP 200 or 401) 3. Verifies 'openclaw tui --help' succeeds (gateway auth token available) 4. Confirms the non-root fallback path is exercised in logs This would have caught Bug 1 from the Apr 20-25 regression (#2472) where install_configure_guard crashed under Landlock + set -e, preventing the gateway from starting — a 5-day outage invisible to all existing CI gates. Estimated CI overhead: ~2 minutes (health poll dominates). Closes #2571
📝 WalkthroughWalkthroughThis PR introduces a new CI job and supporting test script to validate sandbox startup under non-root constraints with security restrictions, catching potential entrypoint failures early in the development cycle. Changes
Sequence DiagramsequenceDiagram
actor Runner as GitHub Runner
participant Script as test-non-root-smoke.sh
participant Docker as Docker Engine
participant Container as Container (non-root)
participant Gateway as Gateway Service
participant CLI as OpenCLAW CLI
Runner->>Script: Execute smoke test
Script->>Docker: Check Docker availability
Script->>Docker: Verify image exists
Script->>Docker: Extract UID/GID from image
Script->>Docker: Start container (--security-opt no-new-privileges)
Docker->>Container: Launch with security constraints
Container->>Gateway: Entrypoint starts gateway
Script->>Container: Wait for startup
Script->>Gateway: Poll /health endpoint (timeout: 60s)
Gateway-->>Script: Response (200 or 401)
Script->>Container: Execute openclaw tui --help
Container->>CLI: Start CLI tool
CLI-->>Container: Output result
Script->>Script: Verify auth token not missing
Script->>Docker: Capture logs
Script->>Docker: Remove container
Script-->>Runner: Test result
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Review rate limit: 8/10 reviews remaining, refill in 10 minutes and 53 seconds. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-non-root-smoke.sh`:
- Around line 153-158: The curl in the health probe loop (where HTTP_CODE is
set) lacks per-request timeouts so a single stalled connection can block the
whole polling loop; update the curl invocation that sets HTTP_CODE to include
--connect-timeout 1 and --max-time "$HEALTH_INTERVAL" (using the existing
HEALTH_INTERVAL variable) so each probe respects the polling interval and the
overall HEALTH_TIMEOUT behavior remains effective.
- Around line 118-124: The docker run invocation currently redirects both stdout
and stderr to /dev/null which hides failures and causes the script to exit
before the diagnostic block can run; change the invocation that starts the
container (the docker run that uses CONTAINER_NAME, SB_UID, SB_GID, HOST_PORT,
DASHBOARD_PORT, and IMAGE) so that you capture stderr (e.g., to a temp file or
variable) instead of discarding it, check the docker run exit status, and on
failure print the captured stderr (and any contextual info) before exiting so
the existing crash-diagnostic block can run; ensure you still respect set -e by
handling the failure explicitly (checking $? or using ||) rather than relying on
silent redirection.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: bd0fc978-3cd1-4405-acff-45727c88d0a1
📒 Files selected for processing (2)
.github/workflows/pr-self-hosted.yamltest/e2e/test-non-root-smoke.sh
| info "2. Start container with --security-opt no-new-privileges --user $SB_UID:$SB_GID" | ||
| docker run -d \ | ||
| --name "$CONTAINER_NAME" \ | ||
| --security-opt no-new-privileges \ | ||
| --user "${SB_UID}:${SB_GID}" \ | ||
| -p "${HOST_PORT}:${DASHBOARD_PORT}" \ | ||
| "$IMAGE" >/dev/null 2>&1 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n test/e2e/test-non-root-smoke.sh | sed -n '100,150p'Repository: NVIDIA/NemoClaw
Length of output: 2179
🏁 Script executed:
head -30 test/e2e/test-non-root-smoke.shRepository: NVIDIA/NemoClaw
Length of output: 1258
🏁 Script executed:
grep -n "set -e" test/e2e/test-non-root-smoke.shRepository: NVIDIA/NemoClaw
Length of output: 156
Expose docker run failure messages for debugging.
With set -euo pipefail enabled and both stdout and stderr redirected to /dev/null on line 124, if the docker run command fails, the error message is hidden and the script exits immediately due to set -e—preventing the crash-diagnostic block (lines 126-134) from executing. This makes it difficult to diagnose why the container failed to start.
Wrap the command to catch failures and preserve stderr visibility:
Suggested fix
info "2. Start container with --security-opt no-new-privileges --user $SB_UID:$SB_GID"
-docker run -d \
+if ! docker run -d \
--name "$CONTAINER_NAME" \
--security-opt no-new-privileges \
--user "${SB_UID}:${SB_GID}" \
-p "${HOST_PORT}:${DASHBOARD_PORT}" \
- "$IMAGE" >/dev/null 2>&1
+ "$IMAGE" >/dev/null; then
+ fail "Container failed to start"
+ exit 1
+fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| info "2. Start container with --security-opt no-new-privileges --user $SB_UID:$SB_GID" | |
| docker run -d \ | |
| --name "$CONTAINER_NAME" \ | |
| --security-opt no-new-privileges \ | |
| --user "${SB_UID}:${SB_GID}" \ | |
| -p "${HOST_PORT}:${DASHBOARD_PORT}" \ | |
| "$IMAGE" >/dev/null 2>&1 | |
| info "2. Start container with --security-opt no-new-privileges --user $SB_UID:$SB_GID" | |
| if ! docker run -d \ | |
| --name "$CONTAINER_NAME" \ | |
| --security-opt no-new-privileges \ | |
| --user "${SB_UID}:${SB_GID}" \ | |
| -p "${HOST_PORT}:${DASHBOARD_PORT}" \ | |
| "$IMAGE" >/dev/null; then | |
| fail "Container failed to start" | |
| exit 1 | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/test-non-root-smoke.sh` around lines 118 - 124, The docker run
invocation currently redirects both stdout and stderr to /dev/null which hides
failures and causes the script to exit before the diagnostic block can run;
change the invocation that starts the container (the docker run that uses
CONTAINER_NAME, SB_UID, SB_GID, HOST_PORT, DASHBOARD_PORT, and IMAGE) so that
you capture stderr (e.g., to a temp file or variable) instead of discarding it,
check the docker run exit status, and on failure print the captured stderr (and
any contextual info) before exiting so the existing crash-diagnostic block can
run; ensure you still respect set -e by handling the failure explicitly
(checking $? or using ||) rather than relying on silent redirection.
| info "4. Poll gateway /health (timeout: ${HEALTH_TIMEOUT}s)" | ||
| ELAPSED=0 | ||
| HEALTH_OK=false | ||
|
|
||
| while [ "$ELAPSED" -lt "$HEALTH_TIMEOUT" ]; do | ||
| HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:${HOST_PORT}/health" 2>/dev/null || echo "000") |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Check file existence and read the relevant section
wc -l test/e2e/test-non-root-smoke.shRepository: NVIDIA/NemoClaw
Length of output: 95
🏁 Script executed:
# Read the file around lines 153-158 with context
sed -n '140,170p' test/e2e/test-non-root-smoke.shRepository: NVIDIA/NemoClaw
Length of output: 1008
🏁 Script executed:
# Search for HEALTH_INTERVAL and HEALTH_TIMEOUT definitions
rg 'HEALTH_(INTERVAL|TIMEOUT)' test/e2e/test-non-root-smoke.sh -B 2 -A 2Repository: NVIDIA/NemoClaw
Length of output: 981
🏁 Script executed:
# Check if SPDX license header is present
head -5 test/e2e/test-non-root-smoke.shRepository: NVIDIA/NemoClaw
Length of output: 249
Add per-request timeouts to the curl health check.
Line 158 calls curl without timeout flags. If the socket accepts but never responds, a single probe can block indefinitely, defeating the 60s timeout guarantee. Add --connect-timeout 1 --max-time "$HEALTH_INTERVAL" to ensure each request times out within the polling interval:
Suggested fix
- HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:${HOST_PORT}/health" 2>/dev/null || echo "000")
+ HTTP_CODE=$(curl -s --connect-timeout 1 --max-time "$HEALTH_INTERVAL" -o /dev/null -w "%{http_code}" "http://localhost:${HOST_PORT}/health" 2>/dev/null || echo "000")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| info "4. Poll gateway /health (timeout: ${HEALTH_TIMEOUT}s)" | |
| ELAPSED=0 | |
| HEALTH_OK=false | |
| while [ "$ELAPSED" -lt "$HEALTH_TIMEOUT" ]; do | |
| HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:${HOST_PORT}/health" 2>/dev/null || echo "000") | |
| info "4. Poll gateway /health (timeout: ${HEALTH_TIMEOUT}s)" | |
| ELAPSED=0 | |
| HEALTH_OK=false | |
| while [ "$ELAPSED" -lt "$HEALTH_TIMEOUT" ]; do | |
| HTTP_CODE=$(curl -s --connect-timeout 1 --max-time "$HEALTH_INTERVAL" -o /dev/null -w "%{http_code}" "http://localhost:${HOST_PORT}/health" 2>/dev/null || echo "000") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/test-non-root-smoke.sh` around lines 153 - 158, The curl in the
health probe loop (where HTTP_CODE is set) lacks per-request timeouts so a
single stalled connection can block the whole polling loop; update the curl
invocation that sets HTTP_CODE to include --connect-timeout 1 and --max-time
"$HEALTH_INTERVAL" (using the existing HEALTH_INTERVAL variable) so each probe
respects the polling interval and the overall HEALTH_TIMEOUT behavior remains
effective.
Summary
Add a lightweight E2E job to
pr-self-hosted.yamlthat validates the sandbox container starts successfully under--security-opt=no-new-privileges— the constraint applied by OpenShell on Brev Launchable and DGX Spark.What This Tests
no-new-privileges/healthuntil the gateway responds (HTTP 200 or 401) within 60sopenclaw tui --helpsucceeds (gateway auth token available)What This Would Have Caught
install_configure_guardwrites to.bashrc/.profilecrash under Landlock +set -e→ gateway never starts → HTTP probe fails immediatelyopenclaw.json→openclaw tuifails with "Missing gateway auth token"Implementation
test/e2e/test-non-root-smoke.sh— standalone E2E script following existing patterns (e2e-gateway-isolation.sh,e2e-port-overrides.sh)pr-self-hosted.yaml— newtest-non-root-sandbox-smokejob afterbuild-sandbox-images(reusesisolation-imageartifact)CI Impact
linux-amd64-cpu4(same as other E2E jobs)Strategy Gap Addressed
Closes #2571
Summary by CodeRabbit
Tests
Chores