/status: show usage fetch errors instead of silently hiding#54803
/status: show usage fetch errors instead of silently hiding#54803felixclack wants to merge 18 commits intoopenclaw:mainfrom
Conversation
- Install Ruby 3.3.8 via ruby-build with bundler - Add PostgreSQL 16 with pgvector extension from official repo - Install GitHub CLI (gh) for repo operations - Add OpenAI Codex CLI globally - Configure fly.toml for clawdbot-felix app in lhr region - Remove non-root user constraint for volume access Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create fly-entrypoint.sh to symlink tool configs to /data volume - Persists ~/.codex, ~/.config/gh, ~/.gitconfig, ~/.ssh across restarts - Auto-creates openclaw wrapper in PATH on startup Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add sync-runtime-config.mjs: idempotent startup script that merges env vars into /data/openclaw.json (solves config-set-hangs problem) - Add fly-deploy.yml GitHub Actions workflow with secret validation, app/volume auto-creation, and one-shot RESET_CONFIG support - Bump NODE_OPTIONS to --max-old-space-size=2880 --optimize-for-size for the 4GB machine (was 1536 for 2GB) - Update fly.toml VM memory to 4096mb - Add gawk to Dockerfile (fixes mawk keyword clash with awk scripts) - Add RESET_CONFIG one-shot flag and PostgreSQL conditional start to entrypoint
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Splits Dockerfile into builder and runtime stages. Build tools (Rust, build-essential, Bun, devDependencies) stay in the builder stage and are excluded from the final image. Ruby is compiled to /opt/ruby for a clean copy. pnpm prune --prod strips dev deps after build. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This V8 flag is disallowed in NODE_OPTIONS and causes node to exit immediately with code 9. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The sync-runtime-config script was creating config-level auth.profiles entries (mode: "token") for providers with env vars set. This overrode the gateway's auto-discovery of store-level profiles (e.g. OAuth tokens in auth-profiles.json), causing FailoverError: "No API key found for provider openai-codex" on cron tasks and Telegram responses. The fix removes config-level auth.profiles creation. Provider detection for model selection is preserved. Auth resolution now correctly falls back to the auth store where OAuth profiles live. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove [http_service] from fly.toml so Fly no longer routes public internet traffic to the gateway. The control UI and all gateway endpoints are now only reachable via Tailscale at clawdbot-fly:3000. Telegram continues working via outbound long polling. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use `tailscale serve` to proxy HTTPS on port 443 to the gateway on port 3000, with automatic TLS cert provisioning. The control UI is now accessible at https://clawdbot-fly.<tailnet>.ts.net. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Codex <noreply@openai.com>
Improves UX by showing why usage info is missing. - Display provider API errors (e.g., timeout, auth failure) - Show structured errors from usage summary responses - Users no longer left wondering if usage tracking is broken - Better debugging signal for rate-limit/API issues
Greptile SummaryThis PR delivers a small but useful UX improvement: the Confidence Score: 4/5Safe to merge with the security note about running as root addressed or explicitly acknowledged. The core feature change is correct and well-implemented. The infrastructure changes are intentional for the Fly.io deployment context but include a notable security regression (dropping USER node) that should be acknowledged. The duplicate /health handler and unversioned curl installs are minor follow-up concerns that don't block the primary UX fix. Dockerfile (root user removal and unversioned installs), src/gateway/server-http.ts (duplicate /health handler)
|
| Filename | Overview |
|---|---|
| src/auto-reply/reply/commands-status.ts | Core UX fix: surfaces structured provider errors and caught exceptions as visible usage lines instead of silently nulling them out. Logic is correct and type-safe. |
| Dockerfile | Adds full Fly.io operator toolchain to runtime image and removes USER node non-root hardening; curl |
| src/gateway/server-http.ts | Adds early-return /health endpoint to bypass loadConfig() during startup, but duplicates an existing probe handler entry with a different response shape. |
| src/infra/sync-runtime-config.test.ts | New test suite for runtime config syncing; covers model/thinking/subagent overrides and control UI allowed-origins. |
Prompt To Fix All With AI
This is a comment left during a code review.
Path: Dockerfile
Line: 234
Comment:
**Container now runs as root — intentional?**
This PR removes both `RUN chown node:node /app`, all `--chown=node:node` COPY flags, and — most significantly — `USER node`. The original code included an explicit comment:
```
# Security hardening: Run as non-root user
# The node:24-bookworm image includes a 'node' user (uid 1000)
# This reduces the attack surface by preventing container escape via root privileges
USER node
```
The new toolchain (Tailscale, PostgreSQL, Bun in `/root/.bun/bin`) appears to require root at startup, and the entrypoint likely needs root to bootstrap these services. If that's the intent, it might be worth documenting with a comment why running as root is required here, or exploring whether the app process itself can drop privileges after the entrypoint bootstrapping completes (e.g., via `exec su-exec` or `gosu node "$@"` in `fly-entrypoint.sh`).
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: Dockerfile
Line: 131
Comment:
**Unversioned `curl | bash` installs reduce build reproducibility**
Two install scripts are fetched and executed without a version pin:
```dockerfile
RUN curl -fsSL https://bun.sh/install | bash # line 131 — always installs "latest"
RUN curl -fsSL https://tailscale.com/install.sh | sh # line 162 — same
```
Each Docker build may pull a different version, making the image non-deterministic and harder to audit. Consider pinning to a specific release (e.g., `BUN_VERSION=1.x.y` for bun, or the versioned Tailscale `apt` package) so the build is reproducible and the installed version is tracked in source control.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/gateway/server-http.ts
Line: 771-775
Comment:
**Duplicate `/health` handler shadows the existing probe infrastructure**
`/health` is already registered in `GATEWAY_PROBE_STATUS_BY_PATH` (line 110) and handled by `handleGatewayProbeRequest`, which returns `{ ok: true, status: "live" }`. The new early-return handler intercepts `req.url === "/health"` before the stages run and returns a different shape: `{ ok: true, ts: Date.now() }`.
The motivation is valid (bypass `loadConfig()` which can throw before the app is fully initialised, guaranteeing liveness probes always succeed on startup). If that's the goal, consider either:
- Removing `/health` from `GATEWAY_PROBE_STATUS_BY_PATH` to avoid the dead code path, or
- Adding a comment explaining why the early handler exists alongside the stage-based one.
As written, a reader has to trace the whole request pipeline to understand why `/health` is listed in the probe map but never reaches the probe handler.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "/status: show usage fetch errors instead..." | Re-trigger Greptile
| @@ -202,11 +233,9 @@ RUN --mount=type=cache,id=openclaw-bookworm-apt-cache,target=/var/cache/apt,shar | |||
| DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ | |||
| ca-certificates curl gnupg && \ | |||
There was a problem hiding this comment.
Container now runs as root — intentional?
This PR removes both RUN chown node:node /app, all --chown=node:node COPY flags, and — most significantly — USER node. The original code included an explicit comment:
# Security hardening: Run as non-root user
# The node:24-bookworm image includes a 'node' user (uid 1000)
# This reduces the attack surface by preventing container escape via root privileges
USER node
The new toolchain (Tailscale, PostgreSQL, Bun in /root/.bun/bin) appears to require root at startup, and the entrypoint likely needs root to bootstrap these services. If that's the intent, it might be worth documenting with a comment why running as root is required here, or exploring whether the app process itself can drop privileges after the entrypoint bootstrapping completes (e.g., via exec su-exec or gosu node "$@" in fly-entrypoint.sh).
Prompt To Fix With AI
This is a comment left during a code review.
Path: Dockerfile
Line: 234
Comment:
**Container now runs as root — intentional?**
This PR removes both `RUN chown node:node /app`, all `--chown=node:node` COPY flags, and — most significantly — `USER node`. The original code included an explicit comment:
```
# Security hardening: Run as non-root user
# The node:24-bookworm image includes a 'node' user (uid 1000)
# This reduces the attack surface by preventing container escape via root privileges
USER node
```
The new toolchain (Tailscale, PostgreSQL, Bun in `/root/.bun/bin`) appears to require root at startup, and the entrypoint likely needs root to bootstrap these services. If that's the intent, it might be worth documenting with a comment why running as root is required here, or exploring whether the app process itself can drop privileges after the entrypoint bootstrapping completes (e.g., via `exec su-exec` or `gosu node "$@"` in `fly-entrypoint.sh`).
How can I resolve this? If you propose a fix, please make it concise.|
|
||
| # Keep the Fly deployment runtime provisioned with the operator toolchain | ||
| # used by the live workspace and entrypoint bootstrapping. | ||
| RUN curl -fsSL https://bun.sh/install | bash |
There was a problem hiding this comment.
Unversioned
curl | bash installs reduce build reproducibility
Two install scripts are fetched and executed without a version pin:
RUN curl -fsSL https://bun.sh/install | bash # line 131 — always installs "latest"
RUN curl -fsSL https://tailscale.com/install.sh | sh # line 162 — sameEach Docker build may pull a different version, making the image non-deterministic and harder to audit. Consider pinning to a specific release (e.g., BUN_VERSION=1.x.y for bun, or the versioned Tailscale apt package) so the build is reproducible and the installed version is tracked in source control.
Prompt To Fix With AI
This is a comment left during a code review.
Path: Dockerfile
Line: 131
Comment:
**Unversioned `curl | bash` installs reduce build reproducibility**
Two install scripts are fetched and executed without a version pin:
```dockerfile
RUN curl -fsSL https://bun.sh/install | bash # line 131 — always installs "latest"
RUN curl -fsSL https://tailscale.com/install.sh | sh # line 162 — same
```
Each Docker build may pull a different version, making the image non-deterministic and harder to audit. Consider pinning to a specific release (e.g., `BUN_VERSION=1.x.y` for bun, or the versioned Tailscale `apt` package) so the build is reproducible and the installed version is tracked in source control.
How can I resolve this? If you propose a fix, please make it concise.| // Unauthenticated health check for Fly.io liveness probes. | ||
| if (req.method === "GET" && req.url === "/health") { | ||
| sendJson(res, 200, { ok: true, ts: Date.now() }); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Duplicate
/health handler shadows the existing probe infrastructure
/health is already registered in GATEWAY_PROBE_STATUS_BY_PATH (line 110) and handled by handleGatewayProbeRequest, which returns { ok: true, status: "live" }. The new early-return handler intercepts req.url === "/health" before the stages run and returns a different shape: { ok: true, ts: Date.now() }.
The motivation is valid (bypass loadConfig() which can throw before the app is fully initialised, guaranteeing liveness probes always succeed on startup). If that's the goal, consider either:
- Removing
/healthfromGATEWAY_PROBE_STATUS_BY_PATHto avoid the dead code path, or - Adding a comment explaining why the early handler exists alongside the stage-based one.
As written, a reader has to trace the whole request pipeline to understand why /health is listed in the probe map but never reaches the probe handler.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server-http.ts
Line: 771-775
Comment:
**Duplicate `/health` handler shadows the existing probe infrastructure**
`/health` is already registered in `GATEWAY_PROBE_STATUS_BY_PATH` (line 110) and handled by `handleGatewayProbeRequest`, which returns `{ ok: true, status: "live" }`. The new early-return handler intercepts `req.url === "/health"` before the stages run and returns a different shape: `{ ok: true, ts: Date.now() }`.
The motivation is valid (bypass `loadConfig()` which can throw before the app is fully initialised, guaranteeing liveness probes always succeed on startup). If that's the goal, consider either:
- Removing `/health` from `GATEWAY_PROBE_STATUS_BY_PATH` to avoid the dead code path, or
- Adding a comment explaining why the early handler exists alongside the stage-based one.
As written, a reader has to trace the whole request pipeline to understand why `/health` is listed in the probe map but never reaches the probe handler.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f7249fb694
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const desiredWorkspace = trim(process.env.OPENCLAW_WORKSPACE_DIR) || `${stateDir}/workspace`; | ||
| const agents = ensureObject(config, "agents"); | ||
| const defaults = ensureObject(agents, "defaults"); | ||
| if (defaults.workspace !== desiredWorkspace) { | ||
| defaults.workspace = desiredWorkspace; |
There was a problem hiding this comment.
Preserve existing workspace unless OPENCLAW_WORKSPACE_DIR is set
This block rewrites agents.defaults.workspace on every startup even when no workspace override env var is provided, because it defaults to ${stateDir}/workspace unconditionally. In deployments that already have a custom workspace in openclaw.json, the value is silently clobbered after restart, which can redirect or orphan existing session/workspace data. Only setting this field when OPENCLAW_WORKSPACE_DIR is explicitly provided (or when defaults.workspace is missing) would avoid this regression.
Useful? React with 👍 / 👎.
| sleep 2 | ||
|
|
||
| # Connect to Tailscale network | ||
| tailscale up --authkey="$TAILSCALE_AUTHKEY" --hostname="clawdbot-fly" --accept-routes |
There was a problem hiding this comment.
Prevent Tailscale bootstrap failures from stopping gateway launch
Because the script runs with set -e, a non-zero exit from tailscale up aborts the entrypoint before exec "$@" runs. When TAILSCALE_AUTHKEY is set, transient control-plane/auth failures therefore take down the whole gateway process instead of just degrading Tailscale exposure. This should be retried or handled as non-fatal so the gateway can still start.
Useful? React with 👍 / 👎.
|
Codex review: found issues before merge. What this changes: The PR attempts to show provider-usage fetch errors in chat Maintainer follow-up before merge: The status fix is narrow, but this PR mixes it with security-sensitive Docker/Fly/runtime-config changes and needs maintainer splitting/security review rather than an automated repair on the existing branch. Security review: Security review needs attention: Security review needs attention because the diff introduces container privilege and supply-chain risks unrelated to the Review findings:
Review detailsBest possible solution: Land the status UX improvement as a narrow current-main patch in Do we have a high-confidence way to reproduce the issue? Yes. A focused reproduction is to mock Is this the best way to solve the issue? No. The PR's narrow status idea is right, but the implementation belongs in the shared current renderer and should not be bundled with unrelated deployment, container, workflow, SDK, and template changes. Full review comments:
Overall correctness: patch is incorrect Security concerns:
Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against ef799fd57a77. |
UX Improvement
Improves
/statuscommand feedback by showing why usage information is unavailable instead of silently omitting it.Changes
User Impact
When the usage API is slow, unavailable, or misconfigured, users now see an explicit message like:
📊 Usage: (unavailable: timeout)📊 Usage: (error: API authentication failed)Previously, the usage line would silently disappear, leaving users confused about whether their quota was being tracked.