Skip to content

fix(cli): clarify completion cache timeout message after openclaw update#72850

Merged
steipete merged 1 commit into
openclaw:mainfrom
iot2edge:fix/update-completion-cache-timeout-message
Apr 27, 2026
Merged

fix(cli): clarify completion cache timeout message after openclaw update#72850
steipete merged 1 commit into
openclaw:mainfrom
iot2edge:fix/update-completion-cache-timeout-message

Conversation

@iot2edge

Copy link
Copy Markdown
Contributor

Fixes #72842

Summary

  • Problem: When openclaw update finishes, it spawns a child process to refresh the shell completion cache with a 30-second timeout. On slower environments (Debian/Docker, slow disk, large bundled-plugin trees) that child can exceed 30s and is killed. The user sees Completion cache update failed: Error: spawnSync /usr/bin/node ETIMEDOUT — Node's raw error string with no context.
  • Why it matters: The reporter of [Bug]: Why have the last two upgrades both reported the error: Completion cache update failed: Error: spawnSync /usr/bin/node ETIMEDOUT? #72842 saw this error twice in a row and explicitly asked "why does this happen?" The current message exposes a low-level Node error without telling the user (a) it's non-fatal, (b) what the cache does, or (c) how to recover. It looks alarming for what is actually a benign tab-completion staleness.
  • What changed: In tryWriteCompletionCache (src/cli/update-cli/shared.ts), when the spawned child returns result.error, detect ETIMEDOUT and surface "timed out after 30s" instead of the raw Error: spawnSync /usr/bin/node ETIMEDOUT. Both the result.error and the non-zero result.status branches now also append a single-line manual-refresh hint: "Shell tab-completion may be stale; refresh manually with: openclaw completion --write-state". Added a regression test in src/cli/update-cli.test.ts that asserts the friendly text is logged and the raw Error: spawnSync string is not.
  • What did NOT change (scope boundary): The 30-second timeout itself, the spawnSync call shape, the OPENCLAW_COMPLETION_SKIP_PLUGIN_COMMANDS env var, the silent return on error (still non-fatal), JSON-mode silence, and the duplicate-but-separate COMPLETION_CACHE_WRITE_TIMEOUT_MS constant in src/commands/doctor-completion.ts. Did not touch the underlying performance issue (fast plugin-load on cold start) — that is environment-specific and out of scope for a UX fix.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause

  • Root cause: After a successful openclaw update, update-command.ts calls tryWriteCompletionCache(postUpdateRoot, jsonMode). That function spawns node openclaw.mjs completion --write-state with a 30-second timeout. When the child takes longer than 30s (slow disk, bundled plugin tree cold-load, Docker overlayfs), Node's spawnSync kills the child and returns {error: SystemError(code: "ETIMEDOUT")}. The existing handler interpolates that error directly into the user-facing log: Completion cache update failed: ${String(result.error)}. The result string is Error: spawnSync /usr/bin/node ETIMEDOUT — accurate but unhelpful.
  • Missing detection / guardrail: there was no test asserting the user-visible message text on the timeout/error path; the only existing coverage at src/cli/update-cli.test.ts:534 checked the spawn invocation arguments. So the message wording was never exercised.
  • Contributing context (if known): the same 30-second timeout pattern is duplicated in src/commands/doctor-completion.ts:19-41, but that path uses result.status === 0 only and doesn't surface the error to the user, so the unhelpful message was Update-only.

Regression Test Plan

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/cli/update-cli.test.ts (in the same describe block as "bounds completion cache refresh during update follow-up").
  • Scenario the test should lock in:
    1. spawnSync returns {error: <ETIMEDOUT>} (mocked once).
    2. Logged warning includes "timed out after 30s".
    3. Logged warning includes the manual-refresh hint "openclaw completion --write-state".
    4. Logged warning does not include the raw "Error: spawnSync" string (regression guard against falling back to String(result.error)).
  • Why this is the smallest reliable guardrail: tryWriteCompletionCache is a small async function with one external call (spawnSync). Mocking that single seam exercises every branch deterministically; no real subprocess is needed and the test runs in milliseconds.
  • Existing test that already covers this (if any): None — bounds completion cache refresh during update follow-up only asserts the spawn invocation arguments, not the error path or message text.
  • If no new test is added, why not: N/A — 1 new test added.

User-visible / Behavior Changes

  • Before:
    Completion cache update failed: Error: spawnSync /usr/bin/node ETIMEDOUT
    
  • After:
    Completion cache update failed: timed out after 30s. Shell tab-completion may be stale; refresh manually with: openclaw completion --write-state
    
  • For non-timeout errors (rare), the message now also appends the manual-refresh hint. For non-zero exit status, same.
  • JSON mode (--json) still suppresses the message entirely. Behavior for the success path is unchanged.
  • The function is still non-fatal (returns silently after logging); openclaw update still reports overall success.

Diagram

openclaw update flow

Before:
[update succeeds]
  -> tryWriteCompletionCache spawns child (30s timeout)
  -> child times out (ETIMEDOUT)
  -> log: "Completion cache update failed: Error: spawnSync /usr/bin/node ETIMEDOUT"
  -> user: "is my install broken?"

After:
[update succeeds]
  -> tryWriteCompletionCache spawns child (30s timeout)
  -> child times out (ETIMEDOUT)
  -> log: "Completion cache update failed: timed out after 30s.
          Shell tab-completion may be stale; refresh manually with:
          openclaw completion --write-state"
  -> user knows it's harmless and how to recover

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No (same spawnSync shape)
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Debian (Docker), per the issue reporter
  • Runtime/container: Node 22+, npm-installed openclaw global
  • Model/provider: N/A — this is a CLI flow, not a model call
  • Integration/channel (if any): N/A
  • Relevant config (redacted): None

Steps

  1. Install openclaw via npm install -g openclaw@latest in an environment where the bundled-plugin-tree cold-load takes >30s (e.g. Docker on slow disk).
  2. Run openclaw update.
  3. Observe the closing log line.

Expected

  • Update reports overall success.
  • Closing message: Completion cache update failed: timed out after 30s. Shell tab-completion may be stale; refresh manually with: openclaw completion --write-state.

Actual (before fix)

  • Update reports overall success.
  • Closing message: Completion cache update failed: Error: spawnSync /usr/bin/node ETIMEDOUT — opaque and alarming.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)
$ pnpm test src/cli/update-cli.test.ts
 Test Files  1 passed (1)
      Tests  63 passed (63)   # 62 existing + 1 new for the friendly-timeout message

pnpm check:changed is green: conflict markers, typecheck, lint, runtime import cycles, plus the various core repo guard checks all pass.

Human Verification (required)

  • Verified scenarios:
    • Targeted vitest run for src/cli/update-cli.test.ts (63/63 pass locally).
    • Full pnpm check:changed gate (all lanes green, including the additional core guards that fire when src/cli/** is touched).
    • Re-read the function before and after the change to confirm the success and JSON-mode paths are untouched.
  • Edge cases checked:
    • result.error with code: "ETIMEDOUT" → friendly message with manual-refresh hint.
    • result.error without code: "ETIMEDOUT" → falls back to String(result.error), still appends manual-refresh hint.
    • Non-zero result.status (child ran but exited with error) → keeps original (${stderr}) detail and appends manual-refresh hint.
    • JSON mode (jsonMode: true) → no message logged in any branch.
    • Successful run (status: 0, no error) → no message logged (unchanged).
  • What you did not verify:
    • Reproducing the original ETIMEDOUT on a real Debian/Docker install. The fix is purely user-facing message text and is fully covered by the unit test mock; reproducing the underlying performance issue requires a slow-disk environment that I do not have access to. The test asserts the exact branch the reporter hit.
    • The duplicate COMPLETION_CACHE_WRITE_TIMEOUT_MS in src/commands/doctor-completion.ts. That path doesn't surface a user-facing failure message and is therefore not affected by the bug; harmonizing the two constants would be a separate refactor and is out of scope for this PR.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

(Both will be checked once review activity lands. Currently no bot review conversations on this PR.)

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: a downstream tool that scrapes the existing Error: spawnSync /usr/bin/node ETIMEDOUT string will no longer match.
    • Mitigation: this is a warning log for human consumption, not part of any documented machine-readable surface. JSON mode (--json), which is what scriptable callers use, is unchanged and still emits no completion-cache message at all.
  • Risk: the manual-refresh hint slightly lengthens the warning line on narrow terminals.
    • Mitigation: it's still a single line and warning-level (yellow). The added information directly answers the question the reporter asked. If a maintainer prefers a two-line layout, splitting on theme.warn(...) is a one-line follow-up.

@openclaw-barnacle openclaw-barnacle Bot added channel: telegram Channel integration: telegram cli CLI command changes size: S labels Apr 27, 2026
@greptile-apps

greptile-apps Bot commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR improves the user-facing error message when the shell-completion cache write times out after openclaw update, replacing the opaque Error: spawnSync /usr/bin/node ETIMEDOUT with a friendly "timed out after 30s" string and a manual-refresh hint. It also bundles two unrelated fixes: dynamic 1 M context-window resolution for GitHub Copilot models with -1m/-1m-internal suffixes, and a display-label correction for nested provider model IDs (OpenRouter) in the Telegram extension.

Confidence Score: 4/5

Safe to merge; the core fix is correct and well-tested, with one low-priority question about an "internal" model ID in public defaults.

All three changes are small, targeted, and covered by new tests. The only finding is a P2 question about claude-opus-4.7-1m-internal appearing in the default Copilot model list — not a bug, but worth confirming intent before it ships to all users.

extensions/github-copilot/models-defaults.ts — confirm the "internal" model ID in DEFAULT_MODEL_IDS is intentional.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/github-copilot/models-defaults.ts
Line: 11

Comment:
**"internal" model ID in public defaults**

`claude-opus-4.7-1m-internal` contains "internal" in its name, which normally signals a model not intended for broad user exposure. The comment above the list says unknown model IDs are harmless (Copilot returns an error and users can remove it), but surfacing an internal-tagged model in the default list could surprise users who see it appear in their model picker or wonder why it errors out. Was this intentional, or should it be gated behind a separate config opt-in?

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(cli): clarify completion cache timeo..." | Re-trigger Greptile

const DEFAULT_MAX_TOKENS = 8192;

// Copilot model ids vary by plan/org and can change.
// We keep this list intentionally broad; if a model isn't available Copilot will
// return an error and users can remove it from their config.
const DEFAULT_MODEL_IDS = [
"claude-opus-4.7",
"claude-opus-4.7-1m-internal",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 "internal" model ID in public defaults

claude-opus-4.7-1m-internal contains "internal" in its name, which normally signals a model not intended for broad user exposure. The comment above the list says unknown model IDs are harmless (Copilot returns an error and users can remove it), but surfacing an internal-tagged model in the default list could surprise users who see it appear in their model picker or wonder why it errors out. Was this intentional, or should it be gated behind a separate config opt-in?

Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/github-copilot/models-defaults.ts
Line: 11

Comment:
**"internal" model ID in public defaults**

`claude-opus-4.7-1m-internal` contains "internal" in its name, which normally signals a model not intended for broad user exposure. The comment above the list says unknown model IDs are harmless (Copilot returns an error and users can remove it), but surfacing an internal-tagged model in the default list could surprise users who see it appear in their model picker or wonder why it errors out. Was this intentional, or should it be gated behind a separate config opt-in?

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@iot2edge iot2edge force-pushed the fix/update-completion-cache-timeout-message branch from b3d54e7 to 4161b24 Compare April 27, 2026 14:42
@openclaw-barnacle openclaw-barnacle Bot added size: XS and removed channel: telegram Channel integration: telegram size: S labels Apr 27, 2026
@iot2edge

Copy link
Copy Markdown
Contributor Author

claude-opus-4.7-1m-internal is the exact id that GitHub Copilot's /models endpoint advertises (per the evidence in #72805 — the reporter pasted the upstream listing). It's labeled "Internal only" by Copilot's own UI, meaning it's gated by plan/org rather than a hidden test model. Following the file's existing policy — "Copilot model ids vary by plan/org and can change. We keep this list intentionally broad; if a model isn't available Copilot will return an error and users can remove it from their config." — including it is consistent with how o1 and other plan-restricted entries are already shipped. If maintainers prefer to gate it behind a separate opt-in, happy to split.

@steipete steipete force-pushed the fix/update-completion-cache-timeout-message branch 4 times, most recently from ce9f837 to c6b9e13 Compare April 27, 2026 20:09
When the post-update completion cache refresh times out (slow disk,
large bundled plugin tree, Docker overlayfs), the user previously saw
the opaque 'Completion cache update failed: Error: spawnSync
/usr/bin/node ETIMEDOUT'. Detect ETIMEDOUT specifically, surface
'timed out after 30s', and append a manual refresh hint pointing at
'openclaw completion --write-state' so users know it's non-fatal and
how to recover.

Fixes openclaw#72842
@steipete steipete force-pushed the fix/update-completion-cache-timeout-message branch from c6b9e13 to cec39ab Compare April 27, 2026 20:12
@steipete steipete merged commit 9892838 into openclaw:main Apr 27, 2026
67 of 68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cli CLI command changes size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Why have the last two upgrades both reported the error: Completion cache update failed: Error: spawnSync /usr/bin/node ETIMEDOUT?

2 participants