Skip to content

fix(stability): session skills snapshot, tool-loop guard, TUI watchdog, LM Studio preload backoff#67401

Merged
obviyus merged 6 commits intoopenclaw:mainfrom
xantorres:fix/openclaw-stability
Apr 16, 2026
Merged

fix(stability): session skills snapshot, tool-loop guard, TUI watchdog, LM Studio preload backoff#67401
obviyus merged 6 commits intoopenclaw:mainfrom
xantorres:fix/openclaw-stability

Conversation

@xantorres
Copy link
Copy Markdown
Contributor

@xantorres xantorres commented Apr 15, 2026

Summary

Four stability fixes for issues hit during a single self-hosted OpenClaw + LM Studio debugging session. All are reproducible, low-surface, and include unit tests.

  • Problem: disabling a bundled skill in config still left the model calling it, producing infinite `Tool X not found` loops until the embedded-run timeout. Root cause: the `skillsSnapshot` persisted in `sessions.json` was never invalidated when `skills.*` config changed.
  • Problem: the `unknownToolThreshold` stream guard was gated behind `tools.loopDetection.enabled`, which defaults to `false`. The protection against hallucinated / removed tool calls was effectively off in the stock config.
  • Problem: the TUI `streaming · Xm Ys` indicator never reset when the gateway's `state: "final"` event was lost (WS reconnect, gateway restart, etc.), leaving the TUI stuck indefinitely until killed.
  • Problem: LM Studio's memory guardrail rejecting `POST /v1/models/load` caused OpenClaw to re-hit the endpoint on every chat request (~every 2s), producing hundreds of WARN log lines per hour without useful retry semantics.
  • Why it matters: each of these failure modes amplifies any other local-model hiccup into a session-long stuck state that users have to recover manually.
  • What changed:
    • `src/gateway/config-reload.ts`: bump `skillsSnapshotVersion` when a config diff touches `skills.*`, via a new `shouldInvalidateSkillsSnapshotForPaths` helper wired into the single `applySnapshot` code path (covers both watcher writes and in-process `config.apply`).
    • `src/agents/pi-embedded-runner/run/attempt.ts`: make `resolveUnknownToolGuardThreshold` always return a positive threshold (default 10) regardless of `tools.loopDetection.enabled`. The guard is a pure safety net with no false-positive surface.
    • `src/tui/tui-event-handlers.ts`: add a 30s delta-silence watchdog that resets `activityStatus` to `idle` on timeout and surfaces a short system-log note; exposes `dispose()` + configurable `streamingWatchdogMs` context option.
    • `extensions/lmstudio/src/stream.ts`: add per-`(baseUrl, modelKey, contextLength)` cooldown (5s → 10s → 20s → … → 5min cap) after preload failures; during cooldown the wrapper skips preload entirely and runs the inference stream directly (the model is often already loaded via LM Studio's UI). The log line now carries consecutive-failure count and remaining cooldown.
  • What did NOT change (scope boundary): no schema additions, no new config keys (except the TUI `streamingWatchdogMs` developer option), no changes to the public Plugin SDK, no touches to session persistence format, no changes to the `detectToolCallLoop` / `before-tool-call` dispatcher path.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause (if applicable)

  1. Skills snapshot staleness. `getSkillsSnapshotVersion` is only bumped on filesystem changes to the skills tree (watched via chokidar in `src/agents/skills/refresh.ts`). Config changes to `skills.allowBundled` / `skills.entries.*` / `skills.profile` go through `config-reload` and `config.apply` without ever bumping the version. Sessions keep reading the cached `skillsSnapshot` from `sessions.json`, advertising skills that no longer exist in the active config.

  2. Unknown-tool guard gated behind opt-in flag. Historical commit introduced `tools.loopDetection.enabled` as a coarse toggle for repetition detectors (genericRepeat / pingPong / pollNoProgress). The unknown-tool guard is a different concern — it only triggers when the model is calling a tool that is objectively not registered in this run — but it was wired through the same gate. Default install effectively had no protection against `Tool X not found` loops.

  3. TUI streaming state. `tui-event-handlers.ts` transitions `activityStatus` to `streaming` on `evt.state === "delta"` and back to `idle` / `error` only on the corresponding `final` / `aborted` / `error` event. The gateway does emit a `final` event, but delivery over the WebSocket is not atomic with the run lifecycle — a connection flap or gateway restart between completion and delivery leaves the indicator pinned. There was no client-side fallback.

  4. LM Studio preload. The wrapper dedupes concurrent preloads via `preloadInFlight`, but on failure it immediately drops the entry in `.finally()`, so the next chat request creates a brand-new preload. Combined with no circuit breaker on the failure, a guardrail rejection (memory pressure, model not loadable) produces one failed preload per chat request for as long as the user stays in the session.

Test Plan

  • New unit coverage in each affected area (`config-reload.test.ts`, `attempt.test.ts`, `tui-event-handlers.test.ts`, `extensions/lmstudio/src/stream.test.ts`).
  • `pnpm test ` → 189 / 189 passing.
  • `pnpm tsgo` clean.
  • `pnpm check` clean (0 warnings, 0 errors across oxlint / import-cycle / madge / custom webhook / pairing-scope guards).

Scoped tests cover: skills-version bump wiring + path-prefix helper, threshold override / default / invalid-input / fractional-floor for the unknown-tool guard, watchdog arm / disarm / rearm / dispose / zero-disabled cases, preload cooldown skip + expiry-based retry.

Reviewer notes

  • `bumpSkillsSnapshotVersion` invalidates globally rather than per-session; that matches existing callers from `src/infra/skills-remote.ts` and is the desired semantic here. A config change is intended to affect every agent session under the workspace.
  • I considered adding a dispatcher-side guard in `tool-loop-detection.ts` too, but the observed loops go through the stream-wrapper path before a tool is ever dispatched (the model's output is not a call to a real tool), so that would not have helped.
  • The LM Studio cooldown resets on any successful preload, so healthy sessions are unaffected.
  • The TUI watchdog uses `.unref()` on its timer to avoid blocking process exit, and `dispose()` is exposed for callers that want to cancel it deterministically.
  • No changes to `AGENTS.md` / `CLAUDE.md` / `docs/` since the behavior changes are implementation-level and covered by the existing skills / streaming / LM Studio docs. Happy to add notes if reviewers prefer.

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime agents Agent runtime and tooling extensions: lmstudio size: L labels Apr 15, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 15, 2026

Greptile Summary

Four focused stability fixes for self-hosted LM Studio sessions: skills-snapshot staleness on config change, unknown-tool loop guard gated behind an opt-in flag, TUI streaming indicator getting stuck after lost final events, and LM Studio's preload endpoint being hammered on every request after a failure. Each fix is narrow, well-justified, and accompanied by unit tests covering the key edge cases (arm/disarm/rearm/dispose for the watchdog, backoff skip/expiry for the preload cooldown, threshold default/override/disabled for the tool guard, and path-prefix matching for the snapshot invalidation).

Confidence Score: 5/5

Safe to merge — all remaining findings are P2 style suggestions that do not affect correctness.

All four fixes address real, reproducible failure modes with no false-positive surface. Tests are comprehensive (189/189 passing per the PR), type-check and lint pass, and no public SDK or session-format contracts were changed. The only finding is a minor code-duplication/unreachable-fallback in config-reload.ts.

No files require special attention.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/gateway/config-reload.ts
Line: 43-55

Comment:
**Duplicate prefix-matching logic + unreachable fallback**

`firstSkillsChangedPath` replicates the exact predicate from `shouldInvalidateSkillsSnapshotForPaths`, so when the outer `if` is true the inner call is guaranteed to return a defined value — the `?? "skills.*"` on the log line is dead code. A single helper that returns the first matching path (or `undefined`) would remove the duplication and make the guarantee explicit:

```ts
function firstSkillsChangedPath(paths: string[]): string | undefined {
  return paths.find((p) =>
    SKILLS_INVALIDATION_PREFIXES.some(
      (prefix) => p === prefix || p.startsWith(`${prefix}.`),
    ),
  );
}

// replace shouldInvalidateSkillsSnapshotForPaths:
export function shouldInvalidateSkillsSnapshotForPaths(paths: string[]): boolean {
  return firstSkillsChangedPath(paths) !== undefined;
}
```

With this shape the `?? "skills.*"` fallback inside `applySnapshot` is no longer needed (and can be removed or kept as a defensive note, but at least it's clearly intentional).

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Extensions/lmstudio: back off inference ..." | Re-trigger Greptile

Comment thread src/gateway/config-reload.ts
@xantorres
Copy link
Copy Markdown
Contributor Author

The parity-gate CI job is red on this PR with:

Cannot find module '/home/runner/_work/openclaw/plugin-sdk/qa-lab.js'
imported from /home/runner/_work/openclaw/openclaw/dist/subcli-descriptors-DF5zCXSz.js

I reproduced this on a pristine upstream/main worktree (no patches from this PR) with:

OPENCLAW_BUILD_PRIVATE_QA=1 pnpm build
OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite \
  --provider-mode mock-openai --parity-pack agentic \
  --model openai/gpt-5.4 --alt-model openai/gpt-5.4-alt \
  --output-dir /tmp/qa-main-test

Same ERR_MODULE_NOT_FOUND resolving ../../plugin-sdk/qa-lab.js.

Root cause is the obfuscated dynamic specifier in src/cli/program/private-qa-cli.ts:5-8:

export function loadPrivateQaCliModule(): Promise<Record<string, unknown>> {
  const specifier = ["../../plugin-sdk/", "qa", "-lab.js"].join("");
  return import(specifier) as Promise<Record<string, unknown>>;
}

../../plugin-sdk/qa-lab.js is resolved against the importer's file URL at runtime. In source that is src/cli/program/private-qa-cli.ts, so it lands on src/plugin-sdk/qa-lab.js — correct. After tsdown bundles private-qa-cli.ts into dist/subcli-descriptors-<hash>.js, the same literal ../../plugin-sdk/qa-lab.js now walks up two directories from dist/ → parent of the repo root → the file Node is looking for at /home/runner/_work/openclaw/plugin-sdk/qa-lab.js. The built file lives at dist/plugin-sdk/qa-lab.js, so from the bundled emitter the correct specifier is ./plugin-sdk/qa-lab.js.

The obfuscation ([…].join("")) defeats tsdown's static-import rewriting, so this worked in source-layout dev but fails once the bundled dist/ is what Node actually loads.

Happy to open a follow-up PR that resolves the specifier against the bundled output (either drop the obfuscation and let tsdown rewrite it, or pin the target via new URL("./plugin-sdk/qa-lab.js", import.meta.url) inside private-qa-cli.ts). I didn't include that fix here because it's out of scope for the four stability issues this PR targets, and because the private QA dist is a separate build surface that maintainers own.

All other checks on the PR (check, check-additional, check-docs, build-artifacts, node/bundled shard suites, lmstudio extension-fast, etc.) are green. Happy to rebase if a fix for parity-gate lands on main.

@xantorres
Copy link
Copy Markdown
Contributor Author

Addressed Greptile's P2 on src/gateway/config-reload.ts in be903c4f37:

  • Extracted the prefix predicate into matchesSkillsInvalidationPrefix.
  • shouldInvalidateSkillsSnapshotForPaths now delegates to firstSkillsChangedPath(paths) !== undefined, so the two helpers share one implementation.
  • applySnapshot now reads firstSkillsChangedPath(changedPaths) once and branches on !== undefined, avoiding the double pass over changedPaths and removing the unreachable ?? "skills.*" fallback on the log line.

pnpm tsgo clean. src/gateway/config-reload.test.ts still 47/47 green (existing helper tests + invalidation suite cover both the boolean and the single-pass log path).

@xantorres
Copy link
Copy Markdown
Contributor Author

Linked this PR to the relevant open issues in the description. Mapping for reviewers:

# Issue This PR
#54938 Skills snapshot not invalidated on /restart or gateway restart Closes. Config writes now bump skillsSnapshotVersion; combined with the existing version check in src/agents/agent-command.ts:597-603, sessions rebuild the snapshot on the next turn after a skills.* change.
#62971 工具不存在时进入无限循环 / Tool-not-found infinite loop Closes. Bug 2 in this PR enables the existing unknownToolThreshold stream guard by default (10 consecutive unknown calls → the wrapper rewrites the assistant message to stop retrying). Matches the user's proposed fix in the issue ("超过阈值 … 停止调度").
#49059 Skills snapshot not refreshed for existing sessions after gateway restart Related. The two sub-paths in that issue: Path B (version-check path) is now complete — config changes bump globalVersion, so the snapshot rebuilds. Path A (FS-install without config change) goes through the existing chokidar watcher in src/agents/skills/refresh.ts and is unchanged.
#65346 Skill catalog prompt injection ignores allowBundled / blockedByAllowlist filter Related (probably closes). The snapshot builder (shouldIncludeSkill in src/agents/skills/config.ts:72) already applies allowBundled; the user-visible bug was that existing sessions kept reusing the old snapshot after the allowlist changed. This PR invalidates the cached snapshot, which makes the builder's filter take effect on existing sessions.
#56075 Shared skills should propagate reliably to existing chats Related. Covers the config-driven shared-skills branch of that feature request; the install-driven branch is already handled by the skills watcher.
#41972 Per-agent agents.list[].skills filter not applied in live gateway sessions Related, not fully fixed. The invalidator in this PR only watches skills.* paths. Extending SKILLS_INVALIDATION_PREFIXES (or a more surgical check) to also react to agents.list.*.skills changes is a focused follow-up; happy to open a separate PR if maintainers prefer it split.
#51154 Session hangs at ~198k tokens — infinite tool loop Related. Different trigger (context overflow, not unknown tool). Bug 2 here will break any subsequent loop that routes through an unregistered tool name, but the upstream fix for this issue needs to live in the context-overflow path.

Not listed: I checked for open issues on LM Studio preload / memory guardrail and didn't find any — Bug 4 in this PR appears to be a novel report.

@obviyus
Copy link
Copy Markdown
Contributor

obviyus commented Apr 16, 2026

Follow-up pushed for the TUI watchdog:

  • keep the watchdog bound to the active run only
  • ignore late deltas from an older/timed-out run so they cannot replace the newer run's timer
  • add a regression test covering the stale-run-steals-watchdog case

Commit: 8e87ea539c (fix: keep TUI watchdog bound to active run)

Verification:

  • pnpm test src/tui/tui-event-handlers.test.ts
  • pnpm tsgo
  • repo pnpm check via scripts/committer

@obviyus obviyus self-assigned this Apr 16, 2026
@obviyus obviyus force-pushed the fix/openclaw-stability branch from 8e87ea5 to d4103cc Compare April 16, 2026 12:53
Copy link
Copy Markdown
Contributor

@obviyus obviyus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified the four stability fixes on the latest head: config-driven skills snapshot invalidation, the always-on unknown-tool guard, the TUI streaming watchdog, and LM Studio preload backoff.

Maintainer follow-up: bound the TUI watchdog to the active run only, added the stale-run regression test, and fixed the changelog entries in the active Unreleased block with (#67401) / Thanks @xantorres attribution.

Local gate: pnpm test src/tui/tui-event-handlers.test.ts and pnpm check.

xudaiyanzi pushed a commit to xudaiyanzi/openclaw that referenced this pull request Apr 17, 2026
kvnkho pushed a commit to kvnkho/openclaw that referenced this pull request Apr 17, 2026
Mquarmoc pushed a commit to Mquarmoc/openclaw that referenced this pull request Apr 20, 2026
@xantorres xantorres deleted the fix/openclaw-stability branch April 25, 2026 21:22
lovewanwan pushed a commit to lovewanwan/openclaw that referenced this pull request Apr 28, 2026
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling extensions: lmstudio gateway Gateway runtime size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: 工具不存在时进入无限循环调用无法退出 Skills snapshot not invalidated on /restart or gateway restart

2 participants