fix(stability): session skills snapshot, tool-loop guard, TUI watchdog, LM Studio preload backoff by xantorres · Pull Request #67401 · openclaw/openclaw

xantorres · 2026-04-15T21:28:16Z

Summary

Four stability fixes for issues hit during a single self-hosted OpenClaw + LM Studio debugging session. All are reproducible, low-surface, and include unit tests.

Problem: disabling a bundled skill in config still left the model calling it, producing infinite `Tool X not found` loops until the embedded-run timeout. Root cause: the `skillsSnapshot` persisted in `sessions.json` was never invalidated when `skills.*` config changed.
Problem: the `unknownToolThreshold` stream guard was gated behind `tools.loopDetection.enabled`, which defaults to `false`. The protection against hallucinated / removed tool calls was effectively off in the stock config.
Problem: the TUI `streaming · Xm Ys` indicator never reset when the gateway's `state: "final"` event was lost (WS reconnect, gateway restart, etc.), leaving the TUI stuck indefinitely until killed.
Problem: LM Studio's memory guardrail rejecting `POST /v1/models/load` caused OpenClaw to re-hit the endpoint on every chat request (~every 2s), producing hundreds of WARN log lines per hour without useful retry semantics.
Why it matters: each of these failure modes amplifies any other local-model hiccup into a session-long stuck state that users have to recover manually.
What changed:
- `src/gateway/config-reload.ts`: bump `skillsSnapshotVersion` when a config diff touches `skills.*`, via a new `shouldInvalidateSkillsSnapshotForPaths` helper wired into the single `applySnapshot` code path (covers both watcher writes and in-process `config.apply`).
- `src/agents/pi-embedded-runner/run/attempt.ts`: make `resolveUnknownToolGuardThreshold` always return a positive threshold (default 10) regardless of `tools.loopDetection.enabled`. The guard is a pure safety net with no false-positive surface.
- `src/tui/tui-event-handlers.ts`: add a 30s delta-silence watchdog that resets `activityStatus` to `idle` on timeout and surfaces a short system-log note; exposes `dispose()` + configurable `streamingWatchdogMs` context option.
- `extensions/lmstudio/src/stream.ts`: add per-`(baseUrl, modelKey, contextLength)` cooldown (5s → 10s → 20s → … → 5min cap) after preload failures; during cooldown the wrapper skips preload entirely and runs the inference stream directly (the model is often already loaded via LM Studio's UI). The log line now carries consecutive-failure count and remaining cooldown.
What did NOT change (scope boundary): no schema additions, no new config keys (except the TUI `streamingWatchdogMs` developer option), no changes to the public Plugin SDK, no touches to session persistence format, no changes to the `detectToolCallLoop` / `before-tool-call` dispatcher path.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Root Cause (if applicable)

Skills snapshot staleness. `getSkillsSnapshotVersion` is only bumped on filesystem changes to the skills tree (watched via chokidar in `src/agents/skills/refresh.ts`). Config changes to `skills.allowBundled` / `skills.entries.*` / `skills.profile` go through `config-reload` and `config.apply` without ever bumping the version. Sessions keep reading the cached `skillsSnapshot` from `sessions.json`, advertising skills that no longer exist in the active config.
Unknown-tool guard gated behind opt-in flag. Historical commit introduced `tools.loopDetection.enabled` as a coarse toggle for repetition detectors (genericRepeat / pingPong / pollNoProgress). The unknown-tool guard is a different concern — it only triggers when the model is calling a tool that is objectively not registered in this run — but it was wired through the same gate. Default install effectively had no protection against `Tool X not found` loops.
TUI streaming state. `tui-event-handlers.ts` transitions `activityStatus` to `streaming` on `evt.state === "delta"` and back to `idle` / `error` only on the corresponding `final` / `aborted` / `error` event. The gateway does emit a `final` event, but delivery over the WebSocket is not atomic with the run lifecycle — a connection flap or gateway restart between completion and delivery leaves the indicator pinned. There was no client-side fallback.
LM Studio preload. The wrapper dedupes concurrent preloads via `preloadInFlight`, but on failure it immediately drops the entry in `.finally()`, so the next chat request creates a brand-new preload. Combined with no circuit breaker on the failure, a guardrail rejection (memory pressure, model not loadable) produces one failed preload per chat request for as long as the user stays in the session.

Test Plan

New unit coverage in each affected area (`config-reload.test.ts`, `attempt.test.ts`, `tui-event-handlers.test.ts`, `extensions/lmstudio/src/stream.test.ts`).
`pnpm test ` → 189 / 189 passing.
`pnpm tsgo` clean.
`pnpm check` clean (0 warnings, 0 errors across oxlint / import-cycle / madge / custom webhook / pairing-scope guards).

Scoped tests cover: skills-version bump wiring + path-prefix helper, threshold override / default / invalid-input / fractional-floor for the unknown-tool guard, watchdog arm / disarm / rearm / dispose / zero-disabled cases, preload cooldown skip + expiry-based retry.

Reviewer notes

`bumpSkillsSnapshotVersion` invalidates globally rather than per-session; that matches existing callers from `src/infra/skills-remote.ts` and is the desired semantic here. A config change is intended to affect every agent session under the workspace.
I considered adding a dispatcher-side guard in `tool-loop-detection.ts` too, but the observed loops go through the stream-wrapper path before a tool is ever dispatched (the model's output is not a call to a real tool), so that would not have helped.
The LM Studio cooldown resets on any successful preload, so healthy sessions are unaffected.
The TUI watchdog uses `.unref()` on its timer to avoid blocking process exit, and `dispose()` is exposed for callers that want to cancel it deterministically.
No changes to `AGENTS.md` / `CLAUDE.md` / `docs/` since the behavior changes are implementation-level and covered by the existing skills / streaming / LM Studio docs. Happy to add notes if reviewers prefer.

greptile-apps · 2026-04-15T21:38:30Z

Greptile Summary

Four focused stability fixes for self-hosted LM Studio sessions: skills-snapshot staleness on config change, unknown-tool loop guard gated behind an opt-in flag, TUI streaming indicator getting stuck after lost final events, and LM Studio's preload endpoint being hammered on every request after a failure. Each fix is narrow, well-justified, and accompanied by unit tests covering the key edge cases (arm/disarm/rearm/dispose for the watchdog, backoff skip/expiry for the preload cooldown, threshold default/override/disabled for the tool guard, and path-prefix matching for the snapshot invalidation).

Confidence Score: 5/5

Safe to merge — all remaining findings are P2 style suggestions that do not affect correctness.

All four fixes address real, reproducible failure modes with no false-positive surface. Tests are comprehensive (189/189 passing per the PR), type-check and lint pass, and no public SDK or session-format contracts were changed. The only finding is a minor code-duplication/unreachable-fallback in config-reload.ts.

No files require special attention.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/gateway/config-reload.ts
Line: 43-55

Comment:
**Duplicate prefix-matching logic + unreachable fallback**

`firstSkillsChangedPath` replicates the exact predicate from `shouldInvalidateSkillsSnapshotForPaths`, so when the outer `if` is true the inner call is guaranteed to return a defined value — the `?? "skills.*"` on the log line is dead code. A single helper that returns the first matching path (or `undefined`) would remove the duplication and make the guarantee explicit:

```ts
function firstSkillsChangedPath(paths: string[]): string | undefined {
  return paths.find((p) =>
    SKILLS_INVALIDATION_PREFIXES.some(
      (prefix) => p === prefix || p.startsWith(`${prefix}.`),
    ),
  );
}

// replace shouldInvalidateSkillsSnapshotForPaths:
export function shouldInvalidateSkillsSnapshotForPaths(paths: string[]): boolean {
  return firstSkillsChangedPath(paths) !== undefined;
}
```

With this shape the `?? "skills.*"` fallback inside `applySnapshot` is no longer needed (and can be removed or kept as a defensive note, but at least it's clearly intentional).

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "Extensions/lmstudio: back off inference ..." | Re-trigger Greptile}

xantorres · 2026-04-15T21:41:54Z

The parity-gate CI job is red on this PR with:

Cannot find module '/home/runner/_work/openclaw/plugin-sdk/qa-lab.js'
imported from /home/runner/_work/openclaw/openclaw/dist/subcli-descriptors-DF5zCXSz.js

I reproduced this on a pristine upstream/main worktree (no patches from this PR) with:

OPENCLAW_BUILD_PRIVATE_QA=1 pnpm build
OPENCLAW_BUILD_PRIVATE_QA=1 pnpm openclaw qa suite \
  --provider-mode mock-openai --parity-pack agentic \
  --model openai/gpt-5.4 --alt-model openai/gpt-5.4-alt \
  --output-dir /tmp/qa-main-test

Same ERR_MODULE_NOT_FOUND resolving ../../plugin-sdk/qa-lab.js.

Root cause is the obfuscated dynamic specifier in src/cli/program/private-qa-cli.ts:5-8:

export function loadPrivateQaCliModule(): Promise<Record<string, unknown>> {
  const specifier = ["../../plugin-sdk/", "qa", "-lab.js"].join("");
  return import(specifier) as Promise<Record<string, unknown>>;
}

../../plugin-sdk/qa-lab.js is resolved against the importer's file URL at runtime. In source that is src/cli/program/private-qa-cli.ts, so it lands on src/plugin-sdk/qa-lab.js — correct. After tsdown bundles private-qa-cli.ts into dist/subcli-descriptors-<hash>.js, the same literal ../../plugin-sdk/qa-lab.js now walks up two directories from dist/ → parent of the repo root → the file Node is looking for at /home/runner/_work/openclaw/plugin-sdk/qa-lab.js. The built file lives at dist/plugin-sdk/qa-lab.js, so from the bundled emitter the correct specifier is ./plugin-sdk/qa-lab.js.

The obfuscation ([…].join("")) defeats tsdown's static-import rewriting, so this worked in source-layout dev but fails once the bundled dist/ is what Node actually loads.

Happy to open a follow-up PR that resolves the specifier against the bundled output (either drop the obfuscation and let tsdown rewrite it, or pin the target via new URL("./plugin-sdk/qa-lab.js", import.meta.url) inside private-qa-cli.ts). I didn't include that fix here because it's out of scope for the four stability issues this PR targets, and because the private QA dist is a separate build surface that maintainers own.

All other checks on the PR (check, check-additional, check-docs, build-artifacts, node/bundled shard suites, lmstudio extension-fast, etc.) are green. Happy to rebase if a fix for parity-gate lands on main.

xantorres · 2026-04-15T21:43:39Z

Addressed Greptile's P2 on src/gateway/config-reload.ts in be903c4f37:

Extracted the prefix predicate into matchesSkillsInvalidationPrefix.
shouldInvalidateSkillsSnapshotForPaths now delegates to firstSkillsChangedPath(paths) !== undefined, so the two helpers share one implementation.
applySnapshot now reads firstSkillsChangedPath(changedPaths) once and branches on !== undefined, avoiding the double pass over changedPaths and removing the unreachable ?? "skills.*" fallback on the log line.

pnpm tsgo clean. src/gateway/config-reload.test.ts still 47/47 green (existing helper tests + invalidation suite cover both the boolean and the single-pass log path).

xantorres · 2026-04-15T21:48:03Z

Linked this PR to the relevant open issues in the description. Mapping for reviewers:

#	Issue	This PR
#54938	Skills snapshot not invalidated on /restart or gateway restart	Closes. Config writes now bump `skillsSnapshotVersion`; combined with the existing version check in `src/agents/agent-command.ts:597-603`, sessions rebuild the snapshot on the next turn after a `skills.*` change.
#62971	工具不存在时进入无限循环 / Tool-not-found infinite loop	Closes. Bug 2 in this PR enables the existing `unknownToolThreshold` stream guard by default (10 consecutive unknown calls → the wrapper rewrites the assistant message to stop retrying). Matches the user's proposed fix in the issue ("超过阈值 … 停止调度").
#49059	Skills snapshot not refreshed for existing sessions after gateway restart	Related. The two sub-paths in that issue: Path B (version-check path) is now complete — config changes bump `globalVersion`, so the snapshot rebuilds. Path A (FS-install without config change) goes through the existing chokidar watcher in `src/agents/skills/refresh.ts` and is unchanged.
#65346	Skill catalog prompt injection ignores allowBundled / blockedByAllowlist filter	Related (probably closes). The snapshot builder (`shouldIncludeSkill` in `src/agents/skills/config.ts:72`) already applies `allowBundled`; the user-visible bug was that existing sessions kept reusing the old snapshot after the allowlist changed. This PR invalidates the cached snapshot, which makes the builder's filter take effect on existing sessions.
#56075	Shared skills should propagate reliably to existing chats	Related. Covers the config-driven shared-skills branch of that feature request; the install-driven branch is already handled by the skills watcher.
#41972	Per-agent `agents.list[].skills` filter not applied in live gateway sessions	Related, not fully fixed. The invalidator in this PR only watches `skills.` paths. Extending `SKILLS_INVALIDATION_PREFIXES` (or a more surgical check) to also react to `agents.list..skills` changes is a focused follow-up; happy to open a separate PR if maintainers prefer it split.
#51154	Session hangs at ~198k tokens — infinite tool loop	Related. Different trigger (context overflow, not unknown tool). Bug 2 here will break any subsequent loop that routes through an unregistered tool name, but the upstream fix for this issue needs to live in the context-overflow path.

Not listed: I checked for open issues on LM Studio preload / memory guardrail and didn't find any — Bug 4 in this PR appears to be a novel report.

obviyus · 2026-04-16T12:41:48Z

Follow-up pushed for the TUI watchdog:

keep the watchdog bound to the active run only
ignore late deltas from an older/timed-out run so they cannot replace the newer run's timer
add a regression test covering the stale-run-steals-watchdog case

Commit: 8e87ea539c (fix: keep TUI watchdog bound to active run)

Verification:

pnpm test src/tui/tui-event-handlers.test.ts
pnpm tsgo
repo pnpm check via scripts/committer

obviyus

Verified the four stability fixes on the latest head: config-driven skills snapshot invalidation, the always-on unknown-tool guard, the TUI streaming watchdog, and LM Studio preload backoff.

Maintainer follow-up: bound the TUI watchdog to the active run only, added the stale-run regression test, and fixed the changelog entries in the active Unreleased block with (#67401) / Thanks @xantorres attribution.

Local gate: pnpm test src/tui/tui-event-handlers.test.ts and pnpm check.

…delta silence

…lures

…xantorres

…xantorres)

…xantorres

…xantorres)

…xantorres

…xantorres)

…xantorres

…xantorres)

…xantorres

…xantorres)

…xantorres

…xantorres)

openclaw-barnacle Bot added gateway Gateway runtime agents Agent runtime and tooling extensions: lmstudio size: L labels Apr 15, 2026

greptile-apps Bot reviewed Apr 15, 2026

View reviewed changes

Comment thread src/gateway/config-reload.ts

xantorres force-pushed the fix/openclaw-stability branch 4 times, most recently from a037a22 to bbbd16d Compare April 16, 2026 11:53

obviyus self-assigned this Apr 16, 2026

obviyus force-pushed the fix/openclaw-stability branch from 8e87ea5 to d4103cc Compare April 16, 2026 12:53

obviyus approved these changes Apr 16, 2026

View reviewed changes

xantorres added 5 commits April 16, 2026 18:27

Gateway/skills: invalidate session skills snapshot on config write

de31b7b

Agents/tool-loop: enable unknown-tool stream guard by default

862a3d6

TUI/streaming: add watchdog that resets the activity indicator after …

55b8f63

…delta silence

Extensions/lmstudio: back off inference preload after consecutive fai…

efb33bf

…lures

Gateway/skills: dedupe skills prefix-match + drop dead fallback on log

3e0042b

xudaiyanzi pushed a commit to xudaiyanzi/openclaw that referenced this pull request Apr 17, 2026

fix: keep TUI watchdog bound to active run (openclaw#67401) (thanks @…

3ea7710

…xantorres)

kvnkho pushed a commit to kvnkho/openclaw that referenced this pull request Apr 17, 2026

fix: keep TUI watchdog bound to active run (openclaw#67401) (thanks @…

a143325

…xantorres)

Mquarmoc pushed a commit to Mquarmoc/openclaw that referenced this pull request Apr 20, 2026

fix: keep TUI watchdog bound to active run (openclaw#67401) (thanks @…

1dff86f

…xantorres)

jason-zimmer mentioned this pull request Apr 23, 2026

fix(tui): reset activity after late terminal events #66192

Closed

xantorres deleted the fix/openclaw-stability branch April 25, 2026 21:22

This was referenced Apr 26, 2026

fix:(tui) reset stale streaming state after event gaps #62602

Closed

fix(tui): recover activity status when no runs are in flight #64862

Closed

TUI streaming indicator stays active long after response content finishes #67052

Open

lovewanwan pushed a commit to lovewanwan/openclaw that referenced this pull request Apr 28, 2026

fix: keep TUI watchdog bound to active run (openclaw#67401) (thanks @…

70e79e1

…xantorres)

amoghasgekar mentioned this pull request May 2, 2026

perf(sessions): stop persisting skillsSnapshot.resolvedSkills #75960

Merged

8 tasks

ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026

fix: keep TUI watchdog bound to active run (openclaw#67401) (thanks @…

a6a123d

…xantorres)

github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026

fix: keep TUI watchdog bound to active run (openclaw#67401) (thanks @…

25a4ce3

…xantorres)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(stability): session skills snapshot, tool-loop guard, TUI watchdog, LM Studio preload backoff#67401

fix(stability): session skills snapshot, tool-loop guard, TUI watchdog, LM Studio preload backoff#67401
obviyus merged 6 commits intoopenclaw:mainfrom
xantorres:fix/openclaw-stability

xantorres commented Apr 15, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Apr 15, 2026

Uh oh!

Uh oh!

xantorres commented Apr 15, 2026

Uh oh!

xantorres commented Apr 15, 2026

Uh oh!

xantorres commented Apr 15, 2026

Uh oh!

obviyus commented Apr 16, 2026

Uh oh!

obviyus left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

xantorres commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Root Cause (if applicable)

Test Plan

Reviewer notes

Uh oh!

greptile-apps Bot commented Apr 15, 2026

Greptile Summary

Confidence Score: 5/5

Uh oh!

Uh oh!

xantorres commented Apr 15, 2026

Uh oh!

xantorres commented Apr 15, 2026

Uh oh!

xantorres commented Apr 15, 2026

Uh oh!

obviyus commented Apr 16, 2026

Uh oh!

obviyus left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

xantorres commented Apr 15, 2026 •

edited

Loading