Skip to content

fix: add compaction model fallback#74470

Merged
jalehman merged 7 commits intomainfrom
seal-700c7e2b-compaction-fallback-64960
May 1, 2026
Merged

fix: add compaction model fallback#74470
jalehman merged 7 commits intomainfrom
seal-700c7e2b-compaction-fallback-64960

Conversation

@jalehman
Copy link
Copy Markdown
Contributor

What

Implicit embedded compaction now retries fallback-eligible summarization failures through the active session model fallback chain. Explicit agents.defaults.compaction.model remains exact and does not inherit the session fallback chain.

Why

Azure-hosted compaction can reject summarization prompts with content-filter 400s. Without model fallback, overflow recovery can keep retrying the same rejected model. Closes #64960.

Changes

  • Wrap implicit compaction in model fallback
  • Keep fallback selection ephemeral
  • Preserve explicit compaction model semantics
  • Document fallback behavior
  • Add focused regression tests

Testing

  • pnpm test src/agents/pi-embedded-runner/compact.hooks.test.ts -- --reporter=verbose
  • pnpm test src/auto-reply/reply/agent-runner-utils.test.ts src/agents/command/attempt-execution.cli.test.ts -- --reporter=verbose
  • pnpm exec oxfmt --check --threads=1 <touched files>
  • pnpm check:changed

Note: Testbox pnpm check:changed was attempted per maintainer policy, but the local environment was not authenticated with Blacksmith after installing the CLI.

@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation agents Agent runtime and tooling size: M maintainer Maintainer-authored PR labels Apr 29, 2026
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: needs changes before merge.

What this changes:

This PR wraps implicit embedded compaction in the session model fallback chain, threads per-run fallback override state through agent execution and compaction runtime context, preserves failure metadata, and updates docs, changelog, and focused tests.

Required change before merge:

The remaining blocker is a narrow mechanical PR-branch repair with clear files and tests; a maintainer can promote it despite the protected label if they want automation to update the branch.

Security review:

Security review cleared: The diff is limited to runtime fallback wiring, tests, docs, and changelog; it does not add dependencies, CI execution, package lifecycle hooks, secret handling, or downloaded code paths.

Review findings:

  • [P2] Thread fallback overrides into overflow compaction — src/agents/pi-embedded-runner/run.ts:1290
Review details

Best possible solution:

Carry params.modelFallbacksOverride into the overflow buildEmbeddedCompactionRuntimeContext call and add a runner-level overflow regression that proves an exact user model switch override such as [] is preserved, while leaving explicit agents.defaults.compaction.model behavior exact and ephemeral fallback selection unchanged.

Do we have a high-confidence way to reproduce the issue?

Yes. A source-level reproduction is clear from current main plus the PR diff: resolveEffectiveModelFallbacks returns [] for a user model switch, but latest PR head does not put that override into the overflow runtime context before contextEngine.compact delegates to direct compaction.

Is this the best way to solve the issue?

No. The PR's fallback wrapper, failure metadata preservation, and docs direction are appropriate, but the proposed patch is incomplete until the overflow runner path carries the same override and has runner-level coverage.

Full review comments:

  • [P2] Thread fallback overrides into overflow compaction — src/agents/pi-embedded-runner/run.ts:1290
    The patch only passes params.modelFallbacksOverride into timeout recovery. The separate overflow buildEmbeddedCompactionRuntimeContext call still omits it, so a user model switch that intentionally resolves to an empty fallback override can fall back through configured defaults during overflow compaction. Pass the same field in the overflow context and cover that runner path.
    Confidence: 0.91

Overall correctness: patch is incorrect
Overall confidence: 0.89

Acceptance criteria:

  • pnpm test src/agents/pi-embedded-runner/run.overflow-compaction.test.ts src/agents/pi-embedded-runner/compact.hooks.test.ts -- --reporter=verbose
  • pnpm exec oxfmt --check --threads=1 src/agents/pi-embedded-runner/run.ts src/agents/pi-embedded-runner/run.overflow-compaction.test.ts

What I checked:

  • Protected label: The provided GitHub context lists the maintainer label, so this PR must stay open for explicit maintainer handling regardless of cleanup eligibility. (1c6b3a24e860)
  • Timeout path carries the new override: Latest PR head adds modelFallbacksOverride: params.modelFallbacksOverride to the timeout-recovery buildEmbeddedCompactionRuntimeContext call. (src/agents/pi-embedded-runner/run.ts:1290, 1c6b3a24e860)
  • Overflow path still drops the override: The same latest PR head builds overflowCompactionRuntimeContext without passing modelFallbacksOverride, even though it later forwards that runtime context into contextEngine.compact. (src/agents/pi-embedded-runner/run.ts:1430, 1c6b3a24e860)
  • Runtime context supports the field: The PR adds modelFallbacksOverride to buildEmbeddedCompactionRuntimeContext params and output, so the remaining issue is the overflow caller omission rather than a missing type seam. (src/agents/pi-embedded-runner/compaction-runtime-context.ts:91, 1c6b3a24e860)
  • Exact user model switch semantics matter: Current main returns an explicit empty fallback override for non-auto session model overrides, so losing that field changes behavior from exact model selection to config-derived defaults. (src/agents/agent-scope.ts:217, c6cb7b4801c5)
  • Delegate only forwards runtimeContext fields: The context-engine delegate spreads runtimeContext into compactEmbeddedPiSessionDirect, so a field omitted while building the overflow runtime context is not recovered later. (src/context-engine/delegate.ts:54, c6cb7b4801c5)

Likely related people:

  • steipete: Recent history shows work on compaction failure detail, strict user model switch semantics, embedded run lanes, external credential discovery, and model fallback resolution in the central files affected by this PR. (role: recent maintainer and adjacent owner; confidence: high; commits: 714f3b59cce9, d2320e4d4b42, aec5efed8d43; files: src/agents/pi-embedded-runner/compact.ts, src/agents/pi-embedded-runner/run.ts, src/agents/agent-scope.ts)
  • vincentkoc: Recent commits extracted agent-scope and fallback helper seams and maintained live model-switch fallback behavior, which are directly adjacent to the per-run fallback override contract. (role: adjacent owner; confidence: medium; commits: a372e4a15240, 93ce76afe310, da3977e6818b; files: src/agents/agent-scope.ts, src/agents/model-fallback.ts)
  • pashpashpash: Recent main history includes embedded runner and compaction rotation work in the same run/compaction surfaces, making them a useful routing candidate if the runner test needs review. (role: recent embedded runner maintainer; confidence: medium; commits: 439d8edf68e2, b99540964c05; files: src/agents/pi-embedded-runner/run.ts, src/agents/pi-embedded-runner/compact.ts)

Remaining risk / open question:

  • If merged as-is, overflow recovery can ignore an explicit empty fallback override from a user-selected session model and retry compaction through configured default fallbacks.

Codex review notes: model gpt-5.5, reasoning high; reviewed against c6cb7b4801c5.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR wraps implicit embedded compaction in the session model fallback chain so Azure content-filter 400s can recover by retrying on a different model, while keeping explicit agents.defaults.compaction.model behaviour exact and unchanged. The three new regression tests (session fallback triggers, ephemeral selection, and explicit model bypass) adequately cover the key invariants introduced here.

Confidence Score: 4/5

Safe to merge; logic is correct for the targeted Azure content-filter scenario and the explicit compaction model path is unchanged.

No P0/P1 issues found. The fallback classification path in classifyCompactionFallbackResult relies on the error message retaining the HTTP status code prefix (e.g. '400 …') after being serialised through formatErrorMessage → fail(), which holds for Azure's error format and is validated by the tests. The implementation correctly gates on hasExplicitCompactionModel and hasCompactionModelFallbackCandidates, releases the session write lock before each retry, and never writes the fallback choice back to config.

src/agents/pi-embedded-runner/compact.ts — specifically classifyCompactionFallbackResult and the double call to resolveCompactionFallbacksOverride / resolveEmbeddedCompactionTarget.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-runner/compact.ts
Line: 362-363

Comment:
**Status code lost when classifying serialised failure reasons**

`coerceToFailoverError(new Error(reason), …)` receives only the message string produced by `formatErrorMessage`; any `.status` property on the original exception is discarded by the outer `catch``fail(reason)` path in `compactEmbeddedPiSessionDirectOnce`. Classification therefore falls back to `inferSignalStatus`, which parses a leading numeric prefix from the message (e.g. `"400 The response was filtered…"`). This works for the Azure content-filter error targeted by this PR, but an error whose message does not carry the status prefix (e.g. `new Error("Content policy violation")` with `status: 400`) would produce a `null` failover classification and silently short-circuit as a `classifyResult` success, returning `{ ok: false }` to the caller without attempting any fallback model.

Consider preserving the status code in the serialised reason (e.g. via `resolveCompactionFailureReason` or by also checking `err.status` in `classifyCompactionFallbackResult`) so the gate is not sensitive to provider-specific message formatting.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/agents/pi-embedded-runner/compact.ts
Line: 382-395

Comment:
**`resolveCompactionFallbacksOverride` and `resolveEmbeddedCompactionTarget` called twice**

`resolveCompactionFallbacksOverride` is invoked once inside `hasCompactionModelFallbackCandidates` and again five lines later; `resolveEmbeddedCompactionTarget` is called here to extract `primaryProvider`/`primaryModel` and then called a second time inside every `compactEmbeddedPiSessionDirectOnce` invocation in the fallback loop. Both are pure/side-effect-free reads, so this is harmless, but capturing the results once would remove the redundancy and make the hot path slightly clearer.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: add compaction model fallback" | Re-trigger Greptile

Comment on lines +362 to +363
const failoverError = coerceToFailoverError(new Error(reason), { provider, model });
return failoverError ? { error: failoverError } : null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Status code lost when classifying serialised failure reasons

coerceToFailoverError(new Error(reason), …) receives only the message string produced by formatErrorMessage; any .status property on the original exception is discarded by the outer catchfail(reason) path in compactEmbeddedPiSessionDirectOnce. Classification therefore falls back to inferSignalStatus, which parses a leading numeric prefix from the message (e.g. "400 The response was filtered…"). This works for the Azure content-filter error targeted by this PR, but an error whose message does not carry the status prefix (e.g. new Error("Content policy violation") with status: 400) would produce a null failover classification and silently short-circuit as a classifyResult success, returning { ok: false } to the caller without attempting any fallback model.

Consider preserving the status code in the serialised reason (e.g. via resolveCompactionFailureReason or by also checking err.status in classifyCompactionFallbackResult) so the gate is not sensitive to provider-specific message formatting.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-runner/compact.ts
Line: 362-363

Comment:
**Status code lost when classifying serialised failure reasons**

`coerceToFailoverError(new Error(reason), …)` receives only the message string produced by `formatErrorMessage`; any `.status` property on the original exception is discarded by the outer `catch``fail(reason)` path in `compactEmbeddedPiSessionDirectOnce`. Classification therefore falls back to `inferSignalStatus`, which parses a leading numeric prefix from the message (e.g. `"400 The response was filtered…"`). This works for the Azure content-filter error targeted by this PR, but an error whose message does not carry the status prefix (e.g. `new Error("Content policy violation")` with `status: 400`) would produce a `null` failover classification and silently short-circuit as a `classifyResult` success, returning `{ ok: false }` to the caller without attempting any fallback model.

Consider preserving the status code in the serialised reason (e.g. via `resolveCompactionFailureReason` or by also checking `err.status` in `classifyCompactionFallbackResult`) so the gate is not sensitive to provider-specific message formatting.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +382 to +395
if (hasExplicitCompactionModel(params) || !hasCompactionModelFallbackCandidates(params)) {
return await compactEmbeddedPiSessionDirectOnce(params);
}
const resolvedCompactionTarget = resolveEmbeddedCompactionTarget({
config: params.config,
provider: params.provider,
modelId: params.model,
authProfileId: params.authProfileId,
defaultProvider: DEFAULT_PROVIDER,
defaultModel: DEFAULT_MODEL,
});
const primaryProvider = resolvedCompactionTarget.provider ?? DEFAULT_PROVIDER;
const primaryModel = resolvedCompactionTarget.model ?? DEFAULT_MODEL;
const fallbacksOverride = resolveCompactionFallbacksOverride(params);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 resolveCompactionFallbacksOverride and resolveEmbeddedCompactionTarget called twice

resolveCompactionFallbacksOverride is invoked once inside hasCompactionModelFallbackCandidates and again five lines later; resolveEmbeddedCompactionTarget is called here to extract primaryProvider/primaryModel and then called a second time inside every compactEmbeddedPiSessionDirectOnce invocation in the fallback loop. Both are pure/side-effect-free reads, so this is harmless, but capturing the results once would remove the redundancy and make the hot path slightly clearer.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-runner/compact.ts
Line: 382-395

Comment:
**`resolveCompactionFallbacksOverride` and `resolveEmbeddedCompactionTarget` called twice**

`resolveCompactionFallbacksOverride` is invoked once inside `hasCompactionModelFallbackCandidates` and again five lines later; `resolveEmbeddedCompactionTarget` is called here to extract `primaryProvider`/`primaryModel` and then called a second time inside every `compactEmbeddedPiSessionDirectOnce` invocation in the fallback loop. Both are pure/side-effect-free reads, so this is harmless, but capturing the results once would remove the redundancy and make the hot path slightly clearer.

How can I resolve this? If you propose a fix, please make it concise.

@jalehman jalehman force-pushed the seal-700c7e2b-compaction-fallback-64960 branch from 7f003d4 to 1c6b3a2 Compare May 1, 2026 19:10
@jalehman jalehman merged commit c098846 into main May 1, 2026
102 checks passed
@jalehman jalehman deleted the seal-700c7e2b-compaction-fallback-64960 branch May 1, 2026 19:15
lxe pushed a commit to lxe/openclaw that referenced this pull request May 6, 2026
* fix: add compaction model fallback

* docs: add compaction changelog pr reference

* docs: add compaction changelog author

* docs: satisfy compaction changelog attribution

* fix: preserve compaction fallback metadata

* fix: satisfy compaction fallback lint

* docs: move compaction fallback changelog entry
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
* fix: add compaction model fallback

* docs: add compaction changelog pr reference

* docs: add compaction changelog author

* docs: satisfy compaction changelog attribution

* fix: preserve compaction fallback metadata

* fix: satisfy compaction fallback lint

* docs: move compaction fallback changelog entry
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling docs Improvements or additions to documentation maintainer Maintainer-authored PR size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Compaction fails permanently when Azure content filter blocks summarization — no model fallback

1 participant