fix(reply): block risky smaller-context model switches#44389
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5bc73f537c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Greptile SummaryThis PR introduces a preflight guard ( Key changes:
Issues found:
Confidence Score: 4/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/auto-reply/reply/model-switch-guard.ts
Line: 7-15
Comment:
**`totalTokensFresh` is picked but never consulted**
`resolveRecordedSessionTokens` accepts `Pick<SessionEntry, "totalTokens" | "totalTokensFresh">` but the function body only ever reads `entry?.totalTokens` — `totalTokensFresh` is completely ignored. This means the guard fires on stale token snapshots too, which the PR risk section acknowledges could block a legitimate switch.
The inclusion of `totalTokensFresh` in the `Pick` strongly implies it was meant to be used (e.g. skip the guard when the snapshot is stale and the real usage may have dropped). If that was the intent, the staleness check is missing:
```ts
function resolveRecordedSessionTokens(
entry?: Pick<SessionEntry, "totalTokens" | "totalTokensFresh"> | null,
): number | undefined {
const total = entry?.totalTokens;
if (typeof total !== "number" || !Number.isFinite(total) || total <= 0) {
return undefined;
}
// Only treat the snapshot as actionable when it is fresh.
if (entry?.totalTokensFresh !== true) {
return undefined;
}
return Math.floor(total);
}
```
If the decision is intentionally fail-safe (always block on any recorded total), removing `totalTokensFresh` from the `Pick` would make that intent explicit and avoid misleading future readers. The test suite only covers `totalTokensFresh: true`, so the stale-snapshot behaviour is currently untested either way.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/auto-reply/reply/model-switch-guard.ts
Line: 25-30
Comment:
**Asymmetric `trim()` in same-model short-circuit**
`currentProvider` and `currentModel` are trimmed before comparison, but `targetProvider` and `targetModel` are not. If a target string ever carried leading/trailing whitespace (e.g. from a loosely-parsed directive), the equality check would fail and the guard would proceed unnecessarily. Consider trimming both sides for consistency:
```suggestion
if (
params.currentProvider?.trim() === params.targetProvider?.trim() &&
params.currentModel?.trim() === params.targetModel?.trim()
) {
return undefined;
}
```
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: 5bc73f5 |
70fd0f4 to
5ed57ce
Compare
|
Rebased onto current Local verification on the rebased branch:
The remaining failing check is |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ed57ce9d3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7bf6d2d588
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d8fd0e2b68
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f26229534c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Follow-up on the latest head ( Local verification on this head:
I also ran the broad One GitHub Actions note: the upstream |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 52eaa4d3fd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
52eaa4d to
21a2af3
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5c56ed0f4b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
New head: 85d229a Tracked the failing Windows shard back to Remedy on this head:
Local verification:
Watching the new Actions run now. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 85d229ab3f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
New head: 2ea8c0f The remaining Windows shard failure was still in This follow-up keeps the production fail-closed behavior intact and fixes only the fixture path:
Local verification on this head:
I also rechecked the underlying resolver locally with the seeded shim fixture to confirm the Windows materialized invocation now stays in the direct-entrypoint form ( Watching the new CI run now. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ea8c0f518
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
New head: 6c0b086 Two follow-ups from the latest pass:
Local verification on this head:
I also replied to and resolved the duplicate open bot threads for the cache/config budget case. Watching the fresh CI run now. |
|
New head: c2d6f0f While the previous merge-run was still draining, it exposed one current- Local verification on this head:
Watching the fresh runs for this head now. |
c2d6f0f to
8525784
Compare
|
New head: 8525784 I rebased onto current Local verification on this head:
I also replied to and resolved the remaining outdated duplicate bot thread for the cache/config minimum-window case. The fresh |
|
New head: e947b0f The previous merge-run exposed one remaining Windows-shard regression in I reproduced the failing shard locally and tightened the test to accept both no-shell failure modes while still forbidding any Local verification on this head:
Watching the replacement |
e947b0f to
7b726eb
Compare
|
Follow-up on the Aisle note about bundling a blocked I looked at that path while fixing the mixed-message regression, and I intentionally kept this PR on the atomic side rather than introducing partial-success semantics under the same bug-fix. The concrete regression here was that we could tell the user the I agree there is a separate product/security question about whether privilege-reducing directives should be allowed to apply even when another directive in the same batch fails. I did not fold that policy change into this PR because it would require new acknowledgement semantics for partial success and would change directive-only command behavior beyond the scoped fix for #44303. |
7b726eb to
829b132
Compare
|
New head: 829b132
Local verification on this rebased head:
No new review threads were opened in the meantime; I’m watching the replacement Actions runs on this head now. |
|
One quick note on the latest Aisle analysis: it is pointing at commit I intentionally kept this PR atomic: if a |
829b132 to
d57a18f
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d57a18f2e2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
1e976a2 to
79023aa
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 79023aab39
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
79023aa to
b7f447b
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b7f447bcfc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cbd5849872
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| function resolveEffectiveContextWindowTokens(params: { | ||
| cfg: OpenClawConfig; | ||
| model: string; | ||
| }): number | undefined { | ||
| const cachedWindow = normalizePositiveInt(lookupContextTokens(params.model)); | ||
| const configuredWindow = resolveConfiguredModelContextWindow(params); |
There was a problem hiding this comment.
Use target provider when resolving switch budgets
resolveEffectiveContextWindowTokens only keys off model, so the new guard compares against the bare-model cache / smallest configured duplicate-ID window instead of the selected provider’s runtime limit. In a long session this produces false blocks for safe provider switches, e.g. from a 128k provider to a 200k provider that shares the same model id: this guard will still pick 128k and reject /model, even though the actual execution path resolves the target model with provider + modelId in src/agents/pi-embedded-runner/run.ts:376-403 and applies that provider-specific window in src/agents/context-window-guard.ts:28-47. Fresh evidence on this head is that the preflight still drops targetProvider here, while the runner continues to honor it.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0d7a6a6ee9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (blockedModelSwitchText) { | ||
| return { text: blockedModelSwitchText }; |
There was a problem hiding this comment.
Clear runtime directives when blocking an inline /model
Returning here makes a mixed-message /model failure look final, but the current turn still honors other inline runtime overrides because resolveReplyDirectives has already derived them from the original directives in src/auto-reply/reply/get-reply-directives.ts:370-395 and :510, and runPreparedReply still consumes them in src/auto-reply/reply/get-reply-run.ts:474-479 and :553-563. For example, hello /model openai/gpt-4o /elevated full or /exec host=node will reply Current model unchanged yet run this turn with elevated/exec/think/queue changes anyway. If the new guard is meant to reject the inline /model attempt as a unit, those per-turn overrides need to be cleared before continuing.
Useful? React with 👍 / 👎.
Summary
/modelaccepted switches to smaller-context models even when the recorded session context was already too large for the target model./compactor/new; wired it into both the immediate/modelpath and the later persistence path so inline directives cannot sneak the switch through.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
/model <provider/model>now refuses the switch instead of accepting it and risking an immediate overflow on the next turn./compactor/new.Security Impact (required)
NoNoNoNoNoYes, explain risk + mitigation:Repro + Verification
Environment
anthropic/claude-opus-4-5at 200k andopenai/gpt-4oat 128kagents.defaults.compaction.reserveTokensFloor=20000Steps
/model openai/gpt-4o.Expected
Actual
Evidence
Attach at least one:
Human Verification (required)
What you personally verified (not just CI), and how:
/modelcommand rejection when the target safe budget is already exceeded; inline directive persistence does not write the blocked switch./modelsuccess path still passes existing tests; commands reply-path tests still pass.Review Conversations
Compatibility / Migration
YesNoNoFailure Recovery (if this breaks)
src/auto-reply/reply/directive-handling.impl.ts,src/auto-reply/reply/directive-handling.persist.ts,src/auto-reply/reply/model-switch-guard.tsRisks and Mitigations
/compactor/new, and the PR intentionally does not alter compaction logic itself.