fix(agents): classify expired thinking signatures#88072

Closed
BryanTegomoh wants to merge 1 commit into openclaw:main from BryanTegomoh:bryan/fix-anthropic-thinking-replay-invalid

@BryanTegomoh

Contributor

Summary

Classify Anthropic expired thinking-signature rejections as replay_invalid so the existing recovery path can strip stale thinking blocks and retry.

  • Add the missing Invalid signature in thinking block patterns to the replay-invalid classifier.
  • Keep generic Invalid signature errors out of replay recovery.
  • Add regression coverage for the exact wrapped Anthropic payload shape from the issue.
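
The narrow match described above can be sketched as follows. This is a minimal illustration, not the repo's code: `THINKING_SIGNATURE_RE` and `classifyFailure` are hypothetical names, and the pattern is an assumption about what a match requiring both signature and thinking-block wording might look like. The real `REPLAY_INVALID_RE` in src/agents/embedded-agent-helpers/errors.ts may differ.

```typescript
// Hypothetical stand-in for the repo's classifier; names and pattern are assumptions.
type FailureKind = "replay_invalid" | "unclassified";

// Requires "invalid", then "signature", then "thinking block", in that order.
// Backticks around the keywords are optional, and the connector may be "in" or "on".
const THINKING_SIGNATURE_RE =
  /invalid\s+`?signature`?\s+(?:in|on)\s+`?thinking`?\s+block/i;

function classifyFailure(raw: string): FailureKind {
  return THINKING_SIGNATURE_RE.test(raw) ? "replay_invalid" : "unclassified";
}

// The wrapped Anthropic payload shape quoted in this PR.
const wrapped =
  '{"type":"error","error":{"type":"invalid_request_error","message":"messages.1.content.440: Invalid `signature` in `thinking` block"}}';

console.log(classifyFailure(wrapped));
console.log(classifyFailure("ValidationException: invalid signature on thinking block"));
console.log(classifyFailure("Invalid signature"));
```

Because the pattern needs the thinking-block wording as well as the signature wording, a bare `Invalid signature` message falls through to `unclassified` in this sketch.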

Linked context

Closes #88020

Real behavior proof (required for external PRs)

  • Behavior addressed: Anthropic Invalid signature in thinking block errors are now classified as replay_invalid instead of unclassified, allowing the existing stale-thinking replay recovery path to run.
  • Real environment tested: Local OpenClaw source checkout on macOS, current origin/main base, Node with the repo tsx loader, direct classifier import from src/agents/embedded-agent-helpers/errors.ts.
  • Exact steps or command run after this patch:
node --import tsx - <<'EOF'
import { classifyProviderRuntimeFailureKind } from './src/agents/embedded-agent-helpers/errors.ts';

const payload = '{"type":"error","error":{"type":"invalid_request_error","message":"messages.1.content.440: Invalid `signature` in `thinking` block"}}';
console.log(`expired-thinking-signature=${classifyProviderRuntimeFailureKind(payload)}`);
console.log(`generic-invalid-signature=${classifyProviderRuntimeFailureKind('Invalid signature')}`);
EOF
  • Evidence after fix:
expired-thinking-signature=replay_invalid
generic-invalid-signature=unclassified
  • Observed result after fix: The issue payload enters the replay_invalid path, while a generic invalid-signature message remains unclassified.
  • What was not tested: A live 45-60 minute Anthropic extended-thinking session that waits for provider-side signature expiry.
  • Proof limitations or environment constraints: The live expiry condition is time-dependent and provider-side. This PR verifies the exact post-rejection classifier boundary that gates the existing recovery retry.
  • Before evidence: Issue #88020 shows the same Anthropic payload hard-failing because the classifier did not match it.

Tests and validation

  • node scripts/run-vitest.mjs src/agents/embedded-agent-helpers.isbillingerrormessage.test.ts
  • pnpm exec oxfmt --check --threads=1 src/agents/embedded-agent-helpers/errors.ts src/agents/embedded-agent-helpers.isbillingerrormessage.test.ts
  • node scripts/run-oxlint.mjs --tsconfig config/tsconfig/oxlint.core.json src/agents/embedded-agent-helpers/errors.ts src/agents/embedded-agent-helpers.isbillingerrormessage.test.ts
  • pnpm changed:lanes --json
  • pnpm check:changed
  • git diff --check

Regression coverage added:

  • Wrapped Anthropic invalid_request_error with Invalid signature in thinking block classifies as replay_invalid.
  • ValidationException: invalid signature on thinking block classifies as replay_invalid.
  • Generic Invalid signature does not classify as replay_invalid.

Risk checklist

Did user-visible behavior change? Yes
Did config, environment, or migration behavior change? No
Did security, auth, secrets, network, or tool execution behavior change? No

Highest-risk area: Provider runtime failure classification.
Mitigation: The new match requires both signature and thinking-block language, and the test proves generic invalid-signature errors do not enter replay recovery.

Current review state

Next action: Maintainer review.
Waiting on: CI and any maintainer request for live Anthropic expiry proof.
Bot or reviewer comments addressed: None yet.

@openclaw-barnacle openclaw-barnacle Bot added labels May 29, 2026:

  • agents: Agent runtime and tooling
  • size: XS
  • proof: supplied: External PR includes structured after-fix real behavior proof.

@clawsweeper

clawsweeper Bot commented May 29, 2026

Contributor

Codex review: needs maintainer review before merge. Reviewed May 29, 2026, 12:38 PM ET / 16:38 UTC.

Summary
The PR expands the provider replay-invalid classifier for Anthropic thinking-signature errors and adds regression coverage for the wrapped payload plus a generic-invalid-signature guard.

PR surface: Source 0, Tests +14. Total +14 across 2 files.

Reproducibility: yes. Current main lacks the signature/thinking-block replay-invalid match, and the linked issue provides the exact Anthropic payload that falls through the current classifier.

Review metrics: none identified.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🐚 platinum hermit
Patch quality: 🦞 diamond lobster
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
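
As a small illustration of the "weaker of the two" rule, the sketch below orders the ranks from the legend in this review and picks the lower one. The `RANKS` array and `overallRank` function are hypothetical names for illustration only; how ClawSweeper actually computes the rating is not shown in this PR.

```typescript
// Ranks ordered weakest to strongest, following the legend in this review.
// "off-meta tidepool" is left out because it means the rating does not apply.
const RANKS = [
  "unranked krab",
  "silver shellfish",
  "gold shrimp",
  "platinum hermit",
  "diamond lobster",
  "challenger crab",
] as const;

type Rank = (typeof RANKS)[number];

// Overall readiness is whichever of the two ratings sits lower in the ordering.
function overallRank(proof: Rank, patch: Rank): Rank {
  return RANKS.indexOf(proof) <= RANKS.indexOf(patch) ? proof : patch;
}

console.log(overallRank("platinum hermit", "diamond lobster")); // platinum hermit
```

With the ratings reported here (proof: platinum hermit, patch quality: diamond lobster), the sketch yields platinum hermit, which matches the overall rating above.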

Rank-up moves:

  • none.

Risk before merge

  • [P1] A live 45-60 minute Anthropic expiry session was not rerun; merge confidence comes from the exact reported payload, source path, existing replay sanitization coverage, and the PR's direct classifier proof.

Maintainer options:

  1. Decide the mitigation before merge
    Land the focused classifier and regression-test patch once CI is acceptable; request live Anthropic retry proof only if maintainers need end-to-end expiry assurance.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge

  • No repair lane is needed because the branch already contains the focused code and test change; maintainer review and CI are the remaining path.

Security
Cleared: The diff only changes a local error-classification regex and colocated tests; I found no concrete security or supply-chain concern.

Review details

Best possible solution:

Land the focused classifier and regression-test patch once CI is acceptable; request live Anthropic retry proof only if maintainers need end-to-end expiry assurance.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main lacks the signature/thinking-block replay-invalid match, and the linked issue provides the exact Anthropic payload that falls through the current classifier.

Is this the best way to solve the issue?

Yes. Extending the existing replay-invalid classifier with a narrow signature plus thinking-block match, while preserving the generic Invalid signature negative case, is the smallest maintainable fix path.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against dc7bd4abf556.

Label changes:

  • add P1: The linked bug hard-fails active Anthropic extended-thinking sessions and kills the workflow instead of using the existing replay recovery path.
  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix terminal output from a real source checkout directly exercising the changed classifier boundary, which is the behavior this patch changes.
  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🦞 diamond lobster.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body includes after-fix terminal output from a real source checkout directly exercising the changed classifier boundary, which is the behavior this patch changes.

Label justifications:

  • P1: The linked bug hard-fails active Anthropic extended-thinking sessions and kills the workflow instead of using the existing replay recovery path.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🦞 diamond lobster.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (live_output): The PR body includes after-fix terminal output from a real source checkout directly exercising the changed classifier boundary, which is the behavior this patch changes.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes after-fix terminal output from a real source checkout directly exercising the changed classifier boundary, which is the behavior this patch changes.

Evidence reviewed

PR surface:

Source 0, Tests +14. Total +14 across 2 files.

Area       Files  Added  Removed  Net
---------  -----  -----  -------  ---
Source     1      1      1        0
Tests      1      14     0        +14
Docs       0      0      0        0
Config     0      0      0        0
Generated  0      0      0        0
Other      0      0      0        0
Total      2      15     1        +14

What I checked:

Likely related people:

  • joshavant: Local blame on current main attributes the replay-invalid regex, classifyProviderRuntimeFailureKind path, and stripInvalidThinkingSignatures comment block to the same existing agent-helper source commit; history is shallow/grafted, so this is a routing signal rather than a full authorship trail. (role: current classifier and replay-sanitization area contributor; confidence: medium; commits: ab84c8cc0949; files: src/agents/embedded-agent-helpers/errors.ts, src/agents/embedded-agent-runner/thinking.ts)

What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added labels May 29, 2026:

  • proof: sufficient: ClawSweeper judged the real behavior proof convincing.
  • rating: 🐚 platinum hermit: Good normal PR readiness with ordinary maintainer review expected.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR.
  • P1: High-priority user-facing bug, regression, or broken workflow.

@Takhoffman

Contributor

@clawsweeper visualize

@clawsweeper

clawsweeper Bot commented May 30, 2026

Contributor

🦞👀
ClawSweeper visual brief is being prepared.

I queued a read-only visual pass. It will create or update one marker-backed visual brief comment and will not trigger close, merge, repair, label, or branch changes.

Lens: auto

@clawsweeper

clawsweeper Bot commented May 30, 2026

Contributor

Source: #88072 (comment)
Visual model: gpt-5.5, reasoning low.

Visual brief

Lens: flow
Advisory only: maintainers remain the final judges.

PR: #88072
Linked bug: #88020

Current broken path from issue #88020

Long Anthropic thinking session
        |
        v
Provider rejects old thinking block signature
        |
        v
"Invalid `signature` in `thinking` block"
        |
        v
Classifier: unclassified ❌
        |
        v
No replay cleanup
        |
        v
Session hard-fails 🐛

Proposed path in PR #88072

Same provider rejection
        |
        v
"Invalid `signature` in `thinking` block"
        |
        v
Classifier: replay_invalid ✅
        |
        v
Existing stale-thinking cleanup path
stripInvalidThinkingSignatures(...)
        |
        v
Retry request ✅
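
The proposed path can be sketched end to end as below. Everything here is a hypothetical illustration: the `Message` and `ContentBlock` shapes, `stripThinkingBlocks`, `isReplayInvalid`, and `sendWithReplayRecovery` are assumed names, and the repo's actual `stripInvalidThinkingSignatures(...)` may take different arguments and behave differently. The sketch only shows the control flow: classify the rejection, drop the stale thinking blocks, and retry once.

```typescript
// All types and function names in this sketch are hypothetical.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "thinking"; thinking: string; signature: string };

interface Message {
  role: "user" | "assistant";
  content: ContentBlock[];
}

type SendFn = (messages: Message[]) => Promise<string>;

// Drop thinking blocks so stale signatures are not replayed to the provider.
// Messages left with no content are removed as well (an assumption of this sketch).
function stripThinkingBlocks(messages: Message[]): Message[] {
  return messages
    .map((m) => ({ ...m, content: m.content.filter((b) => b.type !== "thinking") }))
    .filter((m) => m.content.length > 0);
}

// Assumed narrow match: signature wording plus thinking-block wording.
function isReplayInvalid(err: unknown): boolean {
  const text = err instanceof Error ? err.message : String(err);
  return /invalid\s+`?signature`?\s+(?:in|on)\s+`?thinking`?\s+block/i.test(text);
}

// Send once; on a replay-invalid rejection, strip and retry exactly once.
async function sendWithReplayRecovery(send: SendFn, messages: Message[]): Promise<string> {
  try {
    return await send(messages);
  } catch (err) {
    if (!isReplayInvalid(err)) throw err; // unrelated failures still propagate
    return send(stripThinkingBlocks(messages));
  }
}

// Usage with a fake provider that rejects any replayed thinking block.
const history: Message[] = [
  { role: "user", content: [{ type: "text", text: "hello" }] },
  {
    role: "assistant",
    content: [
      { type: "thinking", thinking: "...", signature: "stale" },
      { type: "text", text: "hi" },
    ],
  },
];

let calls = 0;
const fakeSend: SendFn = async (msgs) => {
  calls += 1;
  if (msgs.some((m) => m.content.some((b) => b.type === "thinking"))) {
    throw new Error("messages.1.content.0: Invalid `signature` in `thinking` block");
  }
  return "ok";
};

const recovered = sendWithReplayRecovery(fakeSend, history);
recovered.then((reply) => console.log(reply, calls));
```

Retrying only once keeps the sketch from looping if the provider keeps rejecting the request for some other reason.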

Classifier boundary

Input pattern                                             Result
--------------------------------------------------------  -----------------
Invalid `signature` in `thinking` block                   replay_invalid ✅
ValidationException: invalid signature on thinking block  replay_invalid ✅
Invalid signature                                         unclassified ✅

Changed surface

src/agents/embedded-agent-helpers/errors.ts
  REPLAY_INVALID_RE gains thinking-signature patterns

src/agents/embedded-agent-helpers.isbillingerrormessage.test.ts
  Adds regression coverage for:
  - wrapped Anthropic invalid_request_error payload ✅
  - ValidationException wording ✅
  - generic Invalid signature guard ✅

Proof map

Reported bug proof:
  Real long-session failures, 3 times, provider-side expiry path 🐛

PR proof:
  Exact rejected payload now classifies as replay_invalid ✅
  Generic invalid-signature text does not enter replay recovery ✅

Not proven:
  Live 45-60 minute Anthropic expiry retry after patch ⚠️

Maintainer judgment point 🧑‍⚖️

Benefit:
  Small classifier change may restore existing recovery for a high-impact
  session-state failure.

Risk:
  Replay recovery expands to thinking-signature errors; guard test limits this
  to messages containing signature + thinking/thinking block language.

Decision hinge:
  Is exact payload classifier proof enough, or is live provider-expiry proof
  required before accepting the behavior change?

Legend: ✅ expected/proven; ❌ broken path; ⚠️ unresolved concern; 🐛 confirmed bug path; 🧑‍⚖️ maintainer judgment point.

Maintainer ruling

Benefit: Routes the reported Anthropic expired-thinking-signature rejection into the existing replay recovery path.
Risk: Classifier broadening could over-route provider errors if future wording overlaps, though the generic Invalid signature guard reduces that risk.
Proof needed: Optional live 45-60 minute Anthropic expiry retry proof if maintainers require end-to-end provider validation.
Recommended next action: Maintainer review of the classifier boundary and whether supplied proof is sufficient.
Question presented: Should exact rejected-payload proof be accepted for this XS recovery-path classifier fix?

@Takhoffman

Contributor

@clawsweeper automerge

@clawsweeper

clawsweeper Bot commented May 30, 2026

Contributor

🦞🔧
ClawSweeper saw the passing review, but the PR needs another repair pass before merge.

Source: clawsweeper[bot]
Feedback: No repair lane is needed because the branch already contains the focused code and test change; maintainer review and CI are the remaining path. Cleared: The diff only changes a local error-classification regex and colocated tests; I found no concrete security or supply-chain concern. (sha=794dbaf4dcdc577b3d8e076b27e5fe270b1ea87d) A later maintainer automerge opt-in approves landing the canonical PR. Required checks failed before automerge: check-additional-boundaries-bcd:FAILURE
Action: repair worker queued. Run: https://github.com/openclaw/clawsweeper/actions/runs/26684662589
Model: gpt-5.5

I will update this PR branch, or open a safe credited replacement, if the repair worker finds a narrow CI fix.

Automerge progress:

  • 2026-05-30 13:09:55 UTC review queued 794dbaf4dcdc (queued)
  • 2026-05-29 16:38:44 UTC review passed 794dbaf4dcdc (- No repair lane is needed because the branch already contains the focused code...)

@clawsweeper clawsweeper Bot added the clawsweeper:automerge label (Maintainer opted this PR into bounded ClawSweeper-reviewed automerge) May 30, 2026
@clawsweeper

clawsweeper Bot commented May 30, 2026

Contributor

ClawSweeper 🐠 reef update

Thanks for the work here. ClawSweeper could not write to the source branch, so it opened a replacement PR rather than letting the fix drift. Attribution still points back here.

Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
Replacement PR: #88340
Why close: this run was configured to close the superseded source PR once the credited replacement PR is open, so review continues in one place.
Credit follows the fix over to the replacement PR. No sneaky treasure grab.
Co-author credit kept:

fish notes: model gpt-5.5, reasoning high; reviewed against 57c80d9.

Labels

  • agents: Agent runtime and tooling
  • clawsweeper:automerge: Maintainer opted this PR into bounded ClawSweeper-reviewed automerge
  • P1: High-priority user-facing bug, regression, or broken workflow.
  • proof: sufficient: ClawSweeper judged the real behavior proof convincing.
  • proof: supplied: External PR includes structured after-fix real behavior proof.
  • rating: 🐚 platinum hermit: Good normal PR readiness with ordinary maintainer review expected.
  • size: XS
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: REPLAY_INVALID_RE missing Anthropic 'Invalid signature in thinking block' — hard session failure instead of recovery retry

2 participants