Studio: Anthropic fast_mode toggle and streaming refusal handling#5715
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for Anthropic's fast-mode beta for Claude Opus 4.6 and 4.7, including backend request handling for the speed parameter and beta headers, frontend UI controls, and improved feedback for streaming refusals. Feedback was provided regarding the implementation of model support checks; the reviewer noted that model discovery should be handled dynamically via the Anthropic models endpoint rather than using a static list of model prefixes to maintain consistency with repository standards.
| fast_mode_active = bool(fast_mode) and _anthropic_supports_fast_mode(model) | ||
| if fast_mode_active: | ||
| body["speed"] = "fast" |
There was a problem hiding this comment.
The logic for activating fast_mode correctly implements the 'silent drop' policy for unsupported models. By checking both the user toggle and model support before modifying the request body, it prevents potential 400 errors from the upstream API while maintaining a clean user experience. Please verify that model support is determined dynamically via the Anthropic /v1/models endpoint rather than a static list, to ensure consistency with repository standards for model discovery.
References
- Anthropic provides a /v1/models endpoint for model discovery; do not implement special cases or static lists for Anthropic under the assumption that the standard models endpoint is unsupported.
Fast mode (beta `fast-mode-2026-02-01`) lets Claude Opus 4.6 and 4.7 generate output tokens up to 2.5x faster at 6x standard Opus pricing. The toggle lives in Configuration → Provider when the selected Anthropic model is Opus 4.6 or 4.7 and is otherwise hidden. Backend gates the same prefixes a second time so a stale frontend cannot make Anthropic 400 the request, and the `fast-mode-2026-02-01` beta header is merged onto whatever other betas the request already needed (code-execution, compaction). Streaming refusals (`message_delta.delta.stop_reason="refusal"` on Claude 4 models) now surface a short user-facing notice in the assistant message before the translated OpenAI chunk emits the existing `finish_reason="content_filter"`. Previously the chat bubble truncated silently because the SSE stopped mid-stream with no visible explanation. Per the upstream docs the conversation must be reset before continuing, so the notice tells the user exactly that. Reference: - https://platform.claude.com/docs/en/build-with-claude/fast-mode - https://platform.claude.com/docs/en/test-and-evaluate/strengthen-guardrails/handle-streaming-refusals Tests: - studio/backend/tests/test_anthropic_fast_mode_and_refusal.py (8 cases pinning fast_mode pass-through on 4.6/4.7, silent drop on Sonnet / Haiku / older Opus / None / False, and the refusal notice + finish reason on a synthetic refusal stream).
Anthropic's streaming-refusal guidance says the refused assistant turn must be removed or updated before the next call -- otherwise the safety classifier keeps refusing. The PR only added a user-visible notice; the partial assistant output (plus the notice itself) still rode the next request via toOpenAIMessage. Tag the refusal turn with an HTML-comment sentinel emitted alongside the notice. The chat-adapter checks for that sentinel in toOpenAIMessage and returns null, so the refused turn is excluded from outboundMessages. The notice still renders in the transcript (HTML comments don't display), so users keep the explanation.
test_refusal_maps_to_content_filter expects only ['content_filter'] in the finish_reasons list, but the post-PR refusal path emits a user-visible content notice chunk first. Every _content_chunk carries 'finish_reason: None' by construction; the helper was appending those, so the assertion saw [None, 'content_filter'] instead of ['content_filter']. None is not a finish reason -- it's just mid-stream delta noise. Skip None values in _finish_reasons so the helper reflects what the test names actually claim to check. Same fix applies cleanly to the other helper usages (pause_turn test expects [] and the sibling stop test expects ['stop'], both unaffected).
c362bbd to
aae360f
Compare
Adds 19 cases on top of the 9 in test_anthropic_fast_mode_and_refusal. The base file pins the happy path; this file fills in the cliffs: * Dated-snapshot prefix matching: claude-opus-4-7-2026-02-01 and claude-opus-4-6-2026-02-01 still gate fast_mode through, while claude-opus-4-5-2025-08-01 and claude-sonnet-4-6-2026-02-01 do not. * Strict opt-in: a future claude-opus-4-8 or claude-opus-5 does NOT auto-enable fast_mode -- the prefix tuple must be bumped explicitly when a new family is whitelisted upstream. * Beta-header merge: fast_mode coexists with code-execution-2025-08-25 and compact-2026-01-12 in one comma-separated anthropic-beta header with no duplicates and no truncation. Pins the value to the exact fast-mode-2026-02-01 docs token so a typo would fail CI. * Non-destruction: fast_mode=None produces byte-identical outbound body and headers to the version that omits the argument entirely. Same for fast_mode=False. Guarantees the upgrade path is non-breaking on existing Anthropic streams. * Refusal stream ordering: the user-visible notice precedes the finish_reason chunk so a streaming UI paints text before flipping to content_filter. Refusal sentinel emitted exactly once. Notice rides a normal content delta chunk with finish_reason still null. Partial assistant deltas survive before the notice. * Provider-side refusal coverage: a refusal on Sonnet (not just Opus) still emits the notice + sentinel + content_filter mapping, since refusal handling is not gated on fast-mode capability.
for more information, see https://pre-commit.ci
Two follow-ups on #5715: 1) sanitizeInferenceParams stripped fastMode. fastMode is in PERSISTED_INFERENCE_PARAM_KEYS but the storage sanitizer only kept numeric fields plus systemPrompt and trustRemoteCode, so the new toggle was silently dropped on reload and on the /api/chat/settings round-trip. Save it the same way trustRemoteCode is saved. 2) Refusal recovery now also drops the triggering user turn. Returning null from toOpenAIMessage on the assistant side left the user prompt that caused the refusal in the outbound history, so the very next request would re-trigger the same classifier. Anthropic's refusal-handling guidance is explicit on this: remove the refused turn AND the user message that triggered it before the next call. Implemented via a pre-pass that pops the trailing user message when an assistant carries the refusal sentinel. Typecheck clean.
…ixes The text sentinel for the Anthropic refusal drop signal was spoofable: any assistant message containing the literal <!--studio:anthropic-refusal--> would prune the prior user + assistant pair on the next request. Move the signal onto a separate _toolEvent chunk that the chat adapter latches into assistant.metadata.custom.anthropicRefusal; assistant text can no longer control the pruner. Tighten the fast-mode model gate (backend + frontend) to require a "-" family boundary so claude-opus-4-70 / claude-opus-4-7b style IDs do not get speed: "fast" on a naive startswith match. Use survivingMessages for the image / audio attachment scan so a refused user turn does not gate or mis-attribute the next non-refused turn. Propagate Anthropic usage.speed onto the OpenAI-style usage chunk and apply the documented 6x fast-mode multiplier in the cost calculator (stacks with prompt-cache multipliers per the docs); expose the new multiplier on the pricing snapshot for the UI tooltip. Tests cover the tool-event chunk shape, the prefix-collision rejects, usage.speed propagation, the 6x pricing math, and that the visible refusal text carries no embedded sentinel.
|
Round 3 audit (10-parallel reviewer + manual cross-check against Anthropic docs). Verified the round-2 surface against Pushed
Test deltas:
Reviewer ranked P3 cosmetic findings (async test cleanup warnings on Python 3.13) intentionally left out of this round; tests pass, warnings are unraisable post-collection and do not affect CI. |
for more information, see https://pre-commit.ci
|
Round 4 cross-cutting fix: merged origin/main into this branch (no conflicts) to bring in PR #5735 (orphan tool_call XML strip widening + 263-line test_tool_xml_strip.py). All 8 PRs in this audit cohort had been forked off a pre-#5735 main, so a squash-merge of any of them would have silently reverted the widened _TOOL_XML_RE regex and deleted the dedicated test file. Verified: diff against origin/main now shows zero unintended changes to routes/inference.py and test_tool_xml_strip.py outside the actual PR scope. |
The fast_mode 6x multiplier landed in two places at once -- here (f66df7b) and on unslothai#5715 (4f1afdb) -- since both audits ran in parallel. Drop the duplicate from this branch so the change lives in its natural home (unslothai#5715, which introduces fast_mode itself); this PR stays focused on the cache-read fallback + 1h breakdown.
…hat-style usage keys) (unslothai#5722) * Studio: longest-prefix pricing match + accept chat-style usage keys Two P1 / High follow-ups from PR 5690 review feedback: 1. Pricing prefix lookup returned the first key it iterated, so dated snapshots like ``gpt-5.4-mini-2026-04-23`` collided with the shorter ``gpt-5.4`` entry and overbilled by 3x+. Sort the table keys longest-first so the most specific entry wins. 2. ``calculate_cost`` only read ``input_tokens`` / ``output_tokens``, but Studio's OpenAI-Chat-style usage envelope re-emits ``prompt_tokens`` / ``completion_tokens`` (the OpenAI Chat Completions vocabulary). Callers handing in the chat-style shape silently got a zeroed bill. Accept either pair so the calculator works against both raw upstream usage and the Studio-translated envelope. Tests (4 new in test_pricing.py): dated mini/pro snapshots inherit the right rate; chat-style usage keys price correctly; raw key wins when both shapes are present. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: dedupe cache buckets when costing chat-style Anthropic usage When the caller hands in Studio's chat-style envelope (``prompt_tokens`` emitted by ``_build_usage_chunk``) for Anthropic, that value already folds ``cache_creation_input_tokens`` + ``cache_read_input_tokens`` into the total. The previous follow-up accepted the chat-style key but then re-added both cache buckets in ``billable_input_tokens`` and ``input_usd``, double-counting cache tokens on every Anthropic chat-style call. Detect which envelope landed (``input_tokens`` present = raw upstream; absent + ``prompt_tokens`` present = Studio chat-style) and peel the cache buckets off for Anthropic before the downstream math so both envelopes produce identical costs. OpenAI: ``input_tokens`` and Studio's ``prompt_tokens`` both already include ``cache_read`` and exclude any notional ``cache_creation``, so the OpenAI path stays a straight passthrough. Tests (2 new): both envelopes match for Anthropic on a triple (uncached + cache_creation + cache_read); OpenAI envelopes match on a cached-tokens fixture. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: prefer raw output_tokens over chat-style completion_tokens Codex flagged that the previous fallback chain 'usage.get("output_tokens") or usage.get("completion_tokens")' treats an explicit 0 as missing -- a mixed-envelope payload where 'output_tokens' is 0 but 'completion_tokens' is non-zero (or stale) bills the wrong amount. Mirror the has_input_tokens precedence pattern: when the raw key is present we use it even at 0; otherwise fall back to completion_tokens. * Studio: read OpenAI cached tokens from prompt_tokens_details too Codex flagged that the chat-style OpenAI envelope Studio re-emits via _build_usage_chunk surfaces cached prompt tokens under prompt_tokens_details.cached_tokens, not input_tokens_details. The OpenAI branch only checked input_tokens_details, so a cache-heavy chat-style turn billed every cached token at the full input rate instead of the 0.1x cache_read discount. Walk both keys when discovering the cached count. New regression test pins that the two envelopes price identically for a turn with 80k of 100k tokens cached. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: tighten pricing prefix match + clamp corrupt usage Three follow-ups on the longest-prefix pricing match landed in this PR: - Prefix match now requires a dash boundary or end-of-string. The longest-key sort alone still falsely landed "claude-opus-4-15" on the "claude-opus-4-1" row, and "gpt-5.5-prod" on the "gpt-5.5-pro" row (a 6x overcharge). Demanding the next character be "-" rules out the lookalikes while keeping dated snapshots ("gpt-5.4-mini-2026-04-23", "claude-opus-4-7-20260414") landing on their canonical row. - Clamp every token count to >= 0. A corrupted upstream payload (negative cached count, off-by-one in a fixture) could previously produce a negative bill that masked real spend in the session total tooltip. - Tolerate a non-dict "cache_creation" (e.g. an upstream proxy folded the field down to a single int). The current code raised AttributeError mid-turn; now it falls back to the 5m-default bucket so the rest of the cost calculation still runs. Adds tests/test_pricing_edge.py with 20 adversarial cases covering the boundary check, negative / None / zero token values across both envelopes, cache_read > prompt corruption, the OpenAI long-context threshold crossover on cache-inflated billable input, malformed sub-objects, and unknown-provider degradation. Combined suite is 51 tests, all green. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Surface Anthropic cache-read fallback and forward 1h breakdown Two correctness gaps surfaced on the chat-style usage envelope: 1) Anthropic cache_read fell through to "uncached input" pricing when the envelope arrived without the native ``cache_read_input_tokens`` key (e.g. via a proxy that only emits the mirrored ``prompt_tokens_details.cached_tokens`` block). Studio's canonical ``_build_usage_chunk`` always sets both so production traffic was never affected, but the calculator should accept either as a defense-in-depth measure. Add a fallback to read the mirrored field when the native one is missing or zero; the native key still wins when both are present so the math stays deterministic. 2) ``_build_usage_chunk`` dropped the ``cache_creation`` 5m / 1h breakdown. Downstream ``calculate_cost`` then could not apply the 2x 1h premium and silently fell back to the 5m default, underbilling 1h cache writes by 2x on chat-style traffic. Forward the breakdown verbatim when the upstream usage carries it. Tests grow by 4 (20 -> 24): two for the prompt_tokens_details fallback (with native-precedence pin), one for the chunk shape, one for the end-to-end pricing parity check at 1h. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add Anthropic fast_mode pricing multiplier PR 5715 wires the fast-mode-2026-02-01 beta header + speed:"fast" field through to Anthropic, but the cost calculator never learnt about the matching 6x premium documented at https://platform.claude.com/docs/en/build-with-claude/fast-mode (Opus 4.7 standard $5/$25 per MTok, fast $30/$150). This adds: - ANTHROPIC_FAST_MODE_MULT = 6.0 constant. - calculate_cost(..., fast_mode=True) applies the 6x to base input AND output rates before any cache multipliers (cache mults stack on top of fast per Anthropic docs). - Provider+model gate: silently no-op on every model that is not claude-opus-4-6 / claude-opus-4-7 so a stray fast_mode=True on Sonnet/Haiku can never over-charge. - model_priced label tagged "(fast)" so the cost tooltip can surface which rate fired. - pricing_snapshot now exposes fast_mode_mult so the frontend cost panel doesn't have to hard-code 6. 7 new edge tests pin the math; existing 55 still pass. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Honor explicit zero cache_read_input_tokens on Anthropic envelopes The previous follow-up fell back to ``prompt_tokens_details.cached_tokens`` whenever the native ``cache_read_input_tokens`` was missing OR equal to 0, even though the commit message stated the native key always wins when present. A proxy that forwards a stale ``prompt_tokens_details`` block alongside an authoritative ``cache_read_input_tokens: 0`` would then inflate cache_read past the real native count, posting a false cache_read line and bumping billable_input_tokens. Switch the gate to native-key presence so an explicit zero stays authoritative; the mirror only kicks in when the native key is absent. Add a regression test pinning the explicit-zero precedence. * Move fast_mode pricing back to unslothai#5715 The fast_mode 6x multiplier landed in two places at once -- here (f66df7b) and on unslothai#5715 (4f1afdb) -- since both audits ran in parallel. Drop the duplicate from this branch so the change lives in its natural home (unslothai#5715, which introduces fast_mode itself); this PR stays focused on the cache-read fallback + 1h breakdown. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Shorten pricing comments for PR unslothai#5722 --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Main moved forward 17 commits during PR review (latest: 953c8bf). Real conflicts in five files; resolved by combining both branches' changes. studio/backend/core/inference/external_provider.py - Add fast_mode (Anthropic Opus 4.6/4.7 speed flag, unslothai#5715) to stream_chat_completion and Anthropic-branch call site, alongside existing Gemini tools/tool_choice forwarding. - Add _openai_image_generation_tool() helper (action:"edit" for follow- up image edits, unslothai#5712) and use it inside the existing _responses_hosted_builtins_allowed gate so the forced-function / tool_choice="none" suppression added in rounds 21+ still applies. - Keep Anthropic web_fetch gated on _anthropic_hosted_builtins_allowed (round 19+ hosted-builtin gate) while taking main's per-model version selector (web_fetch_20260209 vs _20250910). studio/backend/routes/inference.py - Add `openai = provider_type == "openai"` (used by main's reasoning content forwarding for follow-up image edits). - Keep the round 25/26 Gemini filter chain (_filter_tool_calls drops synthetic server-builtin cards, marks tc_id so the matching role="tool" follow-up gets skipped, extra_content gated to native Gemini host). - Forward fast_mode alongside tools/tool_choice. studio/backend/tests/test_openai_image_generation.py - Combine assertions: both _server_tool: True (PR) and openai_image_generation_call_id (main) are present on the tool_start arguments. studio/frontend/src/features/chat/shared-composer.tsx - Add supportsBuiltinWebFetch declaration (separate Fetch pill from unslothai#5742) before the PR's isExternalGemini constant so both the Gemini image-tier gating and the standalone Anthropic Fetch pill compile. studio/frontend/src/features/chat/api/chat-adapter.ts - Add main's normalizeOpenAIReasoningItem, toOpenAIImageEditReferenceMessage, isAnthropicRefusalMessage helpers alongside PR's collectAssistantToolCalls, collectToolResultMessages, SerializedMessage, collectAssistantTextThoughtSignature. - toOpenAIMessages (PR) now also early-returns on isAnthropicRefusalMessage so refused turns get pruned from outbound history. - Add a thin toOpenAIMessage (singular) wrapper for the OpenAI image- edit replay path's flat .map() usage. - Merge per-turn enable flags: keep PR's imageGenerationEnabledForThisTurn, geminiImageModeForThisTurn, codeExecEnabledForThisTurn !geminiImageMode gate; take main's webFetchEnabledForThisTurn (sourced from independent webFetchToolsEnabled pill state). - Outbound build chains main's anthropic_refusal survivingMessages prune, then flatMap(toOpenAIMessages) (PR), then PR's selectedImageEditReference reference message prepend; image-edit unavailable toast from main fires before any of that when the pill is off. - tool_end merge: do main's nextArgs spread first, then PR's Gemini native_part parts concat so both OpenAI image-call ids and Gemini executableCode/codeExecutionResult/inlineData round-trip. - Cumulative + final yields: orderAssistantContent(pinTextThoughtSignature(...)) composes main's tool-vs-text ordering with PR's per-text thoughtSignature pin. Tests: gemini provider 148/148; openai_responses_translation + openai_code_execution + openai_image_generation + anthropic_code_execution + anthropic_web_fetch + external_provider_usage_chunk + providers_api: 50 passed, 42 skipped; main's new anthropic_fast_mode + citations + openai_citation_markers + openai_tool_result_fallbacks suites all 43/43.
Summary
speed: \"fast\"and thefast-mode-2026-02-01beta header for Claude Opus 4.6 / 4.7. Up to 2.5x higher output tokens per second at 6x standard Opus pricing per the upstream docs. Hidden on every other model + provider so the picker never offers a knob the upstream API would 400 on; backend silently drops the flag as a second line of defence.message_delta.delta.stop_reason = \"refusal\"on Claude 4 models, the assistant message now ends with a short notice explaining what happened, followed by the existing OpenAI-specfinish_reason = \"content_filter\". Previously the chat bubble truncated silently with nothing to tell the user why the response stopped.References:
Test plan
cd studio && python -m pytest backend/tests/test_anthropic_fast_mode_and_refusal.py -v-- 8 new tests, all passing.fast_modepass-through on Opus 4.6 + 4.7 (body field + beta header).content_filterfinish reason while preserving the partial assistant text.cd studio/frontend && npx tsc -b --pretty falseclean.claude-opus-4-7with fast mode on -- defer to QA once Anthropic surfaces the per-key fast-mode rate limit (waitlist gate).