fix(server): signal unenforced response_format via warning log and Warning header #1564
…onse_format
When a client sends `response_format={type: json_schema, json_schema: {...}}`
on a model whose engine exposes no grammar compiler (e.g. the DFlash speculative
engine, or any build without xgrammar), `_compile_grammar_for_request` returned
`None` silently and the request fell back to prompt injection. On weaker or
reasoning models that fallback does not honor the schema: output drifts from the
schema, reasoning tags leak into `message.content`, and the content can be invalid
JSON. The client got no signal that the schema was not enforced.
Make the downgrade observable instead of silent:
- A `response_format` that cannot be grammar-constrained (no compiler available, or
a compilation error) now logs a warning before falling back to prompt injection.
- When `strict: true` was requested, the warning names the unhonored strict intent
("strict enforcement cannot be honored ... output is NOT schema-enforced").
Behavior is otherwise unchanged: the request still succeeds via prompt injection,
and `structured_outputs` keeps its existing hard 400 when grammar is unavailable.
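The degrade-not-raise behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in `server.py`: `compile_or_degrade`, `requests_strict`, and the `compiler` callable are hypothetical stand-ins, the log wording is only approximated from the message quoted above, and the placement of `strict` inside `json_schema` assumes the OpenAI-style request shape.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("server")


def requests_strict(response_format: Optional[dict]) -> bool:
    """Hypothetical helper: True when a json_schema format sets strict: true."""
    if not response_format or response_format.get("type") != "json_schema":
        return False
    json_schema = response_format.get("json_schema") or {}
    return json_schema.get("strict") is True


def compile_or_degrade(
    response_format: Optional[dict],
    compiler: Optional[Callable[[dict], Any]],
) -> Any:
    """Return a compiled grammar, or None after logging why enforcement was lost."""
    if not response_format:
        return None  # nothing was requested, so nothing to warn about

    if compiler is None:
        reason = "no grammar compiler available"
    else:
        try:
            return compiler(response_format)
        except Exception as exc:  # compile errors degrade instead of raising
            reason = f"grammar compilation failed: {exc}"

    if requests_strict(response_format):
        logger.warning(
            "response_format strict enforcement cannot be honored (%s); "
            "output is NOT schema-enforced",
            reason,
        )
    else:
        logger.warning(
            "response_format cannot be grammar-constrained (%s); "
            "falling back to prompt injection",
            reason,
        )
    return None
```

The caller treats a `None` result as "use prompt injection", so the request still succeeds; only the log line changes.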
The reporter on jundot#1241 explicitly listed rejecting the request (HTTP 400) as an
acceptable resolution. I deliberately chose the softer warn-and-fall-back instead:
neither the reporter nor I own this repo, and turning a previously-accepted request
into a 400 is a breaking change the maintainers have not signed off on. Warning
keeps the endpoint backward compatible while still surfacing the lost guarantee, so
the maintainers can decide whether to escalate to rejection.
Scope note: this does not add grammar derivation from `tools` / forced
`tool_choice` (jundot#1472, jundot#1258) or wire grammar into the DFlash decode path; those are
larger, engine-level changes tracked separately.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… header
The previous commit logged when a response_format request fell back to best-effort prompt injection (no grammar compiler, or a compile error), but that signal was only visible to the operator. A client calling /v1/chat/completions still got a 200 with no way to tell that the schema was not enforced.
Add a Warning response header (RFC 7234, code 199) on both the streaming and non-streaming chat-completion paths whenever the request degrades. strict json_schema requests get a header that names the unhonored strict intent; other response_format requests get a generic one.
This answers the reporter's ask on jundot#1241 for a client-visible signal without turning an accepted request into a 400 (a breaking change we don't own). The existing warn log stays, so the downgrade is now visible to both the operator and the caller. Header values are single-line ASCII by construction.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
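For illustration, a header value of the form RFC 7234 describes (`warn-code SP warn-agent SP warn-text`, with the text as a quoted string) could be built along these lines. This is a sketch under assumptions: the function names, the `-` pseudonym used as warn-agent, and the exact message wording are not taken from the PR.

```python
def warning_header_value(text: str, code: int = 199, agent: str = "-") -> str:
    """Build a single-line ASCII value for a Warning header (hypothetical helper)."""
    # Collapse newlines, tabs, and runs of whitespace into single spaces.
    single_line = " ".join(text.split())
    # Replace any non-ASCII character with "?" so the value stays ASCII.
    ascii_text = single_line.encode("ascii", "replace").decode("ascii")
    # Escape backslashes and quotes so the quoted-string stays well-formed.
    escaped = ascii_text.replace("\\", "\\\\").replace('"', '\\"')
    return f'{code} {agent} "{escaped}"'


def degrade_warning(strict: bool) -> str:
    """Pick strict or generic wording (the wording here is assumed)."""
    if strict:
        text = (
            "response_format strict enforcement cannot be honored; "
            "output is NOT schema-enforced"
        )
    else:
        text = "response_format was not grammar-enforced; output is best-effort"
    return warning_header_value(text)
```

A handler would then attach the value to the response, for example `headers={"Warning": degrade_warning(strict=True)}`, on both the streaming and non-streaming paths.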
Thanks, this direction makes sense to me for #1241. Keeping the existing 200 behavior while surfacing the downgrade to both operators and clients is a reasonable compatibility-preserving fix. I found one condition that looks too broad before merge. In … Could you gate the header/fallback-warning path on a response format that actually maps to grammar-constrained JSON output (…
jundot's jundot#1241 review flagged the degrade condition as too broad: in create_chat_completion, `compiled_grammar is None and response_format` also matched `response_format: {"type": "text"}`, where no grammar enforcement was ever requested, so the response wrongly carried a Warning header (and a prompt-injection fallback) saying response_format was not enforced.
Gate the header/fallback path on a new `_response_format_requests_grammar` helper (sibling of `_response_format_requests_strict`) so only the formats that actually map to grammar-constrained JSON output (json_object / json_schema) can be reported as unenforced. A plain text format never asked for enforcement and now stays silent.
Add TestResponseFormatRequestsGrammar covering the predicate, including that `type: "text"` does not request grammar (and so emits no Warning header) per the review.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mat_element
Replace the type-only predicate with a delegation to _build_format_element so the unenforced-degrade signal stays in sync with what actually gets compiled. A grammar element is built only for json_object and a json_schema that carries a schema; a json_schema with no schema now maps to nothing and no longer emits a Warning header claiming "grammar-constrained decoding unavailable" when the request never described an enforceable grammar.
This also keeps the client-facing header consistent with the server-side warn log, which already keys off the same buildability check.
Add tests pinning that a schemaless json_schema and an unknown type do not request grammar.
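The delegation can be pictured with a small sketch. The names mirror the helpers mentioned in the commit messages, but the bodies are assumptions: in particular, the shape of the returned element and the choice to treat a missing (`None`) schema as "no schema" are illustrative, not the repository's actual `_build_format_element`.

```python
from typing import Any, Optional


def build_format_element(response_format: Optional[dict]) -> Optional[dict]:
    """Hypothetical stand-in: return a grammar element, or None if none can be built."""
    if not response_format:
        return None
    kind = response_format.get("type")
    if kind == "json_object":
        # Any JSON object is acceptable, so constrain to a bare object schema.
        return {"type": "json_schema", "json_schema": {"type": "object"}}
    if kind == "json_schema":
        schema: Any = (response_format.get("json_schema") or {}).get("schema")
        if schema is not None:
            return {"type": "json_schema", "json_schema": schema}
    # "text", unknown types, and a json_schema without a schema build nothing.
    return None


def response_format_requests_grammar(response_format: Optional[dict]) -> bool:
    """A format requests grammar exactly when an element can be built for it."""
    return build_format_element(response_format) is not None
```

Because the predicate has no type list of its own, a format that the builder starts or stops supporting changes the warning gate automatically.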
Addressed the review. I gated the Warning header and prompt-injection fallback on formats that actually map to grammar-constrained JSON output (json_object / json_schema), so a … I derived the gate from … Full …
Thanks for addressing the warning gate. I rechecked the final diff against #1241: the chat completion path now only warns when response_format maps to an enforceable grammar, text and schemaless json_schema stay silent, and the Warning header is attached on both streaming and non-streaming chat responses. CI is green, and this looks good to me. I'll merge this.
`/v1/chat/completions` accepts `response_format` with `type: "json_schema"` and `strict: true`, returns 200, but the assistant content does not follow the schema. The request looks honored when it isn't, so a client has no way to tell that the output is unconstrained.
Root cause
`response_format` is only schema-enforced when a grammar compiler is available. `_compile_grammar_for_request` (server.py) reads `engine.grammar_compiler`; when that is `None` (no xgrammar installed, or an engine that never installs one such as DFlash), or when grammar compilation raises, the request silently falls back to prompt injection (server.py:2233). The model is asked to produce JSON in the prompt, but nothing constrains decoding, so undeclared fields appear, required fields are dropped, and on larger inputs the content stops being valid JSON. `structured_outputs` already returns a 400 in this case; `response_format` was the silent path.
Fix
The reporter asked for one of two things: enforce the schema, or signal that it wasn't. Enforcing on every engine is a larger change (it needs grammar wiring on engines that don't have a compiler at all). This PR does the signal, on both surfaces:
- A warning log. A `strict: true` request gets a message that names the unhonored strict intent.
- A `Warning` header (code 199) on both the streaming and non-streaming paths so the caller can detect the degrade without parsing the body. Strict requests get a header that says the output is NOT schema-enforced; non-strict `response_format` requests get a generic one.
The request still returns 200 with best-effort content. I chose not to turn an accepted request into a 400. That is a breaking behavior change, and neither I nor the reporter owns this repo. If maintainers prefer a hard reject, the strict-detection helper (`_response_format_requests_strict`) is the single place to branch on.
Scope
This addresses `/v1/chat/completions`, which is what the issue reports. `/v1/completions` and `/v1/messages` share `_compile_grammar_for_request`, so they get the warning log, but not the header. Extending the header to those endpoints is a separate change. Actually enforcing the schema on compiler-less engines (the other half of the reporter's ask) is also separate.
Evidence
Live-tested against a DFlash model (`Qwen3.6-27B-4bit`), which has no grammar compiler, so every `response_format` request hits the fallback.
- Strict `json_schema`, non-streaming. The body reproduces the bug from the issue (`src` instead of `dst`), and the header now flags it.
- Strict `json_schema`, streaming (`stream: true`). Same header on the SSE response.
- Non-strict `json_object`. Generic header, no strict wording.
- No `response_format` (control). No `Warning` header.
Unit tests: `tests/test_grammar.py` covers the strict-detection helper, the degrade-not-raise behavior, and the header text (strict vs generic, single-line ASCII). Full grammar suite is green (72 passed, 3 skipped; the skips are pre-existing).
Fixes #1241