fix(server): signal unenforced response_format via warning log and Warning header #1564

Merged
jundot merged 4 commits into jundot:main from richgoodson:fix/1241-strict-response-format-enforce
Jun 6, 2026

@richgoodson

Contributor

/v1/chat/completions accepts response_format with type: "json_schema" and strict: true and returns 200, but the assistant content does not follow the schema. The request looks honored when it isn't, so a client cannot tell that the output is unconstrained.

Root cause

response_format is only schema-enforced when a grammar compiler is available. _compile_grammar_for_request (server.py) reads engine.grammar_compiler; when that is None (no xgrammar installed, or an engine that never installs one such as DFlash), or when grammar compilation raises, the request silently falls back to prompt injection (server.py:2233). The model is asked to produce JSON in the prompt, but nothing constrains decoding, so undeclared fields appear, required fields are dropped, and on larger inputs the content stops being valid JSON. structured_outputs already returns a 400 in this case; response_format was the silent path.
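
To make the two fallback triggers concrete, here is a minimal sketch of the pre-fix decision. The function name and the compiler callable are hypothetical stand-ins for illustration; this is not the code in server.py.

```python
from typing import Any, Callable, Optional


def compile_or_fall_back(
    compiler: Optional[Callable[[dict], Any]],
    response_format: dict,
) -> Optional[Any]:
    """Return a compiled grammar, or None when decoding cannot be constrained.

    Hypothetical sketch: `compiler` stands in for engine.grammar_compiler.
    """
    if compiler is None:
        # Trigger 1: no compiler on this engine or build.
        return None
    try:
        return compiler(response_format)
    except Exception:
        # Trigger 2: compilation raised. Same silent fallback.
        return None
```

In both None cases the caller only injects the schema into the prompt, and nothing in the return value or the response tells the client that this happened.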

Fix

The reporter asked for one of two things: enforce the schema, or signal that it wasn't. Enforcing on every engine is a larger change (it needs grammar wiring on engines that don't have a compiler at all). This PR does the signal, on both surfaces:

  • Operator-visible log. The fallback now logs a warning instead of degrading silently. A strict: true request gets a message that names the unhonored strict intent.
  • Client-visible header. The response carries an RFC 7234 Warning header (code 199) on both the streaming and non-streaming paths so the caller can detect the degrade without parsing the body. strict requests get a header that says the output is NOT schema-enforced; non-strict response_format requests get a generic one.
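
A rough sketch of how the header value could be chosen, using the header wording quoted under Evidence. The function names are hypothetical, and reading strict from the nested json_schema object is an assumption based on the OpenAI-style request shape, not a copy of _response_format_requests_strict.

```python
from typing import Any, Mapping

# Header wording taken from the responses shown in this PR.
STRICT_TEXT = (
    "response_format strict json_schema not enforced; "
    "grammar-constrained decoding unavailable, "
    "output is best-effort and NOT schema-enforced"
)
GENERIC_TEXT = (
    "response_format not enforced; "
    "grammar-constrained decoding unavailable, output is best-effort"
)


def requests_strict(response_format: Any) -> bool:
    """True only for json_schema requests that set strict: true (assumed shape)."""
    if not isinstance(response_format, Mapping):
        return False
    if response_format.get("type") != "json_schema":
        return False
    spec = response_format.get("json_schema")
    return isinstance(spec, Mapping) and spec.get("strict") is True


def warning_header_value(response_format: Any, agent: str = "omlx") -> str:
    """Build a single-line Warning value: code 199, agent, quoted text."""
    text = STRICT_TEXT if requests_strict(response_format) else GENERIC_TEXT
    return f'199 {agent} "{text}"'
```

Keeping the texts as fixed ASCII constants is one simple way to guarantee the value never contains a newline or a non-ASCII character.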

The request still returns 200 with best-effort content. I chose not to turn an accepted request into a 400. That is a breaking behavior change, and neither I nor the reporter owns this repo. If maintainers prefer a hard reject, the strict-detection helper (_response_format_requests_strict) is the single place to branch on.

Scope

This addresses /v1/chat/completions, which is what the issue reports. /v1/completions and /v1/messages share _compile_grammar_for_request, so they get the warning log, but not the header. Extending the header to those endpoints is a separate change. Actually enforcing the schema on compiler-less engines (the other half of the reporter's ask) is also separate.

Evidence

Live-tested against a DFlash model (Qwen3.6-27B-4bit), which has no grammar compiler, so every response_format request hits the fallback.

Strict json_schema, non-streaming. The body reproduces the bug from the issue (src instead of dst), and the header now flags it:

HTTP/1.1 200 OK
warning: 199 omlx "response_format strict json_schema not enforced; grammar-constrained decoding unavailable, output is best-effort and NOT schema-enforced"
content-type: application/json

{"choices":[{"message":{"role":"assistant","content":"{\"items\": [{\"id\": \"1\", \"src\": \"Hello.\"}]}"},"finish_reason":"stop"}], ...}

Strict json_schema, streaming (stream: true). Same header on the SSE response:

HTTP/1.1 200 OK
content-type: text/event-stream
warning: 199 omlx "response_format strict json_schema not enforced; grammar-constrained decoding unavailable, output is best-effort and NOT schema-enforced"

Non-strict json_object. Generic header, no strict wording:

HTTP/1.1 200 OK
warning: 199 omlx "response_format not enforced; grammar-constrained decoding unavailable, output is best-effort"

No response_format (control). No Warning header:

HTTP/1.1 200 OK
content-type: application/json
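
On the client side, the responses above can be checked without parsing the body. This is a minimal sketch that assumes the caller has the response headers as a plain mapping; the function names are hypothetical, and matching on the header wording is brittle if the text ever changes.

```python
import re
from typing import Mapping, Optional

# warn-code, warn-agent, then a quoted warn-text.
_WARNING_RE = re.compile(r'^\s*(\d{3})\s+(\S+)\s+"((?:[^"\\]|\\.)*)"')


def unenforced_warning(headers: Mapping[str, str]) -> Optional[str]:
    """Return the warn-text of a 199 Warning about response_format, else None."""
    for name, value in headers.items():
        if name.lower() != "warning":
            continue
        match = _WARNING_RE.match(value)
        if match and match.group(1) == "199" and "response_format" in match.group(3):
            return match.group(3)
    return None


def strict_was_unenforced(headers: Mapping[str, str]) -> bool:
    """True when the header carries the strict wording shown above."""
    text = unenforced_warning(headers)
    return text is not None and "NOT schema-enforced" in text
```

A caller that needs schema-valid output could validate the body itself, or retry against another engine, whenever unenforced_warning returns a value.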

Unit tests: tests/test_grammar.py covers the strict-detection helper, the degrade-not-raise behavior, and the header text (strict vs generic, single-line ASCII). The full grammar suite is green (72 passed, 3 skipped; the skips are pre-existing).

Fixes #1241

richgoodson and others added 2 commits May 30, 2026 22:29
…onse_format

When a client sends `response_format={type: json_schema, json_schema: {...}}`
on a model whose engine exposes no grammar compiler (e.g. the DFlash speculative
engine, or any build without xgrammar), `_compile_grammar_for_request` returned
`None` silently and the request fell back to prompt injection. On weaker or
reasoning models that fallback does not honor the schema: output drifts from the
schema, reasoning tags leak into `message.content`, and the content can be invalid
JSON. The client got no signal that the schema was not enforced.

Make the downgrade observable instead of silent:

- A `response_format` that cannot be grammar-constrained (no compiler available, or
  a compilation error) now logs a warning before falling back to prompt injection.
- When `strict: true` was requested, the warning names the unhonored strict intent
  ("strict enforcement cannot be honored ... output is NOT schema-enforced").

Behavior is otherwise unchanged: the request still succeeds via prompt injection,
and `structured_outputs` keeps its existing hard 400 when grammar is unavailable.

The reporter on jundot#1241 explicitly listed rejecting the request (HTTP 400) as an
acceptable resolution. I deliberately chose the softer warn-and-fall-back instead:
neither the reporter nor I own this repo, and turning a previously-accepted request
into a 400 is a breaking change the maintainers have not signed off on. Warning
keeps the endpoint backward compatible while still surfacing the lost guarantee, so
the maintainers can decide whether to escalate to rejection.

Scope note: this does not add grammar derivation from `tools` / forced
`tool_choice` (jundot#1472, jundot#1258) or wire grammar into the DFlash decode path; those are
larger, engine-level changes tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… header

The previous commit logged when a response_format request fell back to
best-effort prompt injection (no grammar compiler, or a compile error),
but that signal was only visible to the operator. A client calling
/v1/chat/completions still got a 200 with no way to tell that the schema
was not enforced.

Add a Warning response header (RFC 7234, code 199) on both the streaming
and non-streaming chat-completion paths whenever the request degrades.
strict json_schema requests get a header that names the unhonored strict
intent; other response_format requests get a generic one. This answers
the reporter's ask on jundot#1241 for a client-visible signal without turning
an accepted request into a 400 (a breaking change we don't own). The
existing warn log stays, so the downgrade is now visible to both the
operator and the caller.

Header values are single-line ASCII by construction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@richgoodson richgoodson marked this pull request as ready for review May 31, 2026 12:02
@jundot

jundot commented Jun 1, 2026

Owner

Thanks, this direction makes sense to me for #1241. Keeping the existing 200 behavior while surfacing the downgrade to both operators and clients is a reasonable compatibility-preserving fix.

I found one condition that looks too broad before merge. In create_chat_completion, the check `compiled_grammar is None and response_format` also matches response_format: {"type": "text"}, where no grammar enforcement was requested, so the response would incorrectly carry a Warning header saying response_format was not enforced.

Could you gate the header/fallback-warning path on a response format that actually maps to grammar-constrained JSON output (json_object / json_schema) and add a small test that type: "text" does not emit a Warning header?

richgoodson and others added 2 commits June 1, 2026 08:54
jundot's jundot#1241 review flagged the degrade condition as too broad: in
create_chat_completion, `compiled_grammar is None and response_format`
also matched `response_format: {"type": "text"}`, where no grammar
enforcement was ever requested, so the response wrongly carried a
Warning header (and a prompt-injection fallback) saying response_format
was not enforced.

Gate the header/fallback path on a new `_response_format_requests_grammar`
helper (sibling of `_response_format_requests_strict`) so only the
formats that actually map to grammar-constrained JSON output
(json_object / json_schema) can be reported as unenforced. A plain text
format never asked for enforcement and now stays silent.

Add TestResponseFormatRequestsGrammar covering the predicate, including
that `type: "text"` does not request grammar (and so emits no Warning
header) per the review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mat_element

Replace the type-only predicate with a delegation to _build_format_element
so the unenforced-degrade signal stays in sync with what actually gets
compiled. A grammar element is built only for json_object and a json_schema
that carries a schema; a json_schema with no schema now maps to nothing and
no longer emits a Warning header claiming "grammar-constrained decoding
unavailable" when the request never described an enforceable grammar. This
also keeps the client-facing header consistent with the server-side warn log,
which already keys off the same buildability check.

Add tests pinning that a schemaless json_schema and an unknown type do not
request grammar.
@richgoodson

Contributor Author

Addressed the review. I gated the Warning header and prompt-injection fallback on formats that actually map to grammar-constrained JSON output (json_object / json_schema), so a response_format: {"type": "text"} no longer carries a header saying it was not enforced. Added a test covering that case.

I derived the gate from _build_format_element rather than matching on the type string, so it stays in sync with the existing warn log: a json_schema with no schema maps to no grammar and no longer claims "grammar-constrained decoding unavailable" when nothing enforceable was requested.
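
The idea of deriving the gate from the builder can be sketched as follows. build_format_element here is a hypothetical stand-in written for illustration; the real _build_format_element may return a different structure and handle more cases.

```python
from typing import Any, Mapping, Optional


def build_format_element(response_format: Any) -> Optional[dict]:
    """Stand-in builder: returns an element only for enforceable formats."""
    if not isinstance(response_format, Mapping):
        return None
    kind = response_format.get("type")
    if kind == "json_object":
        return {"kind": "json_object"}
    if kind == "json_schema":
        spec = response_format.get("json_schema")
        schema = spec.get("schema") if isinstance(spec, Mapping) else None
        if isinstance(schema, Mapping):
            return {"kind": "json_schema", "schema": dict(schema)}
    # "text", unknown types, and a json_schema without a schema build nothing.
    return None


def requests_grammar(response_format: Any) -> bool:
    """Gate: warn only when the builder would have produced a grammar element."""
    return build_format_element(response_format) is not None
```

Because the predicate asks the builder instead of repeating its rules, a format that builds no grammar element can never be reported as unenforced.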

Full tests/test_grammar.py passes (63 passed, 20 skipped where xgrammar is not installed).

@jundot

jundot commented Jun 6, 2026

Owner

Thanks for addressing the warning gate. I rechecked the final diff against #1241: the chat completion path now only warns when response_format maps to an enforceable grammar, text and schemaless json_schema stay silent, and the Warning header is attached on both streaming and non-streaming chat responses. CI is green, and this looks good to me. I'll merge this.

@jundot jundot merged commit 001b343 into jundot:main Jun 6, 2026
4 checks passed
@richgoodson richgoodson deleted the fix/1241-strict-response-format-enforce branch June 6, 2026 18:25
Development

Successfully merging this pull request may close these issues.

response_format.type=json_schema is accepted by /v1/chat/completions but not enforced in assistant content

2 participants