fix(server): signal unenforced response_format via warning log and Warning header by richgoodson · Pull Request #1564 · jundot/omlx

richgoodson · 2026-05-31T11:48:52Z

/v1/chat/completions accepts response_format with type: "json_schema" and strict: true, returns 200, but the assistant content does not follow the schema. The request looks honored when it isn't, so a client has no way to tell that the output is unconstrained.

Root cause

response_format is only schema-enforced when a grammar compiler is available. _compile_grammar_for_request (server.py) reads engine.grammar_compiler; when that is None (no xgrammar installed, or an engine that never installs one such as DFlash), or when grammar compilation raises, the request silently falls back to prompt injection (server.py:2233). The model is asked to produce JSON in the prompt, but nothing constrains decoding, so undeclared fields appear, required fields are dropped, and on larger inputs the content stops being valid JSON. structured_outputs already returns a 400 in this case; response_format was the silent path.
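
The decision path described above can be sketched roughly as follows. This is an illustration of the pre-fix behavior, not the code in server.py: `select_decoding_mode`, the `Compiler` alias, and `failing_compiler` are hypothetical names, and the grammar compiler is modeled as a plain callable.

```python
from typing import Any, Callable, Optional

# Hypothetical stand-in for engine.grammar_compiler: a callable that compiles
# a response_format dict, or None when no compiler is installed.
Compiler = Optional[Callable[[dict], Any]]


def select_decoding_mode(response_format: Optional[dict], compiler: Compiler) -> str:
    """Return which path a request takes, mirroring the silent fallback."""
    if not response_format:
        return "unconstrained"  # no response_format was sent
    if compiler is None:
        # No xgrammar installed, or an engine that never installs a compiler.
        return "prompt_injection"
    try:
        compiler(response_format)
    except Exception:
        # Compilation raised: same fallback, and nothing told the caller.
        return "prompt_injection"
    return "grammar"


def failing_compiler(fmt: dict) -> Any:
    """Hypothetical compiler that always fails, to exercise the error path."""
    raise ValueError("unsupported schema")
```

Both "prompt_injection" outcomes returned 200 with no signal before this PR, which is the gap the fix closes.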

Fix

The reporter asked for one of two things: enforce the schema, or signal that it wasn't. Enforcing on every engine is a larger change (it needs grammar wiring on engines that don't have a compiler at all). This PR does the signal, on both surfaces:

Operator-visible log. The fallback now logs a warning instead of degrading silently. A strict: true request gets a message that names the unhonored strict intent.
Client-visible header. The response carries an RFC 7234 Warning header (code 199) on both the streaming and non-streaming paths so the caller can detect the degrade without parsing the body. strict requests get a header that says the output is NOT schema-enforced; non-strict response_format requests get a generic one.
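
A minimal sketch of how the two header values could be produced. The message texts are copied from the Evidence section below; the function name `build_warning_header` and the explicit single-line ASCII check are assumptions for illustration, not the helper in this PR.

```python
def build_warning_header(strict: bool) -> str:
    """Build a Warning header value (warn-code 199, warn-agent omlx)."""
    if strict:
        text = (
            "response_format strict json_schema not enforced; "
            "grammar-constrained decoding unavailable, "
            "output is best-effort and NOT schema-enforced"
        )
    else:
        text = (
            "response_format not enforced; "
            "grammar-constrained decoding unavailable, "
            "output is best-effort"
        )
    value = f'199 omlx "{text}"'
    # Header values must stay single-line ASCII to be safe on the wire.
    if not value.isascii() or "\r" in value or "\n" in value:
        raise ValueError("Warning header must be single-line ASCII")
    return value
```

Because the text is a fixed literal, the check never fires here; it is included only to make the single-line ASCII property explicit.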

The request still returns 200 with best-effort content. I chose not to turn an accepted request into a 400. That is a breaking behavior change, and neither I nor the reporter owns this repo. If maintainers prefer a hard reject, the strict-detection helper (_response_format_requests_strict) is the single place to branch on.
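
To make the branch point concrete, here is a hedged sketch of a strict-detection predicate and where a hard reject could hook in. It assumes the OpenAI-style layout in which `strict` sits inside the nested `json_schema` object; `requests_strict`, `degrade_status`, and the `hard_reject` flag are hypothetical and do not exist in this PR.

```python
from typing import Any


def requests_strict(response_format: Any) -> bool:
    """True only for a json_schema format that explicitly sets strict: true."""
    if not isinstance(response_format, dict):
        return False
    if response_format.get("type") != "json_schema":
        return False
    spec = response_format.get("json_schema")
    # Assumption: strict lives at response_format["json_schema"]["strict"].
    return isinstance(spec, dict) and spec.get("strict") is True


def degrade_status(response_format: Any, hard_reject: bool = False) -> int:
    """HTTP status for a request whose schema cannot be enforced."""
    if hard_reject and requests_strict(response_format):
        return 400  # the stricter policy maintainers could opt into
    return 200  # what this PR does: succeed, but signal the degrade
```

With `hard_reject` left at its default, every degraded request keeps returning 200, which matches the behavior chosen here.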

Scope

This addresses /v1/chat/completions, which is what the issue reports. /v1/completions and /v1/messages share _compile_grammar_for_request, so they get the warning log, but not the header. Extending the header to those endpoints is a separate change. Actually enforcing the schema on compiler-less engines (the other half of the reporter's ask) is also separate.

Evidence

Live-tested against a DFlash model (Qwen3.6-27B-4bit), which has no grammar compiler, so every response_format request hits the fallback.

Strict json_schema, non-streaming. The body reproduces the bug from the issue (src instead of dst), and the header now flags it:

HTTP/1.1 200 OK
warning: 199 omlx "response_format strict json_schema not enforced; grammar-constrained decoding unavailable, output is best-effort and NOT schema-enforced"
content-type: application/json

{"choices":[{"message":{"role":"assistant","content":"{\"items\": [{\"id\": \"1\", \"src\": \"Hello.\"}]}"},"finish_reason":"stop"}], ...}

Strict json_schema, streaming (stream: true). Same header on the SSE response:

HTTP/1.1 200 OK
content-type: text/event-stream
warning: 199 omlx "response_format strict json_schema not enforced; grammar-constrained decoding unavailable, output is best-effort and NOT schema-enforced"

Non-strict json_object. Generic header, no strict wording:

HTTP/1.1 200 OK
warning: 199 omlx "response_format not enforced; grammar-constrained decoding unavailable, output is best-effort"

No response_format (control). No Warning header:

HTTP/1.1 200 OK
content-type: application/json
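
On the client side, the header shown in the responses above could be checked along these lines. This is a sketch under assumptions: `unenforced_warning` is a hypothetical helper, `headers` is any plain mapping of response header names to values, and the regex only handles the single `code agent "text"` form that appears in the examples.

```python
import re
from typing import Mapping, Optional

# Matches the form seen above: 199 omlx "some text"
_WARNING_RE = re.compile(r'^(?P<code>\d{3}) (?P<agent>\S+) "(?P<text>.*)"$')


def unenforced_warning(headers: Mapping[str, str]) -> Optional[str]:
    """Return the warning text if the server flagged an unenforced format."""
    for name, value in headers.items():
        if name.lower() != "warning":  # header names are case-insensitive
            continue
        match = _WARNING_RE.match(value.strip())
        if match and match.group("code") == "199" and "not enforced" in match.group("text"):
            return match.group("text")
    return None
```

A caller that sent `strict: true` could treat a non-None result as a reason to validate the body against its schema itself before trusting it.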

Unit tests: tests/test_grammar.py covers the strict-detection helper, the degrade-not-raise behavior, and the header text (strict vs generic, single-line ASCII). The full grammar suite is green (72 passed, 3 skipped; the skips are pre-existing).

Fixes #1241

…onse_format

When a client sends `response_format={type: json_schema, json_schema: {...}}` on a model whose engine exposes no grammar compiler (e.g. the DFlash speculative engine, or any build without xgrammar), `_compile_grammar_for_request` returned `None` silently and the request fell back to prompt injection. On weaker or reasoning models that fallback does not honor the schema: output drifts from the schema, reasoning tags leak into `message.content`, and the content can be invalid JSON. The client got no signal that the schema was not enforced.

Make the downgrade observable instead of silent:

- A `response_format` that cannot be grammar-constrained (no compiler available, or a compilation error) now logs a warning before falling back to prompt injection.
- When `strict: true` was requested, the warning names the unhonored strict intent ("strict enforcement cannot be honored ... output is NOT schema-enforced").

Behavior is otherwise unchanged: the request still succeeds via prompt injection, and `structured_outputs` keeps its existing hard 400 when grammar is unavailable.

The reporter on jundot#1241 explicitly listed rejecting the request (HTTP 400) as an acceptable resolution. I deliberately chose the softer warn-and-fall-back instead: neither the reporter nor I own this repo, and turning a previously-accepted request into a 400 is a breaking change the maintainers have not signed off on. Warning keeps the endpoint backward compatible while still surfacing the lost guarantee, so the maintainers can decide whether to escalate to rejection.

Scope note: this does not add grammar derivation from `tools` / forced `tool_choice` (jundot#1472, jundot#1258) or wire grammar into the DFlash decode path; those are larger, engine-level changes tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… header

The previous commit logged when a response_format request fell back to best-effort prompt injection (no grammar compiler, or a compile error), but that signal was only visible to the operator. A client calling /v1/chat/completions still got a 200 with no way to tell that the schema was not enforced.

Add a Warning response header (RFC 7234, code 199) on both the streaming and non-streaming chat-completion paths whenever the request degrades. strict json_schema requests get a header that names the unhonored strict intent; other response_format requests get a generic one.

This answers the reporter's ask on jundot#1241 for a client-visible signal without turning an accepted request into a 400 (a breaking change we don't own). The existing warn log stays, so the downgrade is now visible to both the operator and the caller. Header values are single-line ASCII by construction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jundot · 2026-06-01T09:13:07Z

Thanks, this direction makes sense to me for #1241. Keeping the existing 200 behavior while surfacing the downgrade to both operators and clients is a reasonable compatibility-preserving fix.

I found one condition that looks too broad before merge. In create_chat_completion, the check `compiled_grammar is None and response_format` also matches response_format: {"type": "text"}, where no grammar enforcement was requested, so the response would incorrectly carry a Warning header saying response_format was not enforced.

Could you gate the header/fallback-warning path on a response format that actually maps to grammar-constrained JSON output (json_object / json_schema) and add a small test that type: "text" does not emit a Warning header?
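
The over-broad condition and the requested gate can be contrasted in a small sketch. The function names and the `GRAMMAR_TYPES` set are hypothetical; only the condition `compiled_grammar is None and response_format` and the json_object / json_schema type names come from the review above.

```python
from typing import Any

# Formats that map to grammar-constrained JSON output, per the review.
GRAMMAR_TYPES = {"json_object", "json_schema"}


def old_should_warn(compiled_grammar: Any, response_format: Any) -> bool:
    """The flagged condition: any truthy response_format triggers the warning."""
    return compiled_grammar is None and bool(response_format)


def gated_should_warn(compiled_grammar: Any, response_format: Any) -> bool:
    """Warn only when the format actually asked for grammar enforcement."""
    if compiled_grammar is not None or not isinstance(response_format, dict):
        return False
    return response_format.get("type") in GRAMMAR_TYPES
```

A `{"type": "text"}` dict is truthy, so the old condition fires for it even though nothing enforceable was requested; the gated version stays silent.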

jundot's jundot#1241 review flagged the degrade condition as too broad: in create_chat_completion, `compiled_grammar is None and response_format` also matched `response_format: {"type": "text"}`, where no grammar enforcement was ever requested, so the response wrongly carried a Warning header (and a prompt-injection fallback) saying response_format was not enforced.

Gate the header/fallback path on a new `_response_format_requests_grammar` helper (sibling of `_response_format_requests_strict`) so only the formats that actually map to grammar-constrained JSON output (json_object / json_schema) can be reported as unenforced. A plain text format never asked for enforcement and now stays silent.

Add TestResponseFormatRequestsGrammar covering the predicate, including that `type: "text"` does not request grammar (and so emits no Warning header) per the review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…mat_element

Replace the type-only predicate with a delegation to _build_format_element so the unenforced-degrade signal stays in sync with what actually gets compiled. A grammar element is built only for json_object and a json_schema that carries a schema; a json_schema with no schema now maps to nothing and no longer emits a Warning header claiming "grammar-constrained decoding unavailable" when the request never described an enforceable grammar.

This also keeps the client-facing header consistent with the server-side warn log, which already keys off the same buildability check.

Add tests pinning that a schemaless json_schema and an unknown type do not request grammar.

richgoodson · 2026-06-01T14:17:39Z

Addressed the review. I gated the Warning header and prompt-injection fallback on formats that actually map to grammar-constrained JSON output (json_object / json_schema), so a response_format: {"type": "text"} no longer carries a header saying it was not enforced. Added a test covering that case.

I derived the gate from _build_format_element rather than matching on the type string, so it stays in sync with the existing warn log: a json_schema with no schema maps to no grammar and no longer claims "grammar-constrained decoding unavailable" when nothing enforceable was requested.
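
A hedged sketch of that delegation, for readers who want to see its shape. `build_format_element` here is a simplified stand-in for `_build_format_element`, not its real body: the returned dict layout is invented, and the schema is assumed to sit at `json_schema.schema` as in the OpenAI-style request format.

```python
from typing import Any, Optional


def build_format_element(response_format: Any) -> Optional[dict]:
    """Stand-in builder: return an element only for enforceable formats."""
    if not isinstance(response_format, dict):
        return None
    kind = response_format.get("type")
    if kind == "json_object":
        return {"kind": "any_json"}
    if kind == "json_schema":
        spec = response_format.get("json_schema")
        schema = spec.get("schema") if isinstance(spec, dict) else None
        if schema:
            return {"kind": "json_schema", "schema": schema}
    return None  # text, unknown types, and schemaless json_schema


def requests_grammar(response_format: Any) -> bool:
    """The gate: a format requests grammar only if an element can be built."""
    return build_format_element(response_format) is not None
```

Deriving the predicate from the builder means a new format type only has to be taught to the builder for the warning gate to follow.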

Full tests/test_grammar.py passes (63 passed, 20 skipped where xgrammar is not installed).

jundot · 2026-06-06T17:17:38Z

Thanks for addressing the warning gate. I rechecked the final diff against #1241: the chat completion path now only warns when response_format maps to an enforceable grammar, text and schemaless json_schema stay silent, and the Warning header is attached on both streaming and non-streaming chat responses. CI is green, and this looks good to me. I'll merge this.

richgoodson and others added 2 commits May 30, 2026 22:29

richgoodson marked this pull request as ready for review May 31, 2026 12:02

richgoodson and others added 2 commits June 1, 2026 08:54

jundot merged commit 001b343 into jundot:main Jun 6, 2026
4 checks passed

richgoodson deleted the fix/1241-strict-response-format-enforce branch June 6, 2026 18:25

jundot merged 4 commits into jundot:main from richgoodson:fix/1241-strict-response-format-enforce
