Bug Description
The /v1/chat/completions endpoint returns a 500 error with "Failed to parse input at pos 25" when a multimodal projector (--mmproj) is loaded alongside the model. The /completion endpoint works fine with the same model.
Environment
- llama-server version: 1 (d6f999b), built with GNU 11.4.0 for Linux x86_64
- Model: Qwen3.5-27B-Q8_0.gguf + Qwen3.5-27B-mmproj-BF16.gguf
- Hardware: 2x RTX 3090 (48GB VRAM), Linux x86_64
- Launch flags:
llama-server --host 127.0.0.1 --metrics --port 41131 \
--remap-developer-role --alias qwen3.5-27b --cont-batching \
--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
--model /models/gguf/qwen3.5-27b/Qwen3.5-27B-Q8_0.gguf \
--mmproj /models/gguf/qwen3.5-27b/Qwen3.5-27B-mmproj-BF16.gguf \
--n-gpu-layers 999 --parallel 1
Reproduction
Failing request (/v1/chat/completions):
curl -s http://127.0.0.1:41131/v1/chat/completions -X POST \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5-27b","messages":[{"role":"user","content":"hi"}],"max_tokens":8}'
Response:
{"error":{"code":500,"message":"Failed to parse input at pos 25: ","type":"server_error"}}
Both content formats fail: "content": "string" and "content": [{"type": "text", "text": "..."}] produce the same error, differing only in the reported position (pos 25 and pos 53, respectively).
Working request (/completion with manual Qwen chat template):
curl -s http://127.0.0.1:41131/completion -X POST \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5-27b","prompt":"<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n","n_predict":16,"stop":["<|im_end|>"]}'
This request succeeds and returns a valid completion.
Root Cause Analysis
The GGUF file does contain a valid chat template at tokenizer.chat_template (a Qwen vision template with image/video handling). However, the /props endpoint does not report chat_template, which suggests that llama-server ignores or disables the embedded chat template when --mmproj is loaded.
Without a chat template, the chat completions endpoint cannot parse the messages array, which would explain the "Failed to parse input" error.
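The /props observation can be checked with a small script. This is only a sketch: it assumes the server is reachable at the address from the launch flags, and the helper names has_chat_template and fetch_props are made up for illustration.

```python
import json
import urllib.request


def has_chat_template(props: dict) -> bool:
    """Return True if a decoded /props response carries a non-empty chat_template."""
    template = props.get("chat_template")
    return isinstance(template, str) and template.strip() != ""


def fetch_props(base_url: str) -> dict:
    """GET {base_url}/props and decode the JSON body (assumes a running llama-server)."""
    with urllib.request.urlopen(f"{base_url}/props", timeout=10) as resp:
        return json.load(resp)
```

Calling has_chat_template(fetch_props("http://127.0.0.1:41131")) once with --mmproj and once without it should show whether the template disappears only when the projector is loaded.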
Expected Behavior
The /v1/chat/completions endpoint should work with VLM models that have mmproj loaded, using the embedded chat template from the GGUF metadata. Text-only chat requests should be handled normally, and multimodal requests (with image_url content parts) should route through the vision pipeline.
Workaround
Use the /completion endpoint with the Qwen chat template applied manually:
<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n<think>\n</think>\n
For image inputs, use the image_data parameter with the /completion endpoint.
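For text-only requests, the manual templating can be scripted. The sketch below is a simplified stand-in for the embedded template, not a copy of it: it handles only plain string content, and the function names and the empty_think switch are illustrative assumptions.

```python
def build_qwen_prompt(messages, empty_think=True):
    """Render a list of {"role", "content"} messages in the format shown above.

    Only plain string content is handled; image parts are out of scope here.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # Open the assistant turn so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    if empty_think:
        # Pre-filled empty think block, as in the workaround template.
        parts.append("<think>\n</think>\n")
    return "".join(parts)


def build_completion_payload(messages, n_predict=16):
    """Build a JSON-serialisable body for POST /completion."""
    return {
        "prompt": build_qwen_prompt(messages),
        "n_predict": n_predict,
        "stop": ["<|im_end|>"],
    }
```

The resulting dictionary can be serialised with json.dumps and sent to /completion in place of the hand-written body in the working request above.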