chat/completions endpoint returns 500 when mmproj is loaded (Qwen3.5-27B VLM) #16

@marksverdhei

Bug Description

The /v1/chat/completions endpoint returns a 500 error with "Failed to parse input at pos 25" when a multimodal projector (--mmproj) is loaded alongside the model. The /completion endpoint works fine with the same model.

Environment

  • llama-server version: 1 (d6f999b), built with GNU 11.4.0 for Linux x86_64
  • Model: Qwen3.5-27B-Q8_0.gguf + Qwen3.5-27B-mmproj-BF16.gguf
  • Hardware: 2x RTX 3090 (48GB VRAM), Linux x86_64
  • Launch flags:
    llama-server --host 127.0.0.1 --metrics --port 41131 \
      --remap-developer-role --alias qwen3.5-27b --cont-batching \
      --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on \
      --model /models/gguf/qwen3.5-27b/Qwen3.5-27B-Q8_0.gguf \
      --mmproj /models/gguf/qwen3.5-27b/Qwen3.5-27B-mmproj-BF16.gguf \
      --n-gpu-layers 999 --parallel 1
    

Reproduction

Failing request (/v1/chat/completions):

curl -s http://127.0.0.1:41131/v1/chat/completions -X POST \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-27b","messages":[{"role":"user","content":"hi"}],"max_tokens":8}'

Response:

{"error":{"code":500,"message":"Failed to parse input at pos 25: ","type":"server_error"}}

Both content formats fail: "content": "string" and "content": [{"type": "text", "text": "..."}] produce the same error, differing only in the reported position (pos 25 and pos 53 respectively).

Working request (/completion with manual Qwen chat template):

curl -s http://127.0.0.1:41131/completion -X POST \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5-27b","prompt":"<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n","n_predict":16,"stop":["<|im_end|>"]}'

This works perfectly and returns a valid response.

Root Cause Analysis

The GGUF file does contain a valid chat template at tokenizer.chat_template (a Qwen vision template with image/video handling). However, the /props endpoint does not report chat_template, which suggests that llama-server ignores or disables the embedded chat template when --mmproj is loaded.

Without a chat template, the chat completions endpoint cannot parse the messages array, hence the "Failed to parse input" error.
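The /props observation can be checked with a short script. This is only a sketch: it assumes that a missing or empty chat_template field in the /props response means the server has no template loaded, and the helper names (has_chat_template, fetch_props) are made up for illustration.

```python
import json
from urllib.request import urlopen


def has_chat_template(props: dict) -> bool:
    """Return True if a decoded /props payload has a non-empty chat_template.

    Assumption: a missing or blank field means no template is active.
    """
    template = props.get("chat_template")
    return isinstance(template, str) and template.strip() != ""


def fetch_props(base_url: str) -> dict:
    """GET <base_url>/props and decode the JSON body."""
    with urlopen(f"{base_url}/props", timeout=10) as response:
        return json.load(response)


# Usage against the server from this report (needs the server running):
#   props = fetch_props("http://127.0.0.1:41131")
#   print("chat_template reported:", has_chat_template(props))
```

Running this once with --mmproj and once without it should show whether the flag is what makes the template disappear.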

Expected Behavior

The /v1/chat/completions endpoint should work with VLM models that have mmproj loaded, using the embedded chat template from the GGUF metadata. Text-only chat requests should be handled normally, and multimodal requests (with image_url content parts) should route through the vision pipeline.
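For reference, these are the two message shapes I would expect to work. The sketch below builds them in the usual OpenAI-style layout; the helper names are hypothetical, and passing the image as a base64 data URI is an assumption about what the server accepts, not something I have verified against this build.

```python
import base64


def text_message(text: str) -> dict:
    """Plain text-only user message (the shape that currently fails)."""
    return {"role": "user", "content": text}


def image_message(text: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """User message with a text part and an image_url part.

    The image is embedded as a base64 data URI (assumed to be accepted).
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{encoded}"},
            },
        ],
    }


def chat_request(messages: list, model: str = "qwen3.5-27b", max_tokens: int = 8) -> dict:
    """Request body for /v1/chat/completions, mirroring the failing curl call."""
    return {"model": model, "messages": messages, "max_tokens": max_tokens}
```

chat_request([text_message("hi")]) reproduces the body of the failing request above; swapping in image_message(...) gives the multimodal variant.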

Workaround

Use the /completion endpoint with the Qwen chat template applied manually:

<|im_start|>user\n{message}<|im_end|>\n<|im_start|>assistant\n<think>\n</think>\n

For image inputs, use the image_data parameter with the /completion endpoint.
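A minimal sketch of the text-only workaround, in case it saves someone the escaping. The template string is copied from above, and n_predict and stop are taken from the working curl request; the function names are hypothetical and image_data is not covered here.

```python
import json

# Template from the workaround above; {message} is replaced by the user text.
QWEN_TEMPLATE = (
    "<|im_start|>user\n{message}<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n</think>\n"
)


def build_prompt(message: str) -> str:
    """Apply the Qwen chat template by hand to a single user message."""
    return QWEN_TEMPLATE.replace("{message}", message)


def build_completion_payload(message: str, n_predict: int = 16) -> str:
    """JSON body for POST /completion, mirroring the working request."""
    body = {
        "prompt": build_prompt(message),
        "n_predict": n_predict,
        "stop": ["<|im_end|>"],
    }
    return json.dumps(body)
```

The string returned by build_completion_payload("hi") can be passed to curl with -d in place of the hand-written JSON.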

Labels: bug