Summary
When using llama-server with Qwen3.6-35B-A3B and a matching mmproj, consecutive images in a single user message are sometimes merged into super-frames. As a result, 4 uploaded images are interpreted as 2 images, and the model can only describe part of the visual content.
Environment
- llama.cpp release: b9553
- Model: Qwen3.6-35B-A3B-Q4_K_M.gguf
- mmproj: mmproj-Qwen3.6-35B-A3B-BF16.gguf
- Server: llama-server
- UI: browser chat page
- Also reproducible through the OpenAI-compatible API
Reproduction steps
- Start llama-server with:
  llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf --mmproj mmproj-Qwen3.6-35B-A3B-BF16.gguf
- Open the web UI.
- Upload 4 images in one message.
- Ask: How many images are there?
- Observe that the model answers 2 instead of 4.
API reproduction
This works correctly if images are separated by text:
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "[image 1]" },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "text", "text": "[image 2]" },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "text", "text": "[image 3]" },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "text", "text": "[image 4]" },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "text", "text": "How many images are there?" }
      ]
    }
  ]
}
In this case, the model answers 4.
However, when the same 4 images are sent consecutively without text separators:
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "text", "text": "How many images are there?" }
      ]
    }
  ]
}
the model answers 2.
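For anyone who wants to compare the two request shapes side by side, here is a minimal sketch that builds both payloads from local image files and posts them. It assumes llama-server is listening on its default address (http://localhost:8080) and that the images are sent as base64 data URIs with a PNG MIME type. The helper names (image_part, build_payload, ask) are made up for this example and are not part of llama.cpp.

```python
import base64
import json
import sys
import urllib.request

# Assumption: llama-server is running locally on its default port.
SERVER_URL = "http://localhost:8080/v1/chat/completions"
QUESTION = "How many images are there?"


def image_part(data: bytes, mime: str = "image/png") -> dict:
    """Wrap raw image bytes as an OpenAI-style image_url content part."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}"},
    }


def build_payload(images: list[bytes], separated: bool) -> dict:
    """Build the request body; when separated, put a text label before each image."""
    content = []
    for index, data in enumerate(images, start=1):
        if separated:
            content.append({"type": "text", "text": f"[image {index}]"})
        content.append(image_part(data))
    content.append({"type": "text", "text": QUESTION})
    return {"messages": [{"role": "user", "content": content}]}


def ask(payload: dict, url: str = SERVER_URL) -> str:
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python repro.py a.png b.png c.png d.png
    images = []
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            images.append(handle.read())
    for separated in (True, False):
        print(f"separated={separated}:", ask(build_payload(images, separated)))
```

Running it with the same 4 images should print one reply for the separated layout and one for the consecutive layout, which makes the difference easy to capture in a single log.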
Expected behavior
- All 4 images should be treated as 4 separate images.
- The model should answer 4.
- The model should be able to describe content from all 4 images independently.
Actual behavior
- Consecutive images appear to be merged into 2 units.
- The model answers 2.
- When asked about the image contents, it only describes part of the images.
Notes
This seems related to the recent frame-merge / super-frame behavior for consecutive images in Qwen-VL-style models.
It looks like the merge is triggered only when images are consecutive in the content array. If text is inserted between images, the issue disappears.
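Until this is fixed server-side, a client can work around it by rewriting the content array so that no two images are adjacent. The sketch below is one way to do that, mirroring the "[image N]" labels from the working request above. The function name label_images is hypothetical, and this is only a client-side workaround, not a fix for the merge itself.

```python
def label_images(content: list[dict]) -> list[dict]:
    """Return a copy of content with a text label before every image
    that is not already directly preceded by a text part."""
    result: list[dict] = []
    image_count = 0
    for part in content:
        if part.get("type") == "image_url":
            image_count += 1
            previous = result[-1] if result else None
            if previous is None or previous.get("type") != "text":
                # Same label style as the request that is known to work.
                result.append({"type": "text", "text": f"[image {image_count}]"})
        result.append(part)
    return result
```

Content that already has text between the images passes through unchanged, so the function can be applied to every outgoing user message.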
Possible impact
This breaks multimodal behavior for users who upload multiple images in one turn, because the model may under-count images and miss part of the visual context.