[BUG] Qwen3.6-35B-A3B / llama-server merges consecutive images into 2 frames, causing incorrect image count and partial image understanding #24303

@ucgggg

Summary

When using llama-server with Qwen3.6-35B-A3B and a matching mmproj, consecutive images in a single user message are sometimes merged into super-frames. As a result, 4 uploaded images are interpreted as 2 images, and the model can only describe part of the visual content.

Environment

  • llama.cpp release: b9553
  • Model: Qwen3.6-35B-A3B-Q4_K_M.gguf
  • mmproj: mmproj-Qwen3.6-35B-A3B-BF16.gguf
  • Server: llama-server
  • UI: browser chat page
  • Also reproducible through the OpenAI-compatible API

Reproduction steps

  1. Start llama-server with:

    llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf --mmproj mmproj-Qwen3.6-35B-A3B-BF16.gguf

  2. Open the web UI.

  3. Upload 4 images in one message.

  4. Ask: How many images are there?

  5. Observe that the model answers 2 instead of 4.

API reproduction

This works correctly if images are separated by text:

{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "[image 1]" },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "text", "text": "[image 2]" },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "text", "text": "[image 3]" },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "text", "text": "[image 4]" },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "text", "text": "How many images are there?" }
      ]
    }
  ]
}

In this case, the model answers 4.

However, when the same 4 images are sent consecutively without text separators:

{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "image_url", "image_url": { "url": "<BASE64>" } },
        { "type": "text", "text": "How many images are there?" }
      ]
    }
  ]
}

the model answers 2.
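
For anyone who wants to script the comparison, here is a minimal Python sketch that builds both request bodies from the same list of images. It is only an illustration: the helper names (`image_part`, `build_payload`, `ask`), the `data:` URI form used for the `url` field, the `/v1/chat/completions` path, and the `http://127.0.0.1:8080` address are assumptions about a typical local llama-server setup, not something confirmed in this report.

```python
import base64
import json
import urllib.request


def image_part(image_bytes, mime="image/png"):
    """Wrap raw image bytes as an image_url content part (data URI form is an assumption)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}


def build_payload(images, question, separated):
    """Build the request body; with separated=True a "[image N]" text part precedes each image."""
    content = []
    for i, img in enumerate(images, start=1):
        if separated:
            content.append({"type": "text", "text": f"[image {i}]"})
        content.append(image_part(img))
    content.append({"type": "text", "text": question})
    return {"messages": [{"role": "user", "content": content}]}


def ask(payload, base_url="http://127.0.0.1:8080"):
    """POST the payload to the server (address and path are assumptions) and return the reply text."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example usage (requires a running server and four local image files):
# images = [open(p, "rb").read() for p in ["1.png", "2.png", "3.png", "4.png"]]
# print(ask(build_payload(images, "How many images are there?", separated=True)))
# print(ask(build_payload(images, "How many images are there?", separated=False)))
```

With the behavior described here, the first call would be expected to report 4 images and the second 2.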

Expected behavior

  • All 4 images should be treated as 4 separate images.
  • The model should answer 4.
  • The model should be able to describe content from all 4 images independently.

Actual behavior

  • Consecutive images appear to be merged into 2 units.
  • The model answers 2.
  • When asked about the image contents, it only describes part of the images.

Notes

This seems related to the recent frame-merge / super-frame behavior for consecutive images in Qwen-VL-style models.

It looks like the merge is triggered only when images are consecutive in the content array. If text is inserted between images, the issue disappears.
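
Until this is fixed, a client-side workaround could follow the same pattern. The sketch below inserts a short text part between any two adjacent `image_url` parts before the request is sent. The function name `separate_consecutive_images` is hypothetical, and I have only confirmed the separator effect for the request shown above (where every image is preceded by a label), so treat it as a sketch under that assumption rather than a verified fix.

```python
def separate_consecutive_images(content):
    """Return a copy of a content array with a text label between adjacent image parts.

    Hypothetical client-side workaround: parts that are already separated
    by text are left untouched.
    """
    result = []
    image_index = 0
    for part in content:
        if part.get("type") == "image_url":
            image_index += 1
            # Only add a label when the previous part is also an image.
            if result and result[-1].get("type") == "image_url":
                result.append({"type": "text", "text": f"[image {image_index}]"})
        result.append(part)
    return result
```

Applied to the failing request above, this yields image, "[image 2]", image, "[image 3]", image, "[image 4]", image, followed by the question.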

Possible impact

This breaks multimodal behavior for users who upload multiple images in one turn, because the model may under-count images and miss part of the visual context.
