mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output)#13784
ngxson merged 17 commits into ggml-org:master
Conversation
wow. I like this
@ggerganov Sorry for including quite a lot of changes in one single PR. The global idea of this PR is to use 2 dedicated `clip_ctx` instances, one for audio and one for vision.
Don't hesitate to ping if something is not clear to you. Thanks!

Amazing work!!

Test results:
tools/mtmd/clip.cpp
```cpp
    img->nx = hparams.warmup_audio_size;
    img->ny = hparams.n_mel_bins;
}
img->buf.resize(img->nx * img->ny * 3);
```
Is this needed for audio modalities? We have a single channel in this case, correct?
Indeed, only the image shape is needed during this warmup, so we don't actually need to allocate this buffer. I removed it in 0531096
```cpp
// M-RoPE for audio
void set_position_mrope_1d(llama_pos pos_0, int32_t n_tokens, llama_seq_id seq_id) {
    GGML_ASSERT(n_pos_per_embd == 4);
    seq_id_0[0] = seq_id;
    for (int i = 0; i < n_tokens; i++) {
        pos[i                     ] = pos_0 + i;
        pos[i + batch.n_tokens    ] = pos_0 + i;
        pos[i + batch.n_tokens * 2] = pos_0 + i;
        pos[i + batch.n_tokens * 3] = 0; // last pos dim is unused
    }
    for (int i = 0; i < batch.n_tokens; i++) {
        batch.n_seq_id[i] = 1;
        batch.seq_id  [i] = seq_id_0.data();
        batch.logits  [i] = false;
    }
}
```
I think here n_tokens and batch.n_tokens refer to the same thing. If so, should simplify by using for example only batch.n_tokens and remove the n_tokens argument. Or vice versa.
Yes, thanks for noticing, it's leftover code from the 2D version. I removed it in 27a8f26
Just FYI, with the Vulkan backend (AMD RADV RENOIR) and the Q8_0 mmproj file, the server crashes with:

Details:

```
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 5510 (a8ea03d8) with cc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7) for x86_64-redhat-linux
system info: n_threads = 5, n_threads_batch = 6, total_threads = 12
```

(mmproj-Qwen2.5-Omni-7B-Q8_0.gguf)

The error is avoided with … The FP16 mmproj file works properly. Maybe there is an issue with the Q8_0 file, or with the way the Vulkan backend handles the mixed-precision format. Other Q8_0 mmproj files for other models (e.g. mmproj-InternVL3-2B-Instruct-Q8_0.gguf) work properly (although this is a vision-only model). (Did not open an issue in case the problem is with the file and not the backend.)
@ngxson Just out of curiosity, how much work would it be to implement diffusion support? It seems like there are more and more models coming out with "all-to-all" capabilities (like Bagel); it would probably be nice for image generation to appear on the roadmap at some point...

@ngxson Thanks in advance!

So, will we ever have support for videos?
Complete rewrite of the Qwen backend to use Qwen2.5-Omni models via the llama.cpp CLI instead of the llama-cpp-python API.

## Why the Change?

**Qwen2-Audio is broken in llama.cpp:**
- Known to have "significant hallucination issues"
- Considered unusable by the llama.cpp community
- The Python API (create_chat_completion) doesn't support audio properly

**Qwen2.5-Omni is the solution:**
- Newer, better-quality multimodal model
- Actually works with llama.cpp
- Supports 10,000+ languages (Chinese, Arabic, Japanese, Korean, etc.)
- Uses the CLI interface (llama-mtmd-cli), which has proper audio support

## Implementation

**New Approach: CLI via Subprocess**
- Uses `llama-mtmd-cli` (multimodal CLI) instead of the Python API
- Passes audio files via the `--audio` flag
- Parses stdout for transcription results
- Timeout protection (120s)

**Model Updates:**
- qwen2.5-omni-7b-q4: 4.8GB - Fast, recommended
- qwen2.5-omni-7b-q6: 6.4GB - Very good quality
- qwen2.5-omni-7b-q8: 8.2GB - Best quality
- qwen2.5-omni-3b-q4: 2.5GB - Smaller, faster

**Requirements:**
- llama.cpp must be compiled with the llama-mtmd-cli binary
- Binary must be in PATH or ~/.local/bin/
- Checks for binary availability during load()

## Technical Details

**Audio Processing:**
1. Convert PCM bytes to a WAV file (llama-mtmd-cli needs WAV/MP3)
2. Call llama-mtmd-cli with the audio file path
3. Parse output, clean llama.cpp artifacts
4. Return transcription

**GPU Support:**
- Uses `-ngl 999` for CUDA (all layers on GPU)
- Uses `-ngl 0` for CPU
- Respects the device setting from the coordinator

**Error Handling:**
- Checks for llama-mtmd-cli availability
- Verifies model files exist
- Timeout protection for long audio
- Clean error messages guide the user

## Benefits
- ✅ Actually works (unlike Qwen2-Audio)
- ✅ 10,000+ language support
- ✅ Better transcription quality
- ✅ Proper multimodal audio handling
- ✅ No Python API limitations

## Known Limitations
- Requires a llama.cpp installation (not automatic)
- Slower than the Python API (subprocess overhead)
- CLI output parsing may be fragile
- No streaming support

References:
- ggml-org/llama.cpp#13759
- ggml-org/llama.cpp#13784
- https://huggingface.co/mradermacher/Qwen2.5-Omni-7B-GGUF

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This PR aimed to just add the capability to use 2 `clip_ctx`, one for audio and one for vision. But I ended up doing quite a bit more refactoring than I initially thought:

- Rename `clip_ctx` to `clip_model`
- With the new `clip_model_loader`, we are now able to create multiple `clip_model` and `clip_ctx`
- `libmtmd` can handle 2 `clip_ctx` and switches the calls according to the chunk type
- Refactor `mtmd_tokenize` so it can handle mixed modality
- `clip.cpp` has many `if..else` branches; we can refactor them to `switch (...)`
- `SinusoidsPositionEmbedding` used by the audio encoder --> generate it during conversion

TODO in next PRs:
- `mtmd-image.cpp`
- `mtmd-helper.cpp`

Why no audio output?
The simple answer is: I don't have time to implement it.
The long answer: Qwen 2.5 Omni generates audio using 2 steps:
So adding audio generation is indeed adding image generation capability, which I don't really have time to do right now.
Demo
Pre-quantized models: