mtmd, llama: add GLM4V vision-language model support #17967
eelbaz wants to merge 1 commit into ggml-org:master
Conversation
I have a feeling you didn't mean to submit this here?
ngxson left a comment:

I read the clip.cpp changes quickly, but I think the solution is not clean enough. I'd appreciate it if you could simplify it.
tools/mtmd/clip.cpp (outdated)
```cpp
 * 1. Normalize image (already done by caller)
 * 2. Convert to channel-first [C, H, W]
 * 3. Duplicate temporal dimension [2, C, H, W]
 * 4. Reshape with merge_size=2: [grid_t, temporal, channels, merge_h, merge_size, patch_h, merge_w, merge_size, patch_w]
```
I'm not convinced that this procedure cannot be done using GGML ops:
- converting to channel-first: `ggml_permute`
- duplicating the dim: `ggml_repeat_4d`, although I'm pretty sure this is redundant since GGML supports broadcasting internally
- reshape: `ggml_reshape`
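For what it's worth, a rough sketch of how those steps could map onto GGML ops (the function name, the `[W, H, C, 1]` input layout, and the divisibility of `W`/`H` by `patch_size` are assumptions; the final merge-size regrouping from the quoted comment is elided):

```cpp
#include "ggml.h"

// Sketch only: preprocessing steps expressed as GGML graph ops.
static ggml_tensor * build_patches_sketch(ggml_context * ctx, ggml_tensor * img, int patch_size) {
    const int64_t W = img->ne[0], H = img->ne[1], C = img->ne[2];

    // channel-first: move C to ne[0] -> [C, W, H, 1]; ggml_permute only
    // rewrites strides, so make the view contiguous before reshaping
    ggml_tensor * x = ggml_cont(ctx, ggml_permute(ctx, img, 1, 2, 0, 3));

    // duplicate the temporal dimension by broadcasting -> [C, W, H, 2]
    x = ggml_repeat_4d(ctx, x, C, W, H, 2);

    // flatten to per-patch rows -> [C*patch*patch, n_patches*2]
    // (the merge_size=2 regrouping is omitted in this sketch)
    const int64_t n_patches = (W / patch_size) * (H / patch_size);
    return ggml_reshape_2d(ctx, x, C * patch_size * patch_size, n_patches * 2);
}
```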
Appreciate the pointer! Updated to follow the Qwen2VL dual conv2d pattern.
Instead of:
raw_image → permute → repeat_4d → reshape → split → conv2d(k0, frame0) + conv2d(k1, frame1)
Now:
raw_image → conv2d(k0, raw) + conv2d(k1, raw)
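In ggml that pattern is roughly the following (a sketch modeled on the existing Qwen2VL path in clip.cpp; tensor names are assumptions, and the Conv3D weight is assumed to be split into two Conv2D kernels at conversion time):

```cpp
// Sketch: dual Conv2D patch embedding. k0/k1 are the two temporal slices of
// the original Conv3D weight, each [patch, patch, C, n_embd]; raw is [W, H, C, 1].
static ggml_tensor * patch_embed_sketch(
        ggml_context * ctx, ggml_tensor * raw,
        ggml_tensor * k0, ggml_tensor * k1, int patch_size) {
    ggml_tensor * e0 = ggml_conv_2d(ctx, k0, raw, patch_size, patch_size, 0, 0, 1, 1);
    ggml_tensor * e1 = ggml_conv_2d(ctx, k1, raw, patch_size, patch_size, 0, 0, 1, 1);
    // both temporal "frames" are copies of the same image, so summing the two
    // convolutions reproduces the Conv3D output without materializing the frames
    return ggml_add(ctx, e0, e1);
}
```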
tools/mtmd/clip.cpp (outdated)
```cpp
// - Extract frame_1 = patches[:,:,1,:,:] [576, 3, 14, 14]
// - Conv2d(kernel_0, frame_0) + Conv2d(kernel_1, frame_1) -> [576, 1536]

const int patch_features = 3 * 2 * 14 * 14; // 1176
```
What are 3, 2, and 14? Any reason not to read this info from hparams?
Thanks @ngxson, removed the hard-coded values left over from the HF implementation debugging session. It now pulls from hparams:
- patch_size → hparams.patch_size
- spatial_merge_size → hparams.n_merge
- image_size → hparams.image_size
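Roughly (a sketch; the channel count 3 and temporal depth 2 remain structural constants rather than hparams):

```cpp
// Values now come from hparams (field names as listed above).
const int patch_size     = hparams.patch_size;  // was hard-coded 14
const int n_merge        = hparams.n_merge;     // spatial merge size, was hard-coded 2
const int patch_features = 3 * 2 * patch_size * patch_size; // C * temporal * ph * pw
```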
tools/mtmd/clip.cpp (outdated)
```cpp
cb(frame0, "frame0", -1);
cb(frame1, "frame1", -1);

// Apply Conv2d to each frame (simulates Conv3d with temporal kernel split)
```
I feel like this logic is the same as in the Qwen2/Qwen3 models.
Also, just a reminder that any AI usage must be explicitly stated in the PR description.
tools/mtmd/clip.cpp (outdated)
```cpp
// HF pattern: [h, w, h, w] at chunk_size granularity
// Chunk 0 (dims 0-31): h position
// Chunk 1 (dims 32-63): w position
// Chunk 2 (dims 64-95): h position
// Chunk 3 (dims 96-127): w position
```
If you are referring to this code, then your implementation is likely incorrect.
Indeed, I wouldn't be surprised if >50% of GLM-V's code is just Qwen-VL with different naming. The (almost) same code can be found in modeling_qwen2_vl.py.
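For reference, a sketch of the four position streams that `ggml_rope_multi` consumes for the vision tower, modeled on the Qwen2VL path in clip.cpp (the function name, stream-major layout, and grid handling here are assumptions):

```cpp
#include <cstdint>
#include <vector>

// Build the four position streams for a grid_h x grid_w patch grid.
// Layout is stream-major; the streams carry h, w, h, w, matching the
// [h,w,h,w] chunk pattern quoted above.
static std::vector<int32_t> build_mrope_pos(int grid_h, int grid_w) {
    const int n_patches = grid_h * grid_w;
    std::vector<int32_t> pos(4 * n_patches);
    for (int y = 0; y < grid_h; ++y) {
        for (int x = 0; x < grid_w; ++x) {
            const int i = y * grid_w + x;
            pos[i + 0 * n_patches] = y; // chunk 0: h
            pos[i + 1 * n_patches] = x; // chunk 1: w
            pos[i + 2 * n_patches] = y; // chunk 2: h
            pos[i + 3 * n_patches] = x; // chunk 3: w
        }
    }
    return pos;
}
```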
Reminding again because some of the code looks too suspicious compared to #16600.
I refuse to review newer commits if this remains unclear.
Add complete GLM4V (GLM-4.6V-Flash) support including:

**Vision Encoder (mtmd):**
- Dual Conv2D patch embedding (simulating Conv3D temporal reduction)
- M-RoPE using ggml_rope_multi() with [h,w,h,w] position pattern
- 2x2 patch merger with downsample convolution
- SwiGLU-based merger FFN

**LLM Architecture (libllama):**
- New LLM_ARCH_GLM4V based on GLM4 with M-RoPE
- Uses LLAMA_ROPE_TYPE_MROPE with rope_sections from model config
- Reuses ggml_rope_multi() (same as Qwen2VL) for position encoding

Key design decisions:
- Vision encoder uses ggml_rope_multi() instead of custom RoPE
- LLM follows GLM4 structure with M-RoPE (not Qwen2VL structure)
- Minimal code: ~300 lines total across all files

Tested with GLM-4.6V-Flash producing correct image descriptions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Sorry to bother you, I just wanted to try this PR. I'm trying to convert an mmproj file, but it keeps giving me an error message. What should I do?
lol, I think the author of this PR doesn't actually know what he's doing, but only copy-pastes my comments into claude code. I'm closing this PR as it is too noisy. I'll have a look a bit later, because it's not that complicated.
Summary
Add complete GLM4V (GLM-4.6V-Flash) vision-language model support.
This PR has been completely rebased and rewritten based on maintainer feedback from @ngxson. Key changes from the original submission:
- Uses `ggml_rope_multi()` (same as Qwen2VL)

Changes
Vision Encoder (`tools/mtmd/`):
- M-RoPE using `ggml_rope_multi()` with `[h,w,h,w]` position pattern

LLM Architecture (`src/`):
- New `LLM_ARCH_GLM4V` based on GLM4 structure with M-RoPE
- Uses `LLAMA_ROPE_TYPE_MROPE` with `rope_sections` from model config (see the sketch below)
- Reuses `ggml_rope_multi()` for position encoding (shared with Qwen2VL)
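As a sketch, the `rope_sections` wiring on the LLM side could look like the existing Qwen2VL build in llama.cpp (the section split and surrounding variable names are placeholders, not GLM4V's verified config):

```cpp
// Sketch: inside the GLM4V graph build, following the Qwen2VL pattern.
// Qcur/inp_pos/n_rot/freq_* come from the surrounding build function;
// the section split {16, 24, 24, 0} is a placeholder, not GLM4V's config.
int sections[4] = { 16, 24, 24, 0 }; // per-axis rope dims: t, h, w, (unused)
Qcur = ggml_rope_multi(ctx0, Qcur, inp_pos, nullptr,
        n_rot, sections, GGML_ROPE_TYPE_MROPE, // == LLAMA_ROPE_TYPE_MROPE
        n_ctx_orig, freq_base, freq_scale,
        ext_factor, attn_factor, beta_fast, beta_slow);
```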
Files Changed

- `src/llama-arch.h` (new `LLM_ARCH_GLM4V` enum)
- `src/llama-arch.cpp`
- `src/models/glm4v.cpp`
- `src/models/models.h`
- `src/llama-model.cpp`
- `src/CMakeLists.txt`
- `tools/mtmd/models/glm4v.cpp`
- `tools/mtmd/clip.cpp`
- `tools/mtmd/clip-impl.h`
- `tools/mtmd/clip-model.h`
- `tools/mtmd/models/models.h`
- `tools/mtmd/CMakeLists.txt`

Testing
Output correctly describes image content.
Design Notes

- Vision encoder: M-RoPE via `ggml_rope_multi` with `GGML_ROPE_TYPE_VISION`
- LLM: `ggml_rope_multi` with `LLAMA_ROPE_TYPE_MROPE` (same as Qwen2VL)

Addresses feedback from #17967 review.