model: add Falcon OCR support #21045
Add support for the Falcon OCR model architecture, a decoder-only VLM designed for document OCR.
Key features:
- 3D RoPE: 1D temporal + 2D golden-ratio spatial positioning for image patches
- Attention sinks: learnable per-layer sink vectors prepended to KV
- Squared ReLU gated FFN with sqrt(2) gate scaling
- QK layer normalization before attention
- Conv2D vision projector (patchification + linear projection)
- Combined non-causal batch for prefix tokens + image patches + suffix token
- Two-step image preprocessing: aspect-ratio-preserving resize + patch alignment
Components:
- GGUF conversion: split fused wqkv→Q/K/V, split w13→gate/up with sqrt(2) scaling
- LLM graph builder in src/models/falcon_ocr.cpp
- Vision projector in tools/mtmd/models/falcon_ocr.cpp
- Image preprocessor in tools/mtmd/mtmd-image.cpp
- Multimodal helper logic in tools/mtmd/mtmd-helper.cpp
- New public API: llama_model_token_to_embd() for embedding lookup
|
Hi @avirajBevli, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
| // whether the current model uses spatial 3D RoPE (temporal + 2D continuous spatial positions) | ||
| MTMD_API bool mtmd_decode_use_spatial_3d_rope(mtmd_context * ctx); |
There was a problem hiding this comment.
No, this is not standard mrope.
Although our model reuses the M-RoPE 4D position layout for convenience, our spatial_3d_rope is different.
- The spatial rotation uses learned per-head frequencies (stored as a rope_freqs_golden tensor) rather than the fixed geometric schedule.
- The first half of head_dim gets standard 1D RoPE via ggml_rope_ext, and the second half gets a custom 2D rotation with the learned frequencies.
This is why it can't go through the existing M-RoPE code path.
There was a problem hiding this comment.
geometric schedule
wow, new tech, what is that?
The first half of head_dim gets standard 1D RoPE via ggml_rope_ext, and the second half gets a custom 2D rotation with the learned frequencies.
it is M-RoPE...
There was a problem hiding this comment.
Not exactly. The following differences make it difficult for me to use the existing M-RoPE infrastructure in the llama.cpp codebase:
- The spatial frequencies are learned rather than calculated with the standard theta = base^(-2i/d) formula. To use the ggml_rope_multi function, I would have to pass the frequencies tensor to it, which is not currently supported. Also, freq_factors only divides the geometric theta; it doesn't replace it.
- The learned spatial frequencies are different for each head.
- M-RoPE uses sections[] to assign each dim pair to one position axis, so each section operates on its position independently. Our model differs because we compute theta = freq_h * pos_h + freq_w * pos_w, i.e. each dim pair sees a weighted combination of both h and w.
- The ggml_rope_multi function expects input positions to be GGML_TYPE_I32. However, our spatial positions are continuous floating-point values because we normalize the coordinates.
It is possible that there are gaps in my understanding. Please let me know if that is the case!
Happy to discuss if there's a way to restructure this that I'm not seeing.
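To make the third difference concrete, here is a minimal C++ sketch of the two angle computations as they are described in this comment. It is an illustration only, not code from the PR or from ggml: the function names, the `n_h_pairs` split, and the single `freq` parameter are assumptions made for the example.

```cpp
#include <cstddef>

// Hypothetical helper: section-style assignment as described above, where a
// dim pair belongs to exactly one section and sees only that section's position.
// Pairs [0, n_h_pairs) use pos_h; the remaining pairs use pos_w.
float theta_sectioned(std::size_t pair_idx, std::size_t n_h_pairs,
                      float freq, float pos_h, float pos_w) {
    const float pos = (pair_idx < n_h_pairs) ? pos_h : pos_w;
    return freq * pos;
}

// Hypothetical helper: the combined angle described for Falcon OCR, where
// every dim pair mixes both axes with its own learned frequencies.
float theta_combined(float freq_h, float freq_w, float pos_h, float pos_w) {
    return freq_h * pos_h + freq_w * pos_w;
}
```

In the sectioned form a pair's angle depends on one axis only; in the combined form both `pos_h` and `pos_w` contribute to every pair.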
There was a problem hiding this comment.
@ngxson the model is released
Please refer to the following links for more details:
https://huggingface.co/tiiuae/Falcon-OCR
https://github.com/tiiuae/falcon-perception
There was a problem hiding this comment.
@avirajBevli I had a deeper look into your impl and I kinda disagree with what you said above. Would be appreciated if you can confirm what I'm saying here is correct:
Also, freq_factors only divides the geometric theta; it doesn't replace it.
Yes, it can. Currently, you calculate the theta by:
ggml_tensor * theta = ggml_mul_mat(ctx0, freqs_flat, pos_hw);
// pos_hw is the scaled position data
If you look into rope.cu, theta is calculated in GGML by:
const float theta_base = pos[i2]*powf(theta_scale, i0/2.0f);
const float freq_factor = has_ff ? freq_factors[i0/2] : 1.0f;
// final theta = theta_base/freq_factor
So what that means is that if theta_scale (aka freq_base) is 1.0f, then powf(theta_scale, i0/2.0f) will also be 1.0f; you can then provide the inverse of your desired freq via freq_factors.
And you don't even need to pre-scale the position value via POS_SCALE_INV = 1.0f / 1000000.0f; simply pre-scale your freq_factors.
And since the mrope kernel works the same way, you can just reuse mrope with freq_base = 1.0f and your custom freq_factors.
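The trick above can be checked with a small scalar sketch. `rope_theta` mirrors only the two quoted lines for a single position; it is not the actual kernel, and `freq_factors_from_learned` is a hypothetical helper that assumes every learned frequency is non-zero.

```cpp
#include <cmath>
#include <vector>

// Scalar mirror of the two lines quoted from rope.cu above.
// An empty freq_factors vector stands in for has_ff == false.
float rope_theta(float pos, float theta_scale, int i0,
                 const std::vector<float> & freq_factors) {
    const float theta_base  = pos * std::pow(theta_scale, i0 / 2.0f);
    const float freq_factor = freq_factors.empty() ? 1.0f : freq_factors[i0 / 2];
    return theta_base / freq_factor;
}

// Hypothetical conversion: store 1/freq so that dividing by freq_factor
// multiplies by the learned frequency. Assumes every frequency is non-zero.
std::vector<float> freq_factors_from_learned(const std::vector<float> & freqs) {
    std::vector<float> ff;
    ff.reserve(freqs.size());
    for (float f : freqs) {
        ff.push_back(1.0f / f);
    }
    return ff;
}
```

With `theta_scale = 1.0f`, the result reduces to `pos * freqs[i0/2]`, so the geometric schedule drops out entirely. Any position scale could be folded into the factors the same way.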
There was a problem hiding this comment.
MRope uses sections[] to assign each dim pair to one position. Here each section operates on position independently. In our model, this is different because we compute theta = freq_h * pos_h + freq_w * pos_w. i.e. In our model, each dim pair sees a weighted combination of both h and w
Would it be possible to first rotate by theta0 = freq_h * pos_h, then rotate again by theta1 = freq_w * pos_w?
In other words, that means calling ggml_rope_ext twice, something like:
x = ggml_rope_ext(x, pos_h, freqs_h);
x = ggml_rope_ext(x, pos_w, freqs_w);
There was a problem hiding this comment.
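@ngxson thank you for taking a deeper look.
Would it be possible to first rotate by theta0 = freq_h * pos_h, then rotate again by theta1 = freq_w * pos_w?
This should be doable: two successive rotations of the same dim pair add their angles, so the result is mathematically equivalent to a single rotation by theta0 + theta1.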
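The equivalence can be checked numerically. The sketch below rotates a single dim pair, with `vec2` and `rotate` as illustrative names rather than ggml API. It assumes both rope calls act on the same dim pairs.

```cpp
#include <cmath>

// One dim pair (x, y) of a head vector.
struct vec2 {
    float x;
    float y;
};

// Standard 2D rotation by angle theta, as RoPE applies to each dim pair.
vec2 rotate(vec2 v, float theta) {
    const float c = std::cos(theta);
    const float s = std::sin(theta);
    return { v.x * c - v.y * s, v.x * s + v.y * c };
}
```

Rotating by theta0 and then by theta1 matches one rotation by theta0 + theta1 up to floating-point rounding, because planar rotations compose by adding angles.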
So what that means is that if theta_scale (aka freq_base) is 1.0f, then powf(theta_scale, i0/2.0f) will also be 1.0f; you can then provide the inverse of your desired freq via freq_factors
This simple trick lets me provide my learned freqs via freq_factors. This works!
Thanks for both insights!
So, as I understand it now, the only blocker to directly using the existing rope infra in llama.cpp for the Falcon OCR model is that our model has a separate learned frequency for each head.
In falcon_ocr.cpp, the rope_freqs_golden tensor is [2, head_dim/4, n_head]: each head has its own unique set of learned frequencies.
The current rope infrastructure in llama.cpp, by contrast, expects the same freq_factor for all heads. This is because in rope.cu:
const float freq_factor = has_ff ? freq_factors[i0/2] : 1.0f;
freq_factors is indexed only by dimension pair (i0/2), with no head dimension (i1) — so the same value must apply to all heads
Can you please confirm this?
There was a problem hiding this comment.
While the current rope infrastructure in llama.cpp expects the same freq_factor for all heads. This is because in rope.cu
const float freq_factor = has_ff ? freq_factors[i0/2] : 1.0f;
Yes that is correct. However, one more idea is that you can flatten the head, for example:
[head_dim, n_head, n_token] --> [hidden_size, 1, n_token]
Then apply ggml_rope_ext with the flattened freq_cis_golden. That should be mathematically equivalent to applying per-head frequencies.
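A rough sketch of why flattening works, under one important assumption: the rotation pairs adjacent elements (x[2p], x[2p+1]). With a pairing that spans half the row, flattening would change which elements are paired. All names below are illustrative, and the sketch uses plain multiplicative frequencies instead of ggml's freq_factors.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Rotate adjacent pairs (x[2p], x[2p+1]) of one row by pos * freqs[p].
void rope_row(float * x, std::size_t n, float pos, const float * freqs) {
    for (std::size_t p = 0; p < n / 2; ++p) {
        const float theta = pos * freqs[p];
        const float c  = std::cos(theta);
        const float s  = std::sin(theta);
        const float x0 = x[2 * p];
        const float x1 = x[2 * p + 1];
        x[2 * p]     = x0 * c - x1 * s;
        x[2 * p + 1] = x0 * s + x1 * c;
    }
}

// Per-head view: x is [n_head][head_dim]; head h uses its own slice of freqs.
std::vector<float> rope_per_head(std::vector<float> x, std::size_t head_dim,
                                 std::size_t n_head, float pos,
                                 const std::vector<float> & freqs) {
    for (std::size_t h = 0; h < n_head; ++h) {
        rope_row(x.data() + h * head_dim, head_dim, pos,
                 freqs.data() + h * (head_dim / 2));
    }
    return x;
}

// Flattened view: one row of hidden_size = n_head * head_dim elements, with
// the per-head frequencies concatenated into a single vector.
std::vector<float> rope_flat(std::vector<float> x, float pos,
                             const std::vector<float> & freqs) {
    rope_row(x.data(), x.size(), pos, freqs.data());
    return x;
}
```

Both functions visit the same pairs with the same frequencies, so they produce identical output. The only change is that the frequency index now runs over all heads.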
I'm working on the implementation of falcon-ocr on my fork, with some extra clean up, the link is here: ngxson#100
Something is still wrong, though, and the model generates <file_sep> tokens non-stop. So it would be appreciated if you could have a look at the mentioned PR. Feel free to push a PR to my fork (base branch xsn/falcon-ocr) if you find the fix, thanks in advance!
There was a problem hiding this comment.
Yes that is correct. However, one more idea is that you can flatten the head, for example:
[head_dim, n_head, n_token] --> [hidden_size, 1, n_token]
Then apply ggml_rope_ext with the flattened freq_cis_golden. That should be mathematically equivalent to applying per-head frequencies.
Yes, this makes sense!
I'm working on the implementation of falcon-ocr on my fork, with some extra clean up, the link is here: ngxson#100
Thank you for creating the fork. Really excited to get this going forward!
Something is still wrong, though, and the model generates <file_sep> tokens non-stop. So it would be appreciated if you could have a look at the mentioned PR. Feel free to push a PR to my fork (base branch xsn/falcon-ocr) if you find the fix, thanks in advance!
Yeah, when I run inference on your branch in your fork, I get garbage inference results. In my branch, the results are correct.
Let me find the root cause and update here when I find the fix! Thanks
I cannot find the model anywhere. An important reminder: we do NOT support closed-weight models. If this is a model that is going to be released, I honestly suggest removing the paligemma-style "interchunk_merge". Otherwise, we won't allow this much code to be added just to support a single model.
| props_dev->has_simdgroup_mm && ne00 >= 64 && ne11 > ne11_mm_min && | ||
| !(ggml_get_op_params_i32(op, 0) == GGML_PREC_F32 && | ||
| op->src[0]->type == GGML_TYPE_F32 && op->src[1]->type == GGML_TYPE_F32)) { |
There was a problem hiding this comment.
What is the reason to exclude the F32 path here? Did you observe a slowdown, or?
There was a problem hiding this comment.
We observed that our model produces gibberish on Metal. The reason is that Falcon OCR's Squared ReLU activation produces values exceeding 65504 (FP16 max), causing overflow.
This guard prevents mul_mm (which uses FP16 intermediates) from being selected when GGML_PREC_F32 is explicitly requested and both inputs are F32. With this guard, the matmul on Metal happens in FP32.
Note that the GGML_PREC_F32 flag is already set by llama-graph.cpp for affected architectures (GLM4, JAIS2, Falcon OCR), but Metal's mul_mm path was ignoring it (because it casts to FP16 intermediates).
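For a sense of scale, the following sketch shows how small a pre-activation needs to be for its squared ReLU to leave the FP16 range. It is a simplified illustration that ignores the gate and the sqrt(2) scaling; the helper names are made up for the example.

```cpp
#include <algorithm>
#include <cmath>

// Largest finite value representable in IEEE 754 half precision.
constexpr float FP16_MAX = 65504.0f;

// Squared ReLU: max(x, 0)^2.
float squared_relu(float x) {
    const float r = std::max(x, 0.0f);
    return r * r;
}

// True if v would not fit in a finite FP16 value.
bool overflows_fp16(float v) {
    return std::fabs(v) > FP16_MAX;
}
```

An input of 255 squares to 65025 and still fits, while 300 squares to 90000 and does not. Any activation above roughly 256 is therefore enough to overflow an FP16 intermediate.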
There was a problem hiding this comment.
Just to confirm, this is the multiplication that is problematic?
Lines 1135 to 1141 in d0a8568
Address reviewer feedback (ngxson): remove interchunk_merge pattern from shared mtmd-helper.cpp. Prefix tokens (cls+regs) are now stored inside the IMAGE chunk during tokenization and their embeddings are prepended during encoding in mtmd_encode(), so the standard chunk evaluator handles everything without model-specific orchestration. Removed: falcon_ocr_eval_image_with_prefix, eval_chunks_falcon_ocr, mtmd_decode_use_interchunk_merge. Net ~135 lines removed from shared code.
|
A general note: most llama.cpp maintainers already have AI platform subscriptions, and we can easily generate the same code that you are pushing here. For that reason, we place more value on discussions about technical aspects and API design. It is not productive for us to read a predominantly AI-generated PR.
There was a problem hiding this comment.
This guard prevents mul_mm (which uses FP16 intermediates) from being selected when GGML_PREC_F32 is explicitly requested and both inputs are F32. Without this, Falcon OCR's Squared ReLU activation produces values exceeding 65504 (FP16 max), causing overflow in the FFN down-projection on Metal.
Note that the GGML_PREC_F32 flag is already set by llama-graph.cpp for affected architectures (GLM4, JAIS2, Falcon OCR), but the issue is that Metal's mul_mm path was ignoring it.
Note: this is a ggml-level fix. Please let me know if this should be submitted as a separate PR to ggml, or if it's fine to include here.
|
@ngxson thanks for the feedback, I get it. Once this lands it becomes your maintenance burden, and right now it's a lot of new code / new modes for a model that hasn't been introduced yet. We will release over the next few days and will come back to you with all the required materials to find the best way to support our arch within the best practices of llama.cpp, if possible. Thanks.
…sion time (rather than runtime)
|
@ngxson the model has been released, will you look at this PR? |
|
Sorry @avirajBevli for being quite hard in the review. I've been quite busy during the gemma 4 release. I will have a look next week or the week after.
| // mtmd_image_preprocessor_falcon_ocr | ||
| // | ||
|
|
||
| bool mtmd_image_preprocessor_falcon_ocr::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) { |
There was a problem hiding this comment.
Note that this was replaced by mtmd_image_preprocessor_dyn_size on ngxson#100, because it does the same thing: resize the image to a fixed pixel budget and align to patch size.
Please let me know if my replacement is correct.
There was a problem hiding this comment.
Please note that this is not fully correct.
Falcon ocr does preprocessing in 2 steps:
- Fit image into [min_edge_dim, max_edge_dim] while preserving aspect ratio
- Fit image into [min_pixel_count, max_pixel_count] while preserving aspect ratio and align to patch size
In contrast, it seems that mtmd_image_preprocessor_dyn_size does only the second step.
Essentially, step 1 is an extra filter that clamps each dimension for images with extreme aspect ratios.
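The two steps described above can be sketched roughly as follows. This is not the preprocessor from the PR: the rounding rules, the order of the edge checks, and all names and parameters are assumptions chosen to keep the example short.

```cpp
#include <algorithm>
#include <cmath>

struct img_size {
    int w;
    int h;
};

// Scale both edges by the same factor, preserving aspect ratio.
img_size scale_size(img_size s, double f) {
    return { std::max(1, static_cast<int>(std::lround(s.w * f))),
             std::max(1, static_cast<int>(std::lround(s.h * f))) };
}

// Step 1 (assumed form): shrink if the longest edge exceeds max_edge,
// otherwise grow if the shortest edge is below min_edge.
img_size fit_edges(img_size s, int min_edge, int max_edge) {
    const int longest  = std::max(s.w, s.h);
    const int shortest = std::min(s.w, s.h);
    if (longest > max_edge) {
        return scale_size(s, static_cast<double>(max_edge) / longest);
    }
    if (shortest < min_edge) {
        return scale_size(s, static_cast<double>(min_edge) / shortest);
    }
    return s;
}

// Step 2 (assumed form): fit the pixel count into [min_px, max_px], then
// round each edge down to a multiple of the patch size (at least one patch).
img_size fit_pixels(img_size s, long long min_px, long long max_px, int patch) {
    const double px = static_cast<double>(s.w) * s.h;
    double f = 1.0;
    if (px > max_px) {
        f = std::sqrt(max_px / px);
    } else if (px < min_px) {
        f = std::sqrt(min_px / px);
    }
    s = scale_size(s, f);
    auto align = [patch](int v) { return std::max(patch, (v / patch) * patch); };
    return { align(s.w), align(s.h) };
}
```

For a 4000x1000 input with `max_edge = 2000`, step 1 yields 2000x500. Step 2 with a 250000-pixel budget and 16-pixel patches then yields 992x240. A preprocessor that runs only step 2 would skip the per-edge clamp.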
| for (int i = 0; i < n_prefix; i++) { | ||
| pos[i ] = pos_0; | ||
| pos[i + n_total ] = 0; | ||
| pos[i + n_total*2] = 0; | ||
| pos[i + n_total*3] = 0; | ||
| } | ||
|
|
||
| for (int y = 0; y < ny; y++) { | ||
| const float h_pos = (ny > 1) | ||
| ? -ylim + y * (2.0f * ylim) / (ny - 1) | ||
| : 0.0f; | ||
| for (int x = 0; x < nx; x++) { | ||
| const float w_pos = (nx > 1) | ||
| ? -xlim + x * (2.0f * xlim) / (nx - 1) | ||
| : 0.0f; | ||
| int i = n_prefix + y * nx + x; | ||
| pos[i ] = pos_0; | ||
| pos[i + n_total ] = (llama_pos)(h_pos * SPATIAL_3D_POS_SCALE); | ||
| pos[i + n_total*2] = (llama_pos)(w_pos * SPATIAL_3D_POS_SCALE); | ||
| pos[i + n_total*3] = 0; | ||
| } | ||
| } |
There was a problem hiding this comment.
Question here: pos_0 never seems to be increased, so I suspect there is a bug in this code.
If I understand correctly, if we have 3 text tokens followed by 3 image tokens, then the temporal position should be linear for the text (0, 1, 2) and then fixed for the image (4, 4, 4).
I implemented the mentioned fix on ngxson#100, inside mtmd_image_tokens_get_decoder_pos.
There was a problem hiding this comment.
Question here: pos_0 never seems to be increased, so I suspect there is a bug in this code.
I implemented it this way by design. Consider a sample prompt that is tokenized into the following tokens (in order):
<image_CLS> <image_reg1> ... <image_reg4>.... list of image patch tokens... ...
Following are the intended positions for this sequence:

Please refer to this for the temporal position calculation in HF:
https://huggingface.co/tiiuae/Falcon-Perception/blob/main/processing_falcon_perception.py#L310-L312
Please note that this is different from the temporal positions you mention in:
#21851
This might be a source of misunderstanding
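The layout in the diff above, where the prefix tokens and every patch share one temporal position, can be sketched as a standalone function. This is a restatement of the quoted loop for illustration. `build_positions` is a hypothetical name, `int32_t` stands in for llama_pos, and `xlim`, `ylim` and `pos_scale` are passed as parameters because their real values are not shown in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns 4 planes of n_total entries each: temporal, h, w, and an unused plane.
std::vector<int32_t> build_positions(int32_t pos_0, int n_prefix, int nx, int ny,
                                     float xlim, float ylim, float pos_scale) {
    const int n_total = n_prefix + nx * ny;
    std::vector<int32_t> pos(4 * static_cast<std::size_t>(n_total), 0);

    // Prefix tokens: shared temporal position, zero spatial position.
    for (int i = 0; i < n_prefix; i++) {
        pos[i] = pos_0;
    }

    // Patches: same temporal position, spatial coordinates spread evenly
    // over [-lim, +lim] and scaled to integers.
    for (int y = 0; y < ny; y++) {
        const float h_pos = (ny > 1) ? -ylim + y * (2.0f * ylim) / (ny - 1) : 0.0f;
        for (int x = 0; x < nx; x++) {
            const float w_pos = (nx > 1) ? -xlim + x * (2.0f * xlim) / (nx - 1) : 0.0f;
            const int i = n_prefix + y * nx + x;
            pos[i]               = pos_0;
            pos[i + n_total]     = static_cast<int32_t>(h_pos * pos_scale);
            pos[i + n_total * 2] = static_cast<int32_t>(w_pos * pos_scale);
        }
    }
    return pos;
}
```

With one prefix token and a 2x2 patch grid, all five entries of the temporal plane equal `pos_0`, and the h and w planes carry the corner coordinates of the grid.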
There was a problem hiding this comment.
Note that mtmd_image_tokens_get_n_pos() may also need to be modified (done on ngxson#100).
On M-RoPE models, this counts the number of temporal positions (not the number of tokens or KV slots) for a given image. For M-RoPE, an image takes max(nx,ny) positions.
For falcon-ocr, it seems that the number of positions equals n_boi + 1, where n_boi is the n_prefix in your version here, plus 1 because the whole image takes one single temporal position. Please correct me if I'm wrong here.
There was a problem hiding this comment.
This is also related to the fact that the 5 image cls + reg tokens share the same temporal position (as mentioned in the previous comment).
Please refer to this for the temporal position calculation in HF:
https://huggingface.co/tiiuae/Falcon-Perception/blob/main/processing_falcon_perception.py#L310-L312
Please note that this is different from the temporal positions you mention in:
#21851
| // Original KV head count before GQA expansion (e.g. Falcon OCR) | ||
| uint32_t n_head_kv_orig = 0; |
There was a problem hiding this comment.
We should probably assume the model has n_head == n_head_kv, as n_head_kv is mostly used by the KV cache implementation (and we do want K dim == Q dim here because of the golden rope).
I implemented the solution above on my fork, which makes the code a bit less hacky.
|
. |
Overview
Add support for the FalconOCR model, a VLM designed for document OCR.
Key features of the model:
🤗 Blogpost: https://huggingface.co/tiiuae/Falcon-OCR
📄 Paper: https://arxiv.org/pdf/2603.27365
💻 Code: https://github.com/tiiuae/falcon-perception
🎮 Playground: https://vision.falcon.aidrc.tii.ae/
Requirements