model: add Falcon OCR support by avirajBevli · Pull Request #21045 · ggml-org/llama.cpp

avirajBevli · 2026-03-26T22:07:50Z

Overview

Add support for the FalconOCR model, a VLM designed for document OCR.
Key features of the model:

3D spatial RoPE: 1D temporal + 2D golden-ratio spatial positioning for image patches
Attention sinks: learnable per-layer sink vectors prepended to KV
Squared ReLU gated FFN with sqrt(2) gate scaling
QK layer normalization before attention
Combined non-causal batch for prefix tokens + image patches + suffix token
Image preprocessing: aspect-ratio-preserving resize

🤗 Blogpost: https://huggingface.co/tiiuae/Falcon-OCR
📄 Paper: https://arxiv.org/pdf/2603.27365
💻 Code: https://github.com/tiiuae/falcon-perception
🎮 Playground: https://vision.falcon.aidrc.tii.ae/

Requirements

I have read and agree with the contributing guidelines : YES
AI usage disclosure: YES - AI (Cursor) was used in an assistive capacity for integrating the model into the llama.cpp codebase.

Add support for the Falcon OCR model architecture, a decoder-only VLM designed for document OCR.

Key features:
- 3D RoPE: 1D temporal + 2D golden-ratio spatial positioning for image patches
- Attention sinks: learnable per-layer sink vectors prepended to KV
- Squared ReLU gated FFN with sqrt(2) gate scaling
- QK layer normalization before attention
- Conv2D vision projector (patchification + linear projection)
- Combined non-causal batch for prefix tokens + image patches + suffix token
- Two-step image preprocessing: aspect-ratio-preserving resize + patch alignment

Components:
- GGUF conversion: split fused wqkv→Q/K/V, split w13→gate/up with sqrt(2) scaling
- LLM graph builder in src/models/falcon_ocr.cpp
- Vision projector in tools/mtmd/models/falcon_ocr.cpp
- Image preprocessor in tools/mtmd/mtmd-image.cpp
- Multimodal helper logic in tools/mtmd/mtmd-helper.cpp
- New public API: llama_model_token_to_embd() for embedding lookup

ggml-gh-bot · 2026-03-26T22:12:23Z

Hi @avirajBevli, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.
Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

ngxson · 2026-03-26T23:00:49Z

+// whether the current model uses spatial 3D RoPE (temporal + 2D continuous spatial positions)
+MTMD_API bool mtmd_decode_use_spatial_3d_rope(mtmd_context * ctx);


is this mrope?

No, this is not standard mrope.
Although our model reuses the M-RoPE 4D position layout for convenience, our spatial_3d_rope is different.

The spatial rotation uses learned per-head frequencies (stored as a rope_freqs_golden tensor) rather than the fixed geometric schedule.

The first half of head_dim gets standard 1D RoPE via ggml_rope_ext, and the second half gets a custom 2D rotation with the learned frequencies.

This is why it can't go through the existing M-RoPE code path.

geometric schedule

wow, new tech, what is that?

The first half of head_dim gets standard 1D RoPE via ggml_rope_ext, and the second half gets a custom 2D rotation with the learned frequencies.

it is M-RoPE...

Not exactly. The following differences make it difficult for me to use the existing MRope infrastructure in the llama.cpp codebase:

The spatial frequencies are learned rather than calculated using the standard theta_i = base^(-2i/d) formula. So if I were to use the ggml_rope_multi function, I would have to pass the frequencies tensor to the function, which is not supported currently. Also, freq_factors only divides the geometric theta; it doesn't replace it.

The learned spatial frequencies are different for each head.

MRope uses sections[] to assign each dim pair to one position, so each section operates on its position independently. Our model is different because we compute theta = freq_h * pos_h + freq_w * pos_w, i.e. each dim pair sees a weighted combination of both h and w.

The ggml_rope_multi function expects input positions to be GGML_TYPE_I32. However, our spatial positions are continuous floating-point values because we normalize the coordinates.

It is possible that there are gaps in my understanding. Please let me know if that is the case!
Happy to discuss if there's a way to restructure this that I'm not seeing.

@ngxson the model is released

Please refer to the following links for more details:
https://huggingface.co/tiiuae/Falcon-OCR
https://github.com/tiiuae/falcon-perception

@avirajBevli I had a deeper look into your impl and I kinda disagree with what you said above. Would be appreciated if you can confirm what I'm saying here is correct:

Also, freq_factors only divides the geometric theta; it doesn't replace it.

Yes, it can. Currently, you calculate the theta by:

ggml_tensor * theta = ggml_mul_mat(ctx0, freqs_flat, pos_hw); // pos_hw is the scaled position data

If you look into rope.cu, theta is calculated in GGML by:

const float theta_base = pos[i2]*powf(theta_scale, i0/2.0f);
const float freq_factor = has_ff ? freq_factors[i0/2] : 1.0f;
// final theta = theta_base/freq_factor

So what that means is that if theta_scale (aka freq_base) is 1.0f, then powf(theta_scale, i0/2.0f) will also be 1.0f; you can then provide the inverse of your desired freq via freq_factors

And you don't even need to pre-scale the position value via POS_SCALE_INV = 1.0f / 1000000.0f;, simply pre-scale your freq_factors

And since mrope kernel works the same way, you can just reuse mrope with freq_base = 1.0f and your custom freq_factors
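
The trick described above can be checked on the CPU with a minimal sketch. It mirrors only the theta formula quoted from rope.cu; the helper names `rope_theta` and `freqs_to_factors` are hypothetical and are not part of ggml. It assumes all learned frequencies are non-zero.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Mirrors the theta formula quoted from rope.cu:
//   theta = pos * theta_scale^(i0/2) / freq_factor
// where i0 is the even index of a dimension pair.
static float rope_theta(float pos, float theta_scale, int i0,
                        const std::vector<float> & freq_factors) {
    const float theta_base  = pos * std::pow(theta_scale, i0 / 2.0f);
    const float freq_factor = freq_factors.empty() ? 1.0f : freq_factors[i0 / 2];
    return theta_base / freq_factor;
}

// Hypothetical helper: encode learned frequencies as freq_factors by
// inverting them, so that dividing by the factor multiplies by the frequency.
static std::vector<float> freqs_to_factors(const std::vector<float> & learned_freqs) {
    std::vector<float> factors;
    factors.reserve(learned_freqs.size());
    for (float f : learned_freqs) {
        factors.push_back(1.0f / f); // assumes f != 0
    }
    return factors;
}
```

With theta_scale = 1.0f the geometric term is always 1, so `rope_theta` reduces to pos * learned_freq for every dimension pair. Any constant position scale could be folded into the factors in the same way.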

MRope uses sections[] to assign each dim pair to one position, so each section operates on its position independently. Our model is different because we compute theta = freq_h * pos_h + freq_w * pos_w, i.e. each dim pair sees a weighted combination of both h and w.

Would it be possible to first rotate by theta0 = freq_h * pos_h, then rotate again by theta1 = freq_w * pos_w?

In other words, that means calling ggml_rope_ext twice, something like:

x = ggml_rope_ext(x, pos_h, freqs_h);
x = ggml_rope_ext(x, pos_w, freqs_w);
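
The idea rests on the fact that rotations of the same 2D pair compose by adding their angles. A minimal sketch of that property on a single dimension pair follows; the type `pair2` and the function names are hypothetical, and the code does not call ggml.

```cpp
#include <cassert>
#include <cmath>

// One RoPE dimension pair.
struct pair2 {
    float x0;
    float x1;
};

// Rotate the pair by angle theta (counter-clockwise).
static pair2 rotate(pair2 v, float theta) {
    const float c = std::cos(theta);
    const float s = std::sin(theta);
    return { v.x0 * c - v.x1 * s, v.x0 * s + v.x1 * c };
}

// Single rotation by the combined angle freq_h*pos_h + freq_w*pos_w.
static pair2 rotate_combined(pair2 v, float freq_h, float pos_h, float freq_w, float pos_w) {
    return rotate(v, freq_h * pos_h + freq_w * pos_w);
}

// Two successive rotations: first by freq_h*pos_h, then by freq_w*pos_w.
static pair2 rotate_twice(pair2 v, float freq_h, float pos_h, float freq_w, float pos_w) {
    return rotate(rotate(v, freq_h * pos_h), freq_w * pos_w);
}
```

Both functions give the same result up to floating-point rounding, provided both rotations act on the same pairing of elements.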

@ngxson thank you for taking a deeper look.

Would it be possible to first rotate by theta0 = freq_h * pos_h, then rotate again by theta1 = freq_w * pos_w?

This should be doable; it is mathematically equivalent.

So what that means is that if theta_scale (aka freq_base) is 1.0f, then powf(theta_scale, i0/2.0f) will also be 1.0f; you can then provide the inverse of your desired freq via freq_factors

This simple trick lets me provide my learned freqs via freq_factors. This works!

Thanks for both the insights!

So, as I understand it now, the only remaining blocker to directly using the existing rope infra in llama.cpp for the Falcon OCR model is that our model has separate learned frequencies for each head.
In falcon_ocr.cpp, the rope_freqs_golden tensor is [2, head_dim/4, n_head]: each head has its own unique set of learned frequencies.

The current rope infrastructure in llama.cpp, by contrast, expects the same freq_factor for all heads, because in rope.cu:
const float freq_factor = has_ff ? freq_factors[i0/2] : 1.0f;

freq_factors is indexed only by dimension pair (i0/2), with no head index (i1), so the same value must apply to all heads.

Can you please confirm this?

While the current rope infrastructure in llama.cpp expects the same freq_factor for all heads. This is because in rope.cu
const float freq_factor = has_ff ? freq_factors[i0/2] : 1.0f;

Yes that is correct. However, one more idea is that you can flatten the head, for example:

[head_dim, n_head, n_token] --> [hidden_size, 1, n_token]

Then apply ggml_rope_ext with the flattened freq_cis_golden. That should be mathematically equivalent to applying per-head frequencies.
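
The indexing behind this idea can be sketched as follows. The type and function names are hypothetical. The sketch assumes the adjacent-pair rope layout, where elements 2j and 2j+1 form a pair, so no pair crosses a head boundary after flattening. With NeoX-style pairing (element i paired with i + n/2), a flattened row would pair elements from different heads, so that case would need separate handling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// freqs[h][j] is the learned frequency of dimension pair j in head h.
using head_freqs = std::vector<std::vector<float>>;

// Concatenate the per-head tables into one vector indexed by the pair index
// over the flattened hidden dimension (hidden_size = n_head * head_dim).
static std::vector<float> flatten_head_freqs(const head_freqs & freqs) {
    std::vector<float> flat;
    for (const auto & head : freqs) {
        flat.insert(flat.end(), head.begin(), head.end());
    }
    return flat;
}

// Lookup as done for freq_factors in the quoted kernel line: index by i0/2,
// where i0 is the even element index within the flattened row.
static float flat_lookup(const std::vector<float> & flat, std::size_t i0) {
    return flat[i0 / 2];
}
```

For element i0 = h * head_dim + 2 * j of the flattened row, `flat_lookup` returns freqs[h][j], which is the per-head frequency the model expects.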

I'm working on the implementation of falcon-ocr on my fork, with some extra clean up, the link is here: ngxson#100

Something is still wrong, and the model generates <file_sep> tokens non-stop. So it would be appreciated if you could take a look at the mentioned PR. Feel free to push a PR to my fork (base branch xsn/falcon-ocr) if you find the fix, thanks in advance!

Yes that is correct. However, one more idea is that you can flatten the head, for example:

[head_dim, n_head, n_token] --> [hidden_size, 1, n_token]

Then apply ggml_rope_ext with the flattened freq_cis_golden. That should be mathematically equivalent to applying per-head frequencies.

Yes, this makes sense!

I'm working on the implementation of falcon-ocr on my fork, with some extra clean up, the link is here: ngxson#100
Thank you for creating the fork. Really excited to get this going forward!

Something is still wrong, and the model generates <file_sep> tokens non-stop. So it would be appreciated if you could take a look at the mentioned PR. Feel free to push a PR to my fork (base branch xsn/falcon-ocr) if you find the fix, thanks in advance!

Yeah, when I run inference on your branch in your fork, I get garbage inference results. In my branch, the results are correct.

Let me find the root cause and update here when I find the fix! Thanks

ngxson · 2026-03-26T23:06:27Z

Add support for the FalconOCR model, a VLM designed for document OCR.

I cannot find the model anywhere. An important reminder: we do NOT support closed-weight models.

If this is a model that is going to be released, I honestly suggest removing the paligemma-style "interchunk_merge". Otherwise, we won't allow this much code to be added just to support a single model.

ggerganov · 2026-03-27T06:53:07Z

+        props_dev->has_simdgroup_mm && ne00 >= 64 && ne11 > ne11_mm_min &&
+        !(ggml_get_op_params_i32(op, 0) == GGML_PREC_F32 &&
+          op->src[0]->type == GGML_TYPE_F32 && op->src[1]->type == GGML_TYPE_F32)) {


What is the reason to exclude the F32 path here? Did you observe a slowdown, or?

We observed that our model produces gibberish results on Metal. The reason is that Falcon OCR's Squared ReLU activation produces values exceeding 65504 (FP16 max), causing overflow.

This guard prevents mul_mm (which uses FP16 intermediates) from being selected when GGML_PREC_F32 is explicitly requested and both inputs are F32. With this guard, the matmul on Metal happens in FP32.

Note that the GGML_PREC_F32 flag is already set by llama-graph.cpp for affected architectures (GLM4, JAIS2, Falcon OCR), but the issue is that Metal's mul_mm path was ignoring it (because there is casting to FP16 intermediates).
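
A minimal sketch of why squaring pushes activations out of FP16 range so easily. The function names are hypothetical, and the range check is simplified to a comparison against the largest finite half-precision value rather than an actual FP16 conversion.

```cpp
#include <cassert>
#include <cmath>

// Largest finite IEEE 754 half-precision (FP16) value.
constexpr float FP16_MAX = 65504.0f;

// Squared ReLU: max(x, 0)^2.
static float squared_relu(float x) {
    const float r = x > 0.0f ? x : 0.0f;
    return r * r;
}

// Simplified check: true if the magnitude is beyond the FP16 finite range.
static bool exceeds_fp16_range(float v) {
    return std::fabs(v) > FP16_MAX;
}
```

Under this check, a pre-activation of 255 still fits after squaring (65025), while 256 does not (65536). Any pre-activation of 256 or more therefore cannot pass through an FP16 intermediate unchanged.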

Just to confirm, this is the multiplication that is problematic?

llama.cpp/src/llama-graph.cpp

Lines 1135 to 1141 in d0a8568

if (down) {
    cur = build_lora_mm(down, cur);
    if (arch == LLM_ARCH_GLM4 || arch == LLM_ARCH_GLM4_MOE || arch == LLM_ARCH_JAIS2) {
        // GLM4, GLM4_MOE, and JAIS2 seem to have numerical issues with half-precision accumulators
        ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
    }
}

Yes, this one.

Address reviewer feedback (ngxson): remove interchunk_merge pattern from shared mtmd-helper.cpp. Prefix tokens (cls+regs) are now stored inside the IMAGE chunk during tokenization and their embeddings are prepended during encoding in mtmd_encode(), so the standard chunk evaluator handles everything without model-specific orchestration. Removed: falcon_ocr_eval_image_with_prefix, eval_chunks_falcon_ocr, mtmd_decode_use_interchunk_merge. Net ~135 lines removed from shared code.

ngxson · 2026-03-27T11:22:35Z

A general note: most of llama.cpp's maintainers already have AI platform subscriptions, and we can easily generate the same code that you are pushing here.

For that reason, we value more discussions about technical aspects and API design. It is not productive for us to read a predominantly AI-generated PR.

avirajBevli · 2026-03-27T06:12:38Z

This guard prevents mul_mm (which uses FP16 intermediates) from being selected when GGML_PREC_F32 is explicitly requested and both inputs are F32. Without this, Falcon OCR's Squared ReLU activation produces values exceeding 65504 (FP16 max), causing overflow in the FFN down-projection on Metal.

Note that the GGML_PREC_F32 flag is already set by llama-graph.cpp for affected architectures (GLM4, JAIS2, Falcon OCR), but the issue is that Metal's mul_mm path was ignoring it.

Note: this is a ggml-level fix. Please let me know if this should be submitted as a separate PR to ggml, or if it's fine to include here.

YasserdahouML · 2026-03-27T13:55:16Z

@ngxson thanks for the feedback, I get it. Once this lands, it becomes your maintenance issue, and right now it's a lot of new code and new modes for a model that hasn't been introduced yet. We will release it over the next few days and will come back to you with all the required materials to find the best way to support our arch within the best practices of llama.cpp, if possible. Thanks.

YasserdahouML · 2026-04-10T16:24:07Z

@ngxson the model has been released, will you look at this PR?

ngxson · 2026-04-10T16:43:36Z

Sorry @avirajBevli for being quite hard on the review. I've been quite busy during the gemma 4 release. I will have a look next week or the week after.

ngxson · 2026-04-14T23:17:42Z

+// mtmd_image_preprocessor_falcon_ocr
+//
+
+bool mtmd_image_preprocessor_falcon_ocr::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {


Note that this was replaced by mtmd_image_preprocessor_dyn_size on ngxson#100 , because it does the same thing: resize the image to a fixed pixel budget and align to patch size

Please let me know if my replacement is correct.

Please note that this is not fully correct.

Falcon OCR does preprocessing in 2 steps:

Step 1: Fit the image into [min_edge_dim, max_edge_dim] while preserving aspect ratio

Step 2: Fit the image into [min_pixel_count, max_pixel_count] while preserving aspect ratio, and align to patch size

In contrast, mtmd_image_preprocessor_dyn_size seems to do only the second step.

Essentially, step 1 is an extra filter that adds per-dimension clamping for images with extreme aspect ratios.
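
The two steps can be sketched as a target-size computation. This is only an illustration under stated assumptions: every name, the clamping order inside each step, and the round-to-nearest patch alignment are hypothetical and are not taken from the Falcon OCR preprocessor.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// All names and limits below are hypothetical, chosen for illustration only.
struct prep_params {
    double min_edge;   // smallest allowed edge length, in pixels
    double max_edge;   // largest allowed edge length, in pixels
    double min_pixels; // smallest allowed pixel count
    double max_pixels; // largest allowed pixel count
    int    patch_size; // output edges are multiples of this
};

struct image_size {
    int w;
    int h;
};

// Step 1: uniform scale so the longest edge fits max_edge, or (if it already
// does) so the shortest edge reaches min_edge.
static double edge_scale(double w, double h, const prep_params & p) {
    const double longest  = std::max(w, h);
    const double shortest = std::min(w, h);
    if (longest > p.max_edge)  return p.max_edge / longest;
    if (shortest < p.min_edge) return p.min_edge / shortest;
    return 1.0;
}

// Step 2: uniform scale so the pixel count lands in [min_pixels, max_pixels].
static double pixel_scale(double w, double h, const prep_params & p) {
    const double pixels = w * h;
    if (pixels > p.max_pixels) return std::sqrt(p.max_pixels / pixels);
    if (pixels < p.min_pixels) return std::sqrt(p.min_pixels / pixels);
    return 1.0;
}

// Round an edge to the nearest multiple of patch_size, keeping at least one patch.
static int align_to_patch(double v, int patch_size) {
    const long n = std::max(1L, std::lround(v / patch_size));
    return static_cast<int>(n) * patch_size;
}

// Apply step 1, then step 2, then patch alignment.
static image_size target_size(int w, int h, const prep_params & p) {
    double fw = w;
    double fh = h;
    const double s1 = edge_scale(fw, fh, p);
    fw *= s1;
    fh *= s1;
    const double s2 = pixel_scale(fw, fh, p);
    fw *= s2;
    fh *= s2;
    return { align_to_patch(fw, p.patch_size), align_to_patch(fh, p.patch_size) };
}
```

For a very wide image such as 4096x256 with max_edge = 2048, step 1 halves both edges before the pixel budget is considered. A preprocessor that applies only the pixel budget would skip that clamp. Rounding to the nearest patch multiple can move the final pixel count slightly past the budget, so a real implementation may prefer rounding down.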

ngxson · 2026-04-14T23:20:40Z

+        for (int i = 0; i < n_prefix; i++) {
+            pos[i            ] = pos_0;
+            pos[i + n_total  ] = 0;
+            pos[i + n_total*2] = 0;
+            pos[i + n_total*3] = 0;
+        }
+
+        for (int y = 0; y < ny; y++) {
+            const float h_pos = (ny > 1)
+                ? -ylim + y * (2.0f * ylim) / (ny - 1)
+                : 0.0f;
+            for (int x = 0; x < nx; x++) {
+                const float w_pos = (nx > 1)
+                    ? -xlim + x * (2.0f * xlim) / (nx - 1)
+                    : 0.0f;
+                int i = n_prefix + y * nx + x;
+                pos[i            ] = pos_0;
+                pos[i + n_total  ] = (llama_pos)(h_pos * SPATIAL_3D_POS_SCALE);
+                pos[i + n_total*2] = (llama_pos)(w_pos * SPATIAL_3D_POS_SCALE);
+                pos[i + n_total*3] = 0;
+            }
+        }
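
The diff above stores continuous spatial positions in an integer position buffer by scaling them. A minimal sketch of that fixed-point round trip follows. The value of `POS_SCALE` is an assumption (the inverse of the 1.0f / 1000000.0f constant quoted elsewhere in this thread), and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Assumed scale; the actual value of SPATIAL_3D_POS_SCALE is not shown in the diff.
constexpr float POS_SCALE = 1000000.0f;

// n evenly spaced coordinates in [-lim, lim]; a single 0 when n == 1,
// matching the grid computation in the diff.
static std::vector<float> grid_positions(int n, float lim) {
    std::vector<float> out(n, 0.0f);
    if (n > 1) {
        for (int i = 0; i < n; ++i) {
            out[i] = -lim + i * (2.0f * lim) / (n - 1);
        }
    }
    return out;
}

// Store a continuous position as a scaled integer (truncating cast, as in the diff).
static int32_t encode_pos(float p) {
    return static_cast<int32_t>(p * POS_SCALE);
}

// Recover the continuous position on the consumer side.
static float decode_pos(int32_t q) {
    return q / POS_SCALE;
}
```

With this scale, coordinates survive the round trip to within about 1e-6, which is the resolution of the integer encoding.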


Question here: pos_0 never seems to be increased, I suspect there should be a bug in this code.

If I understand correctly, if we have 3 text tokens followed by 3 image tokens, then the temporal should be linear for text (0, 1, 2) and then fixed for image (4, 4, 4)

I implemented the mentioned fix on ngxson#100 , inside mtmd_image_tokens_get_decoder_pos

Question here: pos_0 never seems to be increased, I suspect there should be a bug in this code.

I implemented it this way by design. Consider a sample prompt that is tokenized into the following tokens (in order):
<image_CLS> <image_reg1> ... <image_reg4>.... list of image patch tokens... ...

Following are the intended positions for this sequence:

Please refer to this for the temporal position calculation in HF:
https://huggingface.co/tiiuae/Falcon-Perception/blob/main/processing_falcon_perception.py#L310-L312

Please note that this is different from the temporal positions you mention in:
#21851

This might be a source of misunderstanding

ngxson · 2026-04-14T23:24:21Z

Note that mtmd_image_tokens_get_n_pos() may also need to be modified (done on ngxson#100).

On M-RoPE models, this counts the number of temporal positions (not the number of tokens or KV slots) for a given image. For M-RoPE, an image takes max(nx, ny) positions.

For falcon-ocr, it seems the number of positions is n_boi + 1, where n_boi is the n_prefix in your version here, plus 1 because the whole image takes one single temporal position. Please correct me if I'm wrong here.

This is also related to the fact that the 5 image cls + reg tokens share the same temporal position (as mentioned in the previous comment).

Please refer to this for the temporal position calculation in HF:
https://huggingface.co/tiiuae/Falcon-Perception/blob/main/processing_falcon_perception.py#L310-L312

Please note that this is different from the temporal positions you mention in:
#21851

ngxson · 2026-04-14T23:26:53Z

+    // Original KV head count before GQA expansion (e.g. Falcon OCR)
+    uint32_t n_head_kv_orig = 0;


probably we should assume the model to have n_head == n_head_kv as the n_head_kv is mostly used by KV cache implementation (and we do want to have K dim == Q dim here because of the golden rope)

I implemented the solution above on my fork, which makes the code a bit less hacky

Makes sense!

blap · 2026-04-24T13:01:19Z

.

avirajBevli requested review from a team, CISC and ggerganov as code owners March 26, 2026 22:07

github-actions Bot added model Model specific examples python python script changes ggml changes relating to the ggml tensor library for machine learning Apple Metal https://en.wikipedia.org/wiki/Metal_(API) labels Mar 26, 2026


refactor: calculate image registration token embeddings during conversion time (rather than runtime) (1a28c4c)

ngxson mentioned this pull request Apr 13, 2026

mtmd: add mtmd_image_tokens_get_decoder_pos() API #21851

Merged


avirajBevli mentioned this pull request Apr 17, 2026

(WIP) falcon-ocr, for discussion ngxson/llama.cpp#100

Draft
