model: add Falcon OCR support#21045

Open: avirajBevli wants to merge 3 commits into ggml-org:master from avirajBevli:falcon_ocr_clean

@avirajBevli avirajBevli commented Mar 26, 2026

Overview

Add support for the FalconOCR model, a VLM designed for document OCR.
Key features of the model:

  • 3D spatial RoPE: 1D temporal + 2D golden-ratio spatial positioning for image patches
  • Attention sinks: learnable per-layer sink vectors prepended to KV
  • Squared ReLU gated FFN with sqrt(2) gate scaling
  • QK layer normalization before attention
  • Combined non-causal batch for prefix tokens + image patches + suffix token
  • Image preprocessing: aspect-ratio-preserving resize

🤗 Blogpost: https://huggingface.co/tiiuae/Falcon-OCR
📄 Paper: https://arxiv.org/pdf/2603.27365
💻 Code: https://github.com/tiiuae/falcon-perception
🎮 Playground: https://vision.falcon.aidrc.tii.ae/

Requirements

  • I have read and agree with the contributing guidelines : YES
  • AI usage disclosure: YES - AI (Cursor) was used in an assistive capacity for integrating the model into the llama.cpp codebase.

Add support for the Falcon OCR model architecture, a decoder-only VLM
designed for document OCR. Key features:

- 3D RoPE: 1D temporal + 2D golden-ratio spatial positioning for image patches
- Attention sinks: learnable per-layer sink vectors prepended to KV
- Squared ReLU gated FFN with sqrt(2) gate scaling
- QK layer normalization before attention
- Conv2D vision projector (patchification + linear projection)
- Combined non-causal batch for prefix tokens + image patches + suffix token
- Two-step image preprocessing: aspect-ratio-preserving resize + patch alignment

Components:
- GGUF conversion: split fused wqkv→Q/K/V, split w13→gate/up with sqrt(2) scaling
- LLM graph builder in src/models/falcon_ocr.cpp
- Vision projector in tools/mtmd/models/falcon_ocr.cpp
- Image preprocessor in tools/mtmd/mtmd-image.cpp
- Multimodal helper logic in tools/mtmd/mtmd-helper.cpp
- New public API: llama_model_token_to_embd() for embedding lookup
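
The "split fused wqkv→Q/K/V" conversion step can be pictured with a minimal sketch. It assumes the fused weight is a row-major matrix whose rows are ordered [Q | K | V]; that ordering, the `Matrix` type, and the `split_wqkv` helper are illustrative assumptions, not code from this PR (the real conversion lives in the Python conversion script).

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Minimal row-major matrix (rows x cols), for illustration only.
struct Matrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<float> data;
};

struct QKV {
    Matrix q, k, v;
};

// Copy rows [begin, begin + count) of src into a new matrix.
static Matrix slice_rows(const Matrix & src, std::size_t begin, std::size_t count) {
    Matrix out;
    out.rows = count;
    out.cols = src.cols;
    const auto first = src.data.begin() + static_cast<std::ptrdiff_t>(begin * src.cols);
    const auto last  = first + static_cast<std::ptrdiff_t>(count * src.cols);
    out.data.assign(first, last);
    return out;
}

// ASSUMPTION: fused rows are laid out as [Q | K | V], with
// n_head * head_dim rows for Q and n_head_kv * head_dim rows for K and V each.
static QKV split_wqkv(const Matrix & wqkv, std::size_t n_head, std::size_t n_head_kv, std::size_t head_dim) {
    const std::size_t q_rows  = n_head * head_dim;
    const std::size_t kv_rows = n_head_kv * head_dim;
    if (wqkv.rows != q_rows + 2 * kv_rows || wqkv.data.size() != wqkv.rows * wqkv.cols) {
        throw std::invalid_argument("unexpected fused wqkv shape");
    }
    return {
        slice_rows(wqkv, 0, q_rows),
        slice_rows(wqkv, q_rows, kv_rows),
        slice_rows(wqkv, q_rows + kv_rows, kv_rows),
    };
}
```

If the checkpoint interleaves Q/K/V per head instead, the slicing would need to change accordingly.
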
@avirajBevli avirajBevli requested review from a team, CISC and ggerganov as code owners March 26, 2026 22:07
@ggml-gh-bot

ggml-gh-bot Bot commented Mar 26, 2026

Hi @avirajBevli, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@github-actions github-actions Bot added the model, examples, python, ggml, and Apple Metal labels Mar 26, 2026
Comment thread tools/mtmd/mtmd.h
Comment on lines +122 to +123
// whether the current model uses spatial 3D RoPE (temporal + 2D continuous spatial positions)
MTMD_API bool mtmd_decode_use_spatial_3d_rope(mtmd_context * ctx);

Collaborator

is this mrope?

Author

No, this is not standard mrope.
Although our model reuses the M-RoPE 4D position layout for convenience, our spatial_3d_rope is different.

  • The spatial rotation uses learned per-head frequencies (stored as a rope_freqs_golden tensor) rather than the fixed geometric schedule.
  • The first half of head_dim gets standard 1D RoPE via ggml_rope_ext, and the second half gets a custom 2D rotation with the learned frequencies.

This is why it can't go through the existing M-RoPE code path.

Collaborator

> geometric schedule

wow, new tech, what is that?

> The first half of head_dim gets standard 1D RoPE via ggml_rope_ext, and the second half gets a custom 2D rotation with the learned frequencies.

it is M-RoPE...

Author

Not exactly. The following differences make it difficult for me to use the existing M-RoPE infrastructure in the llama.cpp codebase:

  1. The spatial frequencies are learned rather than computed with the standard theta = base^(-2i/d) formula. To use ggml_rope_multi, I would have to pass the frequencies tensor to the function, which is not currently supported. Also, freq_factors only divides the geometric theta; it doesn't replace it.

  2. The learned spatial frequencies are different for each head.

  3. M-RoPE uses sections[] to assign each dim pair to one position, so each section operates on its position independently. Our model differs because we compute theta = freq_h * pos_h + freq_w * pos_w, i.e. each dim pair sees a weighted combination of both h and w.

  4. ggml_rope_multi expects input positions to be GGML_TYPE_I32. However, our spatial positions are continuous floating-point values because we normalize the coordinates.

It is possible that there are gaps in my understanding. Please let me know if that is the case!
Happy to discuss if there's a way to restructure this that I'm not seeing.

Author

@ngxson the model is released

Please refer to the following links for more details:
https://huggingface.co/tiiuae/Falcon-OCR
https://github.com/tiiuae/falcon-perception

@ngxson ngxson Apr 12, 2026

Collaborator

@avirajBevli I had a deeper look into your impl and I kinda disagree with what you said above. I would appreciate it if you could confirm that what I'm saying here is correct:

> Also, freq_factors only divides the geometric theta; it doesn't replace it.

Yes, it can. Currently, you calculate theta by:

ggml_tensor * theta = ggml_mul_mat(ctx0, freqs_flat, pos_hw);
// pos_hw is the scaled position data

If you look into rope.cu, theta is calculated in GGML by:

const float theta_base = pos[i2]*powf(theta_scale, i0/2.0f);
const float freq_factor = has_ff ? freq_factors[i0/2] : 1.0f;
// final theta = theta_base/freq_factor

So what that means is that if theta_scale (aka freq_base) is 1.0f, then powf(theta_scale, i0/2.0f) will also be 1.0f; you can then provide the inverse of your desired freq via freq_factors

You also don't need to pre-scale the position value via POS_SCALE_INV = 1.0f / 1000000.0f; simply pre-scale your freq_factors instead.

And since the M-RoPE kernel works the same way, you can just reuse M-RoPE with freq_base = 1.0f and your custom freq_factors.
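
A minimal CPU sketch of the trick described above, mirroring the theta computation quoted from rope.cu. With theta_scale = 1.0f the power term is always 1, so theta reduces to pos / freq_factor, and supplying freq_factor = 1 / (freq * pos_scale_inv) yields theta = pos * freq * pos_scale_inv. The helper names `rope_theta` and `learned_freqs_to_factors` are hypothetical, and the sketch assumes all learned frequencies are non-zero.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mirrors the theta computation quoted from rope.cu:
//   theta = pos * theta_scale^(i0/2) / freq_factor
// An empty freq_factors vector stands in for has_ff == false.
static float rope_theta(float pos, float theta_scale, int i0, const std::vector<float> & freq_factors) {
    const float theta_base  = pos * std::pow(theta_scale, i0 / 2.0f);
    const float freq_factor = freq_factors.empty() ? 1.0f : freq_factors[static_cast<std::size_t>(i0 / 2)];
    return theta_base / freq_factor;
}

// Hypothetical helper: convert learned frequencies into freq_factors,
// folding the position scale into the factors so that positions do not
// need to be pre-scaled. ASSUMPTION: every frequency is non-zero.
static std::vector<float> learned_freqs_to_factors(const std::vector<float> & freqs, float pos_scale_inv) {
    std::vector<float> factors;
    factors.reserve(freqs.size());
    for (float f : freqs) {
        factors.push_back(1.0f / (f * pos_scale_inv));
    }
    return factors;
}
```

For example, with learned frequencies {0.5, 2.0}, no position scaling, and theta_scale = 1.0f, position 3 gives theta = 1.5 for the first dim pair and theta = 6.0 for the second, which are exactly freq * pos.
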

@ngxson ngxson Apr 12, 2026

Collaborator

> M-RoPE uses sections[] to assign each dim pair to one position, so each section operates on its position independently. Our model differs because we compute theta = freq_h * pos_h + freq_w * pos_w, i.e. each dim pair sees a weighted combination of both h and w.

Would it be possible to first rotate by theta0 = freq_h * pos_h, then rotate again with theta1 = freq_w * pos_w?

In other words, that means calling ggml_rope_ext twice, something like:

x = ggml_rope_ext(x, pos_h, freqs_h);
x = ggml_rope_ext(x, pos_w, freqs_w);

Author

@ngxson thank you for taking a deeper look.

> Would it be possible to first rotate by theta0 = freq_h * pos_h, then rotate again with theta1 = freq_w * pos_w?

This should be doable: the two are mathematically equivalent.
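
The equivalence follows from the fact that 2D rotations compose by adding their angles. A small self-contained check, with hypothetical helper names and arbitrary example values:

```cpp
#include <cmath>
#include <utility>

using Pair = std::pair<float, float>;

// Rotate one (x0, x1) dim pair by angle theta, as RoPE does per pair.
static Pair rotate(Pair v, float theta) {
    const float c = std::cos(theta);
    const float s = std::sin(theta);
    return { v.first * c - v.second * s, v.first * s + v.second * c };
}

// Single rotation by the combined angle theta = freq_h * pos_h + freq_w * pos_w.
static Pair rotate_hw_combined(Pair v, float freq_h, float pos_h, float freq_w, float pos_w) {
    return rotate(v, freq_h * pos_h + freq_w * pos_w);
}

// Two successive rotations: first by freq_h * pos_h, then by freq_w * pos_w.
static Pair rotate_hw_sequential(Pair v, float freq_h, float pos_h, float freq_w, float pos_w) {
    return rotate(rotate(v, freq_h * pos_h), freq_w * pos_w);
}
```

Both functions return the same pair up to floating-point rounding, which is why two chained rotation calls can stand in for one rotation by the summed angle.
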

> So what that means is that if theta_scale (aka freq_base) is 1.0f, then powf(theta_scale, i0/2.0f) will also be 1.0f; you can then provide the inverse of your desired freq via freq_factors

This simple trick lets me provide my learned freqs via freq_factors. This works!

Thanks for both insights!

So, as I understand it now, the only blocker to directly using the existing RoPE infrastructure in llama.cpp for the Falcon OCR model is that our model has a separate learned frequency for each head.
In falcon_ocr.cpp, the rope_freqs_golden tensor is [2, head_dim/4, n_head]: each head has its own unique set of learned frequencies.

The current RoPE infrastructure in llama.cpp, by contrast, expects the same freq_factor for all heads. This is because in rope.cu:
const float freq_factor = has_ff ? freq_factors[i0/2] : 1.0f;

freq_factors is indexed only by dimension pair (i0/2), with no head index (i1), so the same value must apply to all heads.

Can you please confirm this?

Collaborator

> The current RoPE infrastructure in llama.cpp, by contrast, expects the same freq_factor for all heads. This is because in rope.cu:
> const float freq_factor = has_ff ? freq_factors[i0/2] : 1.0f;

Yes, that is correct. However, one more idea is that you can flatten the heads, for example:

[head_dim, n_head, n_token] --> [hidden_size, 1, n_token]

Then apply ggml_rope_ext with the flattened freq_cis_golden. That should be mathematically equivalent to applying per-head frequencies.
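
A reference sketch of why flattening works, under one important assumption: the rotation pairs adjacent elements (x[2p], x[2p+1]). With a NeoX-style pairing of (i, i + n_dims/2), flattening would pair elements from different heads, so this sketch does not cover that case. All function names are hypothetical, and theta is simplified to pos / freq_factor (i.e. theta_scale = 1.0f).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference rotation of adjacent pairs (x[i0], x[i0 + 1]) with
// theta = pos / freq_factors[i0 / 2], i.e. theta_scale == 1.0f.
static void rope_adjacent_pairs(float * x, std::size_t n_dims, float pos, const float * freq_factors) {
    for (std::size_t i0 = 0; i0 + 1 < n_dims; i0 += 2) {
        const float theta = pos / freq_factors[i0 / 2];
        const float c  = std::cos(theta);
        const float s  = std::sin(theta);
        const float x0 = x[i0];
        const float x1 = x[i0 + 1];
        x[i0]     = x0 * c - x1 * s;
        x[i0 + 1] = x0 * s + x1 * c;
    }
}

// Per-head application: head h uses its own slice of freq_factors,
// laid out as [n_head][head_dim / 2].
static std::vector<float> rope_per_head(std::vector<float> x, std::size_t head_dim, std::size_t n_head,
                                        float pos, const std::vector<float> & freq_factors) {
    for (std::size_t h = 0; h < n_head; h++) {
        rope_adjacent_pairs(x.data() + h * head_dim, head_dim, pos, freq_factors.data() + h * (head_dim / 2));
    }
    return x;
}

// Flattened application: treat all heads as one row of size hidden_size,
// so the pair index i0 / 2 walks through every head's frequencies in order.
static std::vector<float> rope_flattened(std::vector<float> x, float pos, const std::vector<float> & freq_factors) {
    rope_adjacent_pairs(x.data(), x.size(), pos, freq_factors.data());
    return x;
}
```

Because pair index p of head h maps to flattened pair index h * (head_dim / 2) + p, both functions visit the same (pair, frequency) combinations and produce identical output.
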

I'm working on the implementation of falcon-ocr on my fork, with some extra cleanup. The link is here: ngxson#100

Something is still wrong, though: the model generates <file_sep> tokens non-stop. I would appreciate it if you could have a look at the mentioned PR. Feel free to push a PR to my fork (base branch xsn/falcon-ocr) if you find the fix. Thanks in advance!

Author

> Yes, that is correct. However, one more idea is that you can flatten the heads, for example:
>
> [head_dim, n_head, n_token] --> [hidden_size, 1, n_token]
>
> Then apply ggml_rope_ext with the flattened freq_cis_golden. That should be mathematically equivalent to applying per-head frequencies.

Yes, this makes sense!

> I'm working on the implementation of falcon-ocr on my fork, with some extra cleanup. The link is here: ngxson#100

Thank you for creating the fork. Really excited to get this moving forward!

> Something is still wrong, though: the model generates <file_sep> tokens non-stop. I would appreciate it if you could have a look at the mentioned PR. Feel free to push a PR to my fork (base branch xsn/falcon-ocr) if you find the fix. Thanks in advance!

Yeah, when I run inference on your branch in your fork, I get garbage results. On my branch, the results are correct.

Let me find the root cause and update here when I find the fix! Thanks

Comment thread tools/mtmd/mtmd.h Outdated
@ngxson

ngxson commented Mar 26, 2026

Collaborator

> Add support for the FalconOCR model, a VLM designed for document OCR.

I cannot find the model anywhere. An important reminder: we do NOT support closed-weight models.

If this is a model to be released, I honestly suggest removing the PaliGemma-style "interchunk_merge". Otherwise, we won't allow this much code to be added just to support a single model.

Comment thread include/llama.h Outdated
Comment on lines +2146 to +2148
props_dev->has_simdgroup_mm && ne00 >= 64 && ne11 > ne11_mm_min &&
!(ggml_get_op_params_i32(op, 0) == GGML_PREC_F32 &&
op->src[0]->type == GGML_TYPE_F32 && op->src[1]->type == GGML_TYPE_F32)) {

Member

What is the reason to exclude the F32 path here? Did you observe a slowdown, or?

Author

We observed that our model produces gibberish results on Metal. The reason is that Falcon OCR's Squared ReLU activation produces values exceeding 65504 (FP16 max), causing overflow.

This guard prevents mul_mm (which uses FP16 intermediates) from being selected when GGML_PREC_F32 is explicitly requested and both inputs are F32. With this guard, the matmul on Metal happens in FP32.

Note that the GGML_PREC_F32 flag is already set by llama-graph.cpp for affected architectures (GLM4, JAIS2, Falcon OCR), but the issue is that Metal's mul_mm path was ignoring it (because there is casting to FP16 intermediates).
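
To make the overflow concrete: squaring an activation of just 256 already gives 65536, which is above the largest finite FP16 value of 65504. The sketch below is a plain C++ illustration with hypothetical helper names; it only compares magnitudes against the FP16 limit and does not emulate half-precision rounding or the Metal kernels.

```cpp
#include <algorithm>
#include <cmath>

// Largest finite value representable in IEEE 754 half precision.
constexpr float FP16_MAX = 65504.0f;

// Squared ReLU: max(x, 0)^2.
static float squared_relu(float x) {
    const float r = std::max(x, 0.0f);
    return r * r;
}

// True when the magnitude is above the largest finite FP16 value,
// i.e. it cannot be stored as a finite half-precision number without clamping.
static bool exceeds_fp16_max(float v) {
    return std::fabs(v) > FP16_MAX;
}
```

An input of 255 maps to 65025 and still fits, while 256 maps to 65536 and does not, so moderately large pre-activation values are enough to trigger the problem when intermediates are held in FP16.
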

Member

Just to confirm, this is the multiplication that is problematic?

llama.cpp/src/llama-graph.cpp

Lines 1135 to 1141 in d0a8568

if (down) {
    cur = build_lora_mm(down, cur);
    if (arch == LLM_ARCH_GLM4 || arch == LLM_ARCH_GLM4_MOE || arch == LLM_ARCH_JAIS2) {
        // GLM4, GLM4_MOE, and JAIS2 seem to have numerical issues with half-precision accumulators
        ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
    }
}

Author

yes. this one

Address reviewer feedback (ngxson): remove interchunk_merge pattern from
shared mtmd-helper.cpp. Prefix tokens (cls+regs) are now stored inside
the IMAGE chunk during tokenization and their embeddings are prepended
during encoding in mtmd_encode(), so the standard chunk evaluator
handles everything without model-specific orchestration.
Removed: falcon_ocr_eval_image_with_prefix, eval_chunks_falcon_ocr,
mtmd_decode_use_interchunk_merge. Net ~135 lines removed from shared code.
@ngxson

ngxson commented Mar 27, 2026

Collaborator

A general note: most llama.cpp maintainers already have AI platform subscriptions, and we can easily generate the same code that you are pushing here.

For that reason, we place more value on discussions about technical aspects and API design. It is not productive for us to read a predominantly AI-generated PR.

Author

This guard prevents mul_mm (which uses FP16 intermediates) from being selected when GGML_PREC_F32 is explicitly requested and both inputs are F32. Without this, Falcon OCR's Squared ReLU activation produces values exceeding 65504 (FP16 max), causing overflow in the FFN down-projection on Metal.

Note that the GGML_PREC_F32 flag is already set by llama-graph.cpp for affected architectures (GLM4, JAIS2, Falcon OCR), but the issue is that Metal's mul_mm path was ignoring it.

Note: this is a ggml-level fix. Please let me know if this should be submitted as a separate PR to ggml, or if it's fine to include here.

@YasserdahouML

@ngxson thanks for the feedback, I get it. Once this lands, it becomes your maintenance burden, and right now it's a lot of new code and new modes for a model that hasn't been introduced yet. We will release it over the next few days and will come back to you with all the required materials to find the best way to support our arch within llama.cpp's best practices, if possible. Thanks

Comment thread src/models/falcon_ocr.cpp
Comment thread tools/mtmd/mtmd.cpp Outdated
Comment thread src/models/falcon_ocr.cpp
@YasserdahouML

@ngxson the model has been released, will you look at this PR?

@ngxson

ngxson commented Apr 10, 2026

Collaborator

Sorry @avirajBevli for being quite hard on the review. I've been quite busy during the Gemma 4 release. I will have a look next week or the week after.

Comment thread tools/mtmd/mtmd-image.cpp
// mtmd_image_preprocessor_falcon_ocr
//

bool mtmd_image_preprocessor_falcon_ocr::preprocess(const clip_image_u8 & img, clip_image_f32_batch & output) {

Collaborator

Note that this was replaced by mtmd_image_preprocessor_dyn_size on ngxson#100 , because it does the same thing: resize the image to a fixed pixel budget and align to patch size

Please let me know if my replacement is correct.

Author

Please note that this is not fully correct.

Falcon OCR does preprocessing in 2 steps:

  1. Fit the image into [min_edge_dim, max_edge_dim] while preserving aspect ratio
  2. Fit the image into [min_pixel_count, max_pixel_count] while preserving aspect ratio, and align to the patch size

In contrast, it seems that mtmd_image_preprocessor_dyn_size does only the second step.

Essentially, step 1 is an extra filter that adds per-dimension clamping for images with extreme aspect ratios.
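
A rough sketch of the two steps as size arithmetic only (no pixel resampling). The rounding rules, the order in which the max and min edge limits are checked, the floor-based patch alignment, and all names are assumptions made for illustration; they are not taken from the Falcon OCR reference code or from this PR.

```cpp
#include <algorithm>
#include <cmath>

struct ImageSize {
    int width;
    int height;
};

// Scale both edges by the same factor (preserves aspect ratio up to rounding).
static ImageSize scale_size(ImageSize s, double f) {
    return { std::max(1, static_cast<int>(std::lround(s.width * f))),
             std::max(1, static_cast<int>(std::lround(s.height * f))) };
}

// Step 1 (ASSUMED rule): shrink if the longest edge exceeds max_edge,
// otherwise enlarge if the shortest edge is below min_edge.
static ImageSize clamp_edges(ImageSize s, int min_edge, int max_edge) {
    const int longest  = std::max(s.width, s.height);
    const int shortest = std::min(s.width, s.height);
    if (longest > max_edge) {
        return scale_size(s, static_cast<double>(max_edge) / longest);
    }
    if (shortest < min_edge) {
        return scale_size(s, static_cast<double>(min_edge) / shortest);
    }
    return s;
}

// Step 2 (ASSUMED rule): fit the pixel count into [min_pixels, max_pixels],
// then round each edge down to a multiple of the patch size.
static ImageSize fit_pixels_and_align(ImageSize s, long min_pixels, long max_pixels, int patch) {
    const double pixels = static_cast<double>(s.width) * s.height;
    if (pixels > max_pixels) {
        s = scale_size(s, std::sqrt(max_pixels / pixels));
    } else if (pixels < min_pixels) {
        s = scale_size(s, std::sqrt(min_pixels / pixels));
    }
    s.width  = std::max(patch, s.width  / patch * patch);
    s.height = std::max(patch, s.height / patch * patch);
    return s;
}

// Both steps chained, in the order described above.
static ImageSize preprocess_size(ImageSize s, int min_edge, int max_edge,
                                 long min_pixels, long max_pixels, int patch) {
    return fit_pixels_and_align(clamp_edges(s, min_edge, max_edge), min_pixels, max_pixels, patch);
}
```

With made-up limits (edges in [64, 2000], pixels in [1024, 250000], patch size 16), a 4000x1000 input becomes 2000x500 after step 1 and 992x240 after step 2. A preprocessor that only performs step 2 would skip the per-edge clamp entirely, which is the difference being pointed out here.
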

Comment on lines +191 to +212
for (int i = 0; i < n_prefix; i++) {
    pos[i            ] = pos_0;
    pos[i + n_total  ] = 0;
    pos[i + n_total*2] = 0;
    pos[i + n_total*3] = 0;
}

for (int y = 0; y < ny; y++) {
    const float h_pos = (ny > 1)
        ? -ylim + y * (2.0f * ylim) / (ny - 1)
        : 0.0f;
    for (int x = 0; x < nx; x++) {
        const float w_pos = (nx > 1)
            ? -xlim + x * (2.0f * xlim) / (nx - 1)
            : 0.0f;
        int i = n_prefix + y * nx + x;
        pos[i            ] = pos_0;
        pos[i + n_total  ] = (llama_pos)(h_pos * SPATIAL_3D_POS_SCALE);
        pos[i + n_total*2] = (llama_pos)(w_pos * SPATIAL_3D_POS_SCALE);
        pos[i + n_total*3] = 0;
    }
}

Collaborator

Question here: pos_0 never seems to be incremented, so I suspect there is a bug in this code.

If I understand correctly, if we have 3 text tokens followed by 3 image tokens, then the temporal position should be linear for text (0, 1, 2) and then fixed for the image (4, 4, 4).

I implemented the mentioned fix on ngxson#100, inside mtmd_image_tokens_get_decoder_pos.

@avirajBevli avirajBevli Apr 15, 2026

Author

> Question here: pos_0 never seems to be incremented, so I suspect there is a bug in this code.

I implemented it this way by design. Consider a sample prompt that is tokenized to produce the following tokens (in order):
<image_CLS> <image_reg1> ... <image_reg4>.... list of image patch tokens... ...

The intended positions for this sequence are:
[image: intended positions for the token sequence]

Please refer to this for the temporal position calculation in HF:
https://huggingface.co/tiiuae/Falcon-Perception/blob/main/processing_falcon_perception.py#L310-L312

Please note that this is different from the temporal positions you mention in:
#21851

This might be a source of misunderstanding.

Comment thread tools/mtmd/mtmd.cpp

@ngxson ngxson Apr 14, 2026

Collaborator

Note that mtmd_image_tokens_get_n_pos() may also need to be modified (done on ngxson#100).

On M-RoPE models, this counts the number of temporal positions (not the number of tokens or KV slots) for a given image. For M-RoPE, an image takes max(nx, ny) positions.

For falcon-ocr, it seems the number of positions is equal to n_boi + 1, where n_boi is the n_prefix in your version here, plus 1 because the whole image takes one single temporal position. Please correct me if I'm wrong here.

Author

This is also related to the fact that the 5 image cls + reg tokens share the same temporal position (as mentioned in the previous comment).

Please refer to this for the temporal position calculation in HF:
https://huggingface.co/tiiuae/Falcon-Perception/blob/main/processing_falcon_perception.py#L310-L312

Please note that this is different from the temporal positions you mention in:
#21851

Comment thread src/llama-hparams.h
Comment on lines +223 to +224
// Original KV head count before GQA expansion (e.g. Falcon OCR)
uint32_t n_head_kv_orig = 0;

Collaborator

probably we should assume the model to have n_head == n_head_kv as the n_head_kv is mostly used by KV cache implementation (and we do want to have K dim == Q dim here because of the golden rope)

I implemented the solution above on my fork, which makes the code a bit less hacky

Author

Makes sense!

@blap

blap commented Apr 24, 2026

.


5 participants