Add support for Tiny Aya Models #19611

Merged
ngxson merged 9 commits into ggml-org:master from saurabhdash2512:saurabh_tiny_aya on Feb 16, 2026

Conversation

saurabhdash2512 (Contributor) commented on Feb 14, 2026

Summary

This PR adds native support for the CohereLabs/tiny-aya family of models in llama.cpp. These models use a distinct BPE pre-tokenizer (tiny_aya) with a custom digit-grouping regex.

Tagging @ngxson for visibility.

github-actions bot added the python (python script changes) label on Feb 14, 2026
// tiny_aya digit grouping pattern from tokenizer.json:
// {"type": "Split", "pattern": {"Regex": "\\d{1,3}(?=(?:\\d{3})*\\b)"}, "behavior": "Isolated"}
// Splits digits into groups of 3 from the right (e.g., 1234567 -> 1, 234, 567)
bpe_offsets = unicode_regex_split_custom_afmoe(text, offsets);
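The grouping behavior of the quoted pattern can be sanity-checked in Python; a minimal sketch (not part of the PR), using `re.findall` on the regex as it appears in tokenizer.json:

```python
import re

# Digit-grouping pattern quoted from tiny-aya's tokenizer.json:
# 1-3 digits, followed (lookahead) by a multiple of 3 digits up to a word boundary.
PATTERN = re.compile(r"\d{1,3}(?=(?:\d{3})*\b)")

def group_digits(text: str) -> list[str]:
    """Return the digit groups the pattern isolates, left to right."""
    return PATTERN.findall(text)

print(group_digits("1234567"))         # groups of 3 from the right: ['1', '234', '567']
print(group_digits("price: 1000000"))  # ['1', '000', '000']
```

The lookahead forces the first group to absorb the leftover digits, which is what produces right-aligned groups of three.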
Review comment (Member):

These are not exactly the same though; there may be subtle tokenization differences.

Review comment (Contributor):

Hmm, right, @saurabhdash2512, I didn't notice this regex is different. If you prefer fixing this later, we can leave a TODO here and come back after the model is released.

Review comment (Contributor, Author):

I tested this with random strings and didn't see any differences, but added a comment in case there are any edge cases I missed.
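Random-string testing like this can be made systematic with a small differential fuzz; a sketch (not from the PR) that checks the tokenizer.json regex against an independent group-from-the-right implementation, restricted to digit/space inputs where the `\b` semantics are unambiguous:

```python
import random
import re

# Pattern quoted from tiny-aya's tokenizer.json
PATTERN = re.compile(r"\d{1,3}(?=(?:\d{3})*\b)")

def split_regex(text: str) -> list[str]:
    return PATTERN.findall(text)

def split_reference(text: str) -> list[str]:
    """Independently group each digit run into threes from the right."""
    out = []
    for run in re.findall(r"\d+", text):
        head = len(run) % 3 or 3          # size of the leftmost, possibly short group
        out.append(run[:head])
        out.extend(run[i:i + 3] for i in range(head, len(run), 3))
    return out

rng = random.Random(0)
for _ in range(1000):
    # digit runs separated by spaces, so every run ends at a word boundary
    text = " ".join(str(rng.randrange(10 ** rng.randrange(1, 12)))
                    for _ in range(rng.randrange(1, 5)))
    assert split_regex(text) == split_reference(text), text
print("no differences found")
```

Note this deliberately avoids digits adjacent to letters, where `\b` never fires and the regex leaves those digits unsplit.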

saurabhdash2512 (Contributor, Author) commented:

@ngxson @CISC I've addressed the comments! Please let me know if there's anything else needed!

@ngxson ngxson merged commit 5f28c53 into ggml-org:master Feb 16, 2026
81 of 82 checks passed
michaelneale added a commit to michaelneale/llama.cpp that referenced this pull request Feb 17, 2026
* upstream/master: (88 commits)
  ci : bump komac version (ggml-org#19682)
  build : link ws2_32 as PUBLIC on Windows (ggml-org#19666)
  build : cleanup library linking logic (ggml-org#19665)
  convert : add JoyAI-LLM-Flash (ggml-org#19651)
  perplexity: add proper batching (ggml-org#19661)
  common : inline functions (ggml-org#18639)
  ggml : make `ggml_is_view` as API (ggml-org#19539)
  model: Add support for Tiny Aya Models (ggml-org#19611)
  build : rework llama_option_depr to handle LLAMA_CURL (ggml-org#19658)
  Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm veresions (ggml-org#19591)
  models : deduplicate delta-net graphs for Qwen family (ggml-org#19597)
  graph : fix KQ mask, lora, cvec reuse checks (ggml-org#19644)
  ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel  (ggml-org#19132)
  sync : ggml
  ggml : bump version to 0.9.7 (ggml/1425)
  ggml : bump version to 0.9.6 (ggml/1423)
  cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (ggml-org#19624)
  docs: update s390x build docs (ggml-org#19643)
  build : remove LLAMA_HTTPLIB option (ggml-org#19623)
  cmake : check if KleidiAI API has been fetched (ggml-org#19640)
  ...
arch-btw (Contributor) commented:

@saurabhdash2512 this only works for the base model, all the others use a different tokenizer

arch-btw (Contributor) commented on Feb 18, 2026:

shasum tokenizer.json

CohereLabs/tiny-aya-earth   2227ea9c52e8afb3f98bfed2679008b275f2664de69dfde174b374389eb0225d
CohereLabs/tiny-aya-global  2227ea9c52e8afb3f98bfed2679008b275f2664de69dfde174b374389eb0225d
CohereLabs/tiny-aya-fire    2227ea9c52e8afb3f98bfed2679008b275f2664de69dfde174b374389eb0225d
CohereLabs/tiny-aya-water   2227ea9c52e8afb3f98bfed2679008b275f2664de69dfde174b374389eb0225d
CohereLabs/tiny-aya-base    8f21f6c4f761c192f486ea2c5b06b62b3ef30819b33dc105bdf8b26c8e7974f6

CISC (Member) commented on Feb 18, 2026:

@saurabhdash2512 this only works for the base model, all the others use a different tokenizer

Looks like it wasn't a problem for either Cohere or DevQuasar to make GGUFs?

The tokenizer.json can be different without it being an issue, the important part is pre_tokenizer.

arch-btw (Contributor) commented on Feb 18, 2026:

@CISC good to know! Unfortunately, it does throw an error:

raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:**          There are 2 possible reasons for this:
WARNING:hf-to-gguf:**          - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:**          - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:**          Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref:     https://github.com/ggml-org/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh:  89377a9f52591931c0fe65a2096be15bb578e5f211dc1dc2298a92dbede010bb
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:
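For context, this error comes from the conversion script's pre-tokenizer fingerprinting: it encodes a fixed probe string, hashes the stringified token IDs, and looks the hash up against known models. A minimal sketch of that mechanism (the token IDs and table here are made up; the real entries live in get_vocab_base_pre() in convert_hf_to_gguf.py):

```python
import hashlib

def fingerprint(token_ids: list[int]) -> str:
    # The conversion script hashes the stringified token IDs of a fixed probe text.
    return hashlib.sha256(str(token_ids).encode()).hexdigest()

# Illustrative table only; hypothetical token IDs stand in for a real tokenization.
KNOWN_PRE_TOKENIZERS = {
    fingerprint([5, 91, 12, 700]): "tiny_aya",
}

def get_vocab_base_pre(token_ids: list[int]) -> str:
    res = KNOWN_PRE_TOKENIZERS.get(fingerprint(token_ids))
    if res is None:
        raise NotImplementedError(
            "BPE pre-tokenizer was not recognized - update get_vocab_base_pre()"
        )
    return res

print(get_vocab_base_pre([5, 91, 12, 700]))  # tiny_aya
```

An unrecognized hash, as in the log above, means the probe tokenization differs from every registered model, which is why a different tokenizer class on the variant repos surfaces here.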

arch-btw (Contributor) commented on Feb 18, 2026:

Forgot that it's gated, here they are:

tokenizer-base.json
tokenizer-global.json

It's strange, because it does appear that they match.

CISC (Member) commented on Feb 18, 2026:

Forgot that it's gated, here they are:

tokenizer-base.json tokenizer-global.json

It's strange, because it does appear that they match.

Yep, those are for all practical purposes identical; check tokenizer_config.json.
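The distinction here can be checked mechanically by comparing only the pre_tokenizer section of each tokenizer.json instead of hashing whole files. A sketch (toy inline dicts stand in for the downloaded JSONs):

```python
import json

def pre_tokenizer_digest(tokenizer_json: dict) -> str:
    # Canonical JSON of just the pre_tokenizer section, ignoring vocab etc.
    return json.dumps(tokenizer_json.get("pre_tokenizer"), sort_keys=True)

# Toy stand-ins for tokenizer-base.json / tokenizer-global.json: same
# pre_tokenizer, different vocab, which is the harmless kind of difference.
base = {
    "pre_tokenizer": {"type": "Split",
                      "pattern": {"Regex": r"\d{1,3}(?=(?:\d{3})*\b)"},
                      "behavior": "Isolated"},
    "model": {"vocab": {"a": 0}},
}
global_ = {
    "pre_tokenizer": base["pre_tokenizer"],
    "model": {"vocab": {"a": 0, "b": 1}},
}

print(pre_tokenizer_digest(base) == pre_tokenizer_digest(global_))  # True
```

When the digests match, differing file-level hashes like the shasum listing above are not, by themselves, a conversion problem.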

arch-btw (Contributor) commented:

You're right, those are quite different.

tokenizer_config-global.json
tokenizer_config-base.json

CISC (Member) commented on Feb 18, 2026:

You're right, those are quite different.

Oh dear, that would do it:

-  "tokenizer_class": "CohereTokenizer",
+  "tokenizer_class": "CohereTokenizerFast",
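Since the slow/fast class split is what tripped conversion, a quick guard one might run on a repo's tokenizer_config.json before converting (the field name is from tokenizer_config.json; the check itself is not part of the PR):

```python
import json

def uses_fast_tokenizer(config_text: str) -> bool:
    """Return True when tokenizer_config.json declares a *Fast tokenizer class."""
    cls = json.loads(config_text).get("tokenizer_class", "")
    return cls.endswith("Fast")

print(uses_fast_tokenizer('{"tokenizer_class": "CohereTokenizerFast"}'))  # True
print(uses_fast_tokenizer('{"tokenizer_class": "CohereTokenizer"}'))      # False
```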

liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* changes for tiny aya

* changes to hash

* changes to vocab

* fix some tokenizer regex edge cases

* update comment

* add some comments for regex

* Apply suggestion from @ngxson

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026

Labels

python (python script changes)


4 participants