Add support for Tiny Aya Models #19611
Conversation
// tiny_aya digit grouping pattern from tokenizer.json:
// {"type": "Split", "pattern": {"Regex": "\\d{1,3}(?=(?:\\d{3})*\\b)"}, "behavior": "Isolated"}
// Splits digits into groups of 3 from the right (e.g., 1234567 -> 1, 234, 567)
bpe_offsets = unicode_regex_split_custom_afmoe(text, offsets);
These are not exactly the same, though; there may be subtle tokenization differences.
Hmm, right. @saurabhdash2512 I didn't notice this regex is different. If you prefer fixing this later, we can leave a TODO here and come back after the model is released.
I tested this with random strings and didn't see any differences, but added a comment in case there are any edge cases I missed.
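For anyone wanting to sanity-check the grouping behavior outside llama.cpp, here is a minimal Python sketch of the tokenizer.json pattern quoted above. This is not the llama.cpp implementation (that lives in the custom C++ split function); it only illustrates how the lookahead anchors the 1-to-3-digit groups to the right end of a digit run.

```python
import re

# Pattern from the tiny_aya tokenizer.json quoted in this thread:
# \d{1,3} matches 1-3 digits, and the lookahead (?=(?:\d{3})*\b)
# requires that the remaining digits form complete groups of 3,
# so the grouping is counted from the right.
PATTERN = re.compile(r"\d{1,3}(?=(?:\d{3})*\b)")

print(PATTERN.findall("1234567"))    # -> ['1', '234', '567']
print(PATTERN.findall("12"))         # -> ['12']
print(PATTERN.findall("price 1000")) # -> ['1', '000']
```

Note that Python's `re` lookahead semantics may not match the Rust `regex`-based engine used by Hugging Face tokenizers in every edge case, which is exactly the concern raised above.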
* upstream/master: (88 commits)
  * ci : bump komac version (ggml-org#19682)
  * build : link ws2_32 as PUBLIC on Windows (ggml-org#19666)
  * build : cleanup library linking logic (ggml-org#19665)
  * convert : add JoyAI-LLM-Flash (ggml-org#19651)
  * perplexity: add proper batching (ggml-org#19661)
  * common : inline functions (ggml-org#18639)
  * ggml : make `ggml_is_view` as API (ggml-org#19539)
  * model: Add support for Tiny Aya Models (ggml-org#19611)
  * build : rework llama_option_depr to handle LLAMA_CURL (ggml-org#19658)
  * Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (ggml-org#19591)
  * models : deduplicate delta-net graphs for Qwen family (ggml-org#19597)
  * graph : fix KQ mask, lora, cvec reuse checks (ggml-org#19644)
  * ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel (ggml-org#19132)
  * sync : ggml
  * ggml : bump version to 0.9.7 (ggml/1425)
  * ggml : bump version to 0.9.6 (ggml/1423)
  * cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (ggml-org#19624)
  * docs: update s390x build docs (ggml-org#19643)
  * build : remove LLAMA_HTTPLIB option (ggml-org#19623)
  * cmake : check if KleidiAI API has been fetched (ggml-org#19640)
  * ...
@saurabhdash2512 this only works for the base model; all the others use a different tokenizer.
shasum tokenizer.json CohereLabs/tiny-aya-earth |
Looks like it wasn't a problem for Cohere nor DevQuasar to make GGUFs? The
@CISC good to know! Unfortunately, it does throw an error: |
Forgot that it's gated, here they are: tokenizer-base.json. It's strange, because it does appear that they match.
Yep, those are for all purposes identical, check
You're right, those are quite different. |
Oh dear, that would do it:
- "tokenizer_class": "CohereTokenizer",
+ "tokenizer_class": "CohereTokenizerFast",
* changes for tiny aya
* changes to hash
* changes to vocab
* fix some tokenizer regex edge cases
* update comment
* add some comments for regex
* Apply suggestion from @ngxson

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Summary
This PR adds native support for the CohereLabs/tiny-aya family of models in llama.cpp. These models use a distinct BPE pre-tokenizer (tiny_aya) with a custom digit-grouping regex.
Tagging @ngxson for visibility.