vocab : add Carbon-3B (HybridDNATokenizer) support by kashif · Pull Request #23410 · ggml-org/llama.cpp

kashif · 2026-05-20T11:12:44Z

Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}. The base BPE is Qwen3-4B-Base's; what differs is that text inside ... regions is chunked into fixed 6-mers (right-padded with 'A' on the trailing partial), and any base outside ACGT maps to .

src/llama-vocab.{h,cpp}: new pre-type, dispatched from llm_tokenizer_bpe_session::tokenize.
src/llama-vocab-carbon.h: pure helpers (tokenize_carbon, emit_dna_kmers) factored out for unit testing — no llama_vocab dependency, vocab access goes through a std::function.
conversion/base.py: detect HybridDNATokenizer by class name in get_vocab_base_pre (chktxt collides with Qwen3 base since it has no ), and pass trust_remote_code=True in get_vocab_base so the custom tokenizer class can load.
tests/test-tokenizer-carbon.cpp: 12 cases covering single 6-mer, multi 6-mer, lowercase, invalid base -> , partial k-mer right-pad, mixed text+DNA, empty , unterminated , two regions, vocab miss.

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes used for adding git comments and writing tests and refactoring based on reviewer suggestion

Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}. The base BPE is Qwen3-4B-Base's; what differs is that text inside <dna>...</dna> regions is chunked into fixed 6-mers (right-padded with 'A' on the trailing partial), and any base outside ACGT maps to <oov>. * src/llama-vocab.{h,cpp}: new pre-type, dispatched from llm_tokenizer_bpe_session::tokenize. * src/llama-vocab-carbon.h: pure helpers (tokenize_carbon, emit_dna_kmers) factored out for unit testing — no llama_vocab dependency, vocab access goes through a std::function. * conversion/base.py: detect HybridDNATokenizer by class name in get_vocab_base_pre (chktxt collides with Qwen3 base since it has no <dna>), and pass trust_remote_code=True in get_vocab_base so the custom tokenizer class can load. * tests/test-tokenizer-carbon.cpp: 12 cases covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial k-mer right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions, vocab miss.

* Fold tokenize_carbon + emit_dna_kmers inline into llm_tokenizer_bpe_session (drop src/llama-vocab-carbon.h), matching how every other tokenizer keeps its helpers inside llama-vocab.cpp. * Replace the standalone unit test with the conventional test-tokenizer-0 row backed by models/ggml-vocab-carbon.gguf (vocab-only conversion) + .inp/.out fixtures covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions. * Register "carbon" in convert_hf_to_gguf_update.py's model list (pointing at HuggingFaceBio/Carbon-3B) and teach both AutoTokenizer call sites in the updater to pass trust_remote_code=True for it, matching how t5 is special-cased.

Refactor the conversion-side changes to follow the per-tokenizer-family convention used by _set_vocab_qwen, _set_vocab_interns1, _set_vocab_glm, etc. instead of conditionalising the shared get_vocab_base / get_vocab_base_pre paths. * conversion/base.py: add _set_vocab_carbon — self-contained, loads with trust_remote_code=True so HybridDNATokenizer's merged Qwen3 + DNA vocab is visible, writes tokenizer.ggml.pre = "carbon" directly. * conversion/llama.py: branch in LlamaModel.set_vocab on tokenizer_config.json["tokenizer_class"] == "HybridDNATokenizer" and dispatch to _set_vocab_carbon. Same precedent as conversion/bert.py (tokenizer_class branch between BertTokenizer / RobertaTokenizer) and conversion/phi.py. * conversion/base.py: revert the conditional in get_vocab_base and the class-name short-circuit in the auto-generated get_vocab_base_pre.

Add 6 cases from the Carbon-3B model card on top of the existing edge coverage: the unterminated basic-completion prompt, the closed 33-bp example, the metadata-conditioned prompt (with <vertebrate_mammalian> and <protein_coding_region> which BPE-decompose since they are not in the vocab), the documented anti-pattern of raw DNA without <dna> tags, and the two likelihood-scoring examples. Brings the suite to 19 cases.

CISC · 2026-05-20T12:01:14Z

This should be its own tokenizer model, ie. ~~carbon~~hybriddna instead of gpt2 and not carbon pre-tokenizer.

Edit: That way you can keep the correct pre-tokenizer, in case that ever changes.
Edit2: Perhaps name the tokenizer model by their own chosen name.

CISC · 2026-05-20T12:08:47Z

Also, update OP with AI disclosure.

kashif · 2026-05-20T12:29:29Z

thanks @CISC fixing

Refactor per upstream review: > This should be its own tokenizer model, ie. carbonhybriddna instead > of gpt2 and not carbon pre-tokenizer. That way you can keep the > correct pre-tokenizer, in case that ever changes. Previously the tokenizer was modelled as LLAMA_VOCAB_TYPE_BPE plus a new LLAMA_VOCAB_PRE_TYPE_CARBON, which (a) put a CARBON-specific branch inside llm_tokenizer_bpe_session::tokenize (only existing pre-types differ in regex, not dispatch logic), and (b) conflated "hybrid DNA tokenization" with "Qwen3 BPE pre-tokenizer". This change moves it to its own vocab type, peer to PLAMO2, with the GGUF model name matching the HF tokenizer class (HybridDNATokenizer): * include/llama.h: new LLAMA_VOCAB_TYPE_HYBRIDDNA = 7. * src/llama-vocab.cpp: new llm_tokenizer_hybriddna + session that owns std::unique_ptr<llm_tokenizer_bpe> for non-<dna> text and routes raw text through a DNA-aware splitter; wired into init_tokenizer, tokenize, type_name, byte_to_token, and the BPE-style token_to_piece case (DNA k-mers + <dna>/</dna>/<oov> are pure ASCII, so byte-level BPE decoding handles them). LLAMA_VOCAB_TYPE_HYBRIDDNA gets its own branch in the vocab-type config block alongside SPM/WPM/UGM/RWKV, where pre_type is set to QWEN2 and the matching add_space_prefix / escape_whitespaces / clean_spaces flags are applied — mirroring qwen2's BPE path so byte-level BPE merging stays bit-identical to the Python reference for non-DNA text. * src/llama-vocab.h: drop the short-lived LLAMA_VOCAB_PRE_TYPE_CARBON. * conversion/base.py: _set_vocab_hybriddna writes tokenizer.ggml.model = "hybriddna" (no separate pre). * conversion/llama.py: dispatch on tokenizer_class == "HybridDNATokenizer" same as bert.py / phi.py do. * models/ggml-vocab-hybriddna.gguf{,.inp,.out}: renamed fixture + regenerated metadata. * convert_hf_to_gguf_update.py: drop the stale chkhsh entry and trust_remote_code special-case (no longer needed since dispatch is now class-name driven, not chkhsh). Verified end-to-end against HuggingFaceBio/Carbon-{500M,3B,8B}: tokenization is bit-identical to the Python HybridDNATokenizer for all 19 test fixtures plus the model-card metadata-conditioned prompt; greedy completion produces the same DNA continuation as the Python reference; spec-dec with 500M as draft for 8B still works.

kashif · 2026-05-20T13:21:02Z

@CISC if you could have a 2nd look

CISC · 2026-05-20T13:27:10Z

Don't assume pre-tokenizer, get and set the right one at conversion and check for it together with the BPE codeblock.

kashif · 2026-05-20T13:44:06Z

@CISC thanks should be good now

CISC · 2026-05-20T13:49:41Z

@CISC thanks should be good now

Great, thank you, I will have a proper look later.

You can delete the ggml-vocab files and remove the test, I will add this to our ggml-vocabs repo which gets tested by our CIs.

CISC · 2026-05-20T22:59:17Z

This is simpler I think, PTAL @kashif @ggerganov

kashif · 2026-05-21T06:15:29Z

Thanks @CISC, that's a neat way!

kashif · 2026-05-21T07:55:51Z

@CISC I realized that llama.cpp's vocab is a strict 1:1 text↔id bimap (GGML_ASSERT(id_to_token.size() == token_to_id.size()) at load).

However, HybridDNA intentionally has two CCCCCC tokens with different embeddings — BPE 91443 (English/code context) and DNA 154402 (DNA context). llama.cpp cannot represent both. So, any idea how to deal with this representational mismatch?

CISC · 2026-05-21T08:26:49Z

@CISC I realized that llama.cpp's vocab is a strict 1:1 text↔id bimap (GGML_ASSERT(id_to_token.size() == token_to_id.size()) at load).

However, HybridDNA intentionally has two CCCCCC tokens with different embeddings — BPE 91443 (English/code context) and DNA 154402 (DNA context). llama.cpp cannot represent both. So, any idea how to deal with this representational mismatch?

Ah, hm, so looks like that slot got filled with a pad due to the identical tokens. Would it work adding a special case for it in hybriddna_session?

* vocab : add Carbon-3B (HybridDNATokenizer) support Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}. The base BPE is Qwen3-4B-Base's; what differs is that text inside <dna>...</dna> regions is chunked into fixed 6-mers (right-padded with 'A' on the trailing partial), and any base outside ACGT maps to <oov>. * src/llama-vocab.{h,cpp}: new pre-type, dispatched from llm_tokenizer_bpe_session::tokenize. * src/llama-vocab-carbon.h: pure helpers (tokenize_carbon, emit_dna_kmers) factored out for unit testing — no llama_vocab dependency, vocab access goes through a std::function. * conversion/base.py: detect HybridDNATokenizer by class name in get_vocab_base_pre (chktxt collides with Qwen3 base since it has no <dna>), and pass trust_remote_code=True in get_vocab_base so the custom tokenizer class can load. * tests/test-tokenizer-carbon.cpp: 12 cases covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial k-mer right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions, vocab miss. * vocab : align Carbon-3B changes with llama.cpp conventions * Fold tokenize_carbon + emit_dna_kmers inline into llm_tokenizer_bpe_session (drop src/llama-vocab-carbon.h), matching how every other tokenizer keeps its helpers inside llama-vocab.cpp. * Replace the standalone unit test with the conventional test-tokenizer-0 row backed by models/ggml-vocab-carbon.gguf (vocab-only conversion) + .inp/.out fixtures covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions. * Register "carbon" in convert_hf_to_gguf_update.py's model list (pointing at HuggingFaceBio/Carbon-3B) and teach both AutoTokenizer call sites in the updater to pass trust_remote_code=True for it, matching how t5 is special-cased. * vocab : move Carbon dispatch to _set_vocab_carbon + LlamaModel branch Refactor the conversion-side changes to follow the per-tokenizer-family convention used by _set_vocab_qwen, _set_vocab_interns1, _set_vocab_glm, etc. instead of conditionalising the shared get_vocab_base / get_vocab_base_pre paths. * conversion/base.py: add _set_vocab_carbon — self-contained, loads with trust_remote_code=True so HybridDNATokenizer's merged Qwen3 + DNA vocab is visible, writes tokenizer.ggml.pre = "carbon" directly. * conversion/llama.py: branch in LlamaModel.set_vocab on tokenizer_config.json["tokenizer_class"] == "HybridDNATokenizer" and dispatch to _set_vocab_carbon. Same precedent as conversion/bert.py (tokenizer_class branch between BertTokenizer / RobertaTokenizer) and conversion/phi.py. * conversion/base.py: revert the conditional in get_vocab_base and the class-name short-circuit in the auto-generated get_vocab_base_pre. * tests : expand ggml-vocab-carbon.gguf fixtures with model-card examples Add 6 cases from the Carbon-3B model card on top of the existing edge coverage: the unterminated basic-completion prompt, the closed 33-bp example, the metadata-conditioned prompt (with <vertebrate_mammalian> and <protein_coding_region> which BPE-decompose since they are not in the vocab), the documented anti-pattern of raw DNA without <dna> tags, and the two likelihood-scoring examples. Brings the suite to 19 cases. * vocab : promote HybridDNATokenizer to its own LLAMA_VOCAB_TYPE Refactor per upstream review: > This should be its own tokenizer model, ie. carbonhybriddna instead > of gpt2 and not carbon pre-tokenizer. That way you can keep the > correct pre-tokenizer, in case that ever changes. Previously the tokenizer was modelled as LLAMA_VOCAB_TYPE_BPE plus a new LLAMA_VOCAB_PRE_TYPE_CARBON, which (a) put a CARBON-specific branch inside llm_tokenizer_bpe_session::tokenize (only existing pre-types differ in regex, not dispatch logic), and (b) conflated "hybrid DNA tokenization" with "Qwen3 BPE pre-tokenizer". This change moves it to its own vocab type, peer to PLAMO2, with the GGUF model name matching the HF tokenizer class (HybridDNATokenizer): * include/llama.h: new LLAMA_VOCAB_TYPE_HYBRIDDNA = 7. * src/llama-vocab.cpp: new llm_tokenizer_hybriddna + session that owns std::unique_ptr<llm_tokenizer_bpe> for non-<dna> text and routes raw text through a DNA-aware splitter; wired into init_tokenizer, tokenize, type_name, byte_to_token, and the BPE-style token_to_piece case (DNA k-mers + <dna>/</dna>/<oov> are pure ASCII, so byte-level BPE decoding handles them). LLAMA_VOCAB_TYPE_HYBRIDDNA gets its own branch in the vocab-type config block alongside SPM/WPM/UGM/RWKV, where pre_type is set to QWEN2 and the matching add_space_prefix / escape_whitespaces / clean_spaces flags are applied — mirroring qwen2's BPE path so byte-level BPE merging stays bit-identical to the Python reference for non-DNA text. * src/llama-vocab.h: drop the short-lived LLAMA_VOCAB_PRE_TYPE_CARBON. * conversion/base.py: _set_vocab_hybriddna writes tokenizer.ggml.model = "hybriddna" (no separate pre). * conversion/llama.py: dispatch on tokenizer_class == "HybridDNATokenizer" same as bert.py / phi.py do. * models/ggml-vocab-hybriddna.gguf{,.inp,.out}: renamed fixture + regenerated metadata. * convert_hf_to_gguf_update.py: drop the stale chkhsh entry and trust_remote_code special-case (no longer needed since dispatch is now class-name driven, not chkhsh). Verified end-to-end against HuggingFaceBio/Carbon-{500M,3B,8B}: tokenization is bit-identical to the Python HybridDNATokenizer for all 19 test fixtures plus the model-card metadata-conditioned prompt; greedy completion produces the same DNA continuation as the Python reference; spec-dec with 500M as draft for 8B still works. * vocab : relax llm_tokenizer_bpe assert to allow HYBRIDDNA * vocab : drop llm_tokenizer_bpe vocab-type assert * vocab : write tokenizer.ggml.pre for HYBRIDDNA, share BPE dispatch * vocab : assert BPE or HYBRIDDNA in llm_tokenizer_bpe * vocab : annotate #endif with PRETOKENIZERDEBUG * vocab : drop local hybriddna fixture (moves to ggml-org/vocabs) * deduplicate * simplify * simplify --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* origin/master: (138 commits) fix(flash-attn): replace f32 with kv_type and q_type (ggml-org#23372) tests : move save-load-state from examples to tests (ggml-org#23336) server: expose prompt token counts in /slots endpoint (ggml-org#23454) metal : optimize concat kernel and fix set kernel threads (ggml-org#23411) server : free draft/MTP resources on sleep to fix VRAM leak (ggml-org#23461) server: re-inject subcommand when router spawns children under unified binary (ggml-org#23442) app : add batched-bench, fit-params, quantize & perplexity (ggml-org#23459) mtp: use inp_out_ids for skipping logit computation (ggml-org#23433) vocab : add Carbon-3B (HybridDNATokenizer) support (ggml-org#23410) doc: fix spec mtp typo (ggml-org#23435) ui: Improve Git Hooks for UI development (ggml-org#23403) ggml : Check the right iface method before using the fallback 2d get (ggml-org#23306) llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models (ggml-org#23131) hexagon: ssm-conv fix for large prompts (ggml-org#23307) app : show version (ggml-org#23426) mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (ggml-org#23329) ui: Add max image size option (ggml-org#22849) Move to backend sampling for MTP draft path (ggml-org#23287) opencl: refactor backend initilization (ggml-org#23318) common/speculative : fix nullptr crash in get_devices_str (ggml-org#23386) ...

* vocab : add Carbon-3B (HybridDNATokenizer) support Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}. The base BPE is Qwen3-4B-Base's; what differs is that text inside <dna>...</dna> regions is chunked into fixed 6-mers (right-padded with 'A' on the trailing partial), and any base outside ACGT maps to <oov>. * src/llama-vocab.{h,cpp}: new pre-type, dispatched from llm_tokenizer_bpe_session::tokenize. * src/llama-vocab-carbon.h: pure helpers (tokenize_carbon, emit_dna_kmers) factored out for unit testing — no llama_vocab dependency, vocab access goes through a std::function. * conversion/base.py: detect HybridDNATokenizer by class name in get_vocab_base_pre (chktxt collides with Qwen3 base since it has no <dna>), and pass trust_remote_code=True in get_vocab_base so the custom tokenizer class can load. * tests/test-tokenizer-carbon.cpp: 12 cases covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial k-mer right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions, vocab miss. * vocab : align Carbon-3B changes with llama.cpp conventions * Fold tokenize_carbon + emit_dna_kmers inline into llm_tokenizer_bpe_session (drop src/llama-vocab-carbon.h), matching how every other tokenizer keeps its helpers inside llama-vocab.cpp. * Replace the standalone unit test with the conventional test-tokenizer-0 row backed by models/ggml-vocab-carbon.gguf (vocab-only conversion) + .inp/.out fixtures covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions. * Register "carbon" in convert_hf_to_gguf_update.py's model list (pointing at HuggingFaceBio/Carbon-3B) and teach both AutoTokenizer call sites in the updater to pass trust_remote_code=True for it, matching how t5 is special-cased. * vocab : move Carbon dispatch to _set_vocab_carbon + LlamaModel branch Refactor the conversion-side changes to follow the per-tokenizer-family convention used by _set_vocab_qwen, _set_vocab_interns1, _set_vocab_glm, etc. instead of conditionalising the shared get_vocab_base / get_vocab_base_pre paths. * conversion/base.py: add _set_vocab_carbon — self-contained, loads with trust_remote_code=True so HybridDNATokenizer's merged Qwen3 + DNA vocab is visible, writes tokenizer.ggml.pre = "carbon" directly. * conversion/llama.py: branch in LlamaModel.set_vocab on tokenizer_config.json["tokenizer_class"] == "HybridDNATokenizer" and dispatch to _set_vocab_carbon. Same precedent as conversion/bert.py (tokenizer_class branch between BertTokenizer / RobertaTokenizer) and conversion/phi.py. * conversion/base.py: revert the conditional in get_vocab_base and the class-name short-circuit in the auto-generated get_vocab_base_pre. * tests : expand ggml-vocab-carbon.gguf fixtures with model-card examples Add 6 cases from the Carbon-3B model card on top of the existing edge coverage: the unterminated basic-completion prompt, the closed 33-bp example, the metadata-conditioned prompt (with <vertebrate_mammalian> and <protein_coding_region> which BPE-decompose since they are not in the vocab), the documented anti-pattern of raw DNA without <dna> tags, and the two likelihood-scoring examples. Brings the suite to 19 cases. * vocab : promote HybridDNATokenizer to its own LLAMA_VOCAB_TYPE Refactor per upstream review: > This should be its own tokenizer model, ie. carbonhybriddna instead > of gpt2 and not carbon pre-tokenizer. That way you can keep the > correct pre-tokenizer, in case that ever changes. Previously the tokenizer was modelled as LLAMA_VOCAB_TYPE_BPE plus a new LLAMA_VOCAB_PRE_TYPE_CARBON, which (a) put a CARBON-specific branch inside llm_tokenizer_bpe_session::tokenize (only existing pre-types differ in regex, not dispatch logic), and (b) conflated "hybrid DNA tokenization" with "Qwen3 BPE pre-tokenizer". This change moves it to its own vocab type, peer to PLAMO2, with the GGUF model name matching the HF tokenizer class (HybridDNATokenizer): * include/llama.h: new LLAMA_VOCAB_TYPE_HYBRIDDNA = 7. * src/llama-vocab.cpp: new llm_tokenizer_hybriddna + session that owns std::unique_ptr<llm_tokenizer_bpe> for non-<dna> text and routes raw text through a DNA-aware splitter; wired into init_tokenizer, tokenize, type_name, byte_to_token, and the BPE-style token_to_piece case (DNA k-mers + <dna>/</dna>/<oov> are pure ASCII, so byte-level BPE decoding handles them). LLAMA_VOCAB_TYPE_HYBRIDDNA gets its own branch in the vocab-type config block alongside SPM/WPM/UGM/RWKV, where pre_type is set to QWEN2 and the matching add_space_prefix / escape_whitespaces / clean_spaces flags are applied — mirroring qwen2's BPE path so byte-level BPE merging stays bit-identical to the Python reference for non-DNA text. * src/llama-vocab.h: drop the short-lived LLAMA_VOCAB_PRE_TYPE_CARBON. * conversion/base.py: _set_vocab_hybriddna writes tokenizer.ggml.model = "hybriddna" (no separate pre). * conversion/llama.py: dispatch on tokenizer_class == "HybridDNATokenizer" same as bert.py / phi.py do. * models/ggml-vocab-hybriddna.gguf{,.inp,.out}: renamed fixture + regenerated metadata. * convert_hf_to_gguf_update.py: drop the stale chkhsh entry and trust_remote_code special-case (no longer needed since dispatch is now class-name driven, not chkhsh). Verified end-to-end against HuggingFaceBio/Carbon-{500M,3B,8B}: tokenization is bit-identical to the Python HybridDNATokenizer for all 19 test fixtures plus the model-card metadata-conditioned prompt; greedy completion produces the same DNA continuation as the Python reference; spec-dec with 500M as draft for 8B still works. * vocab : relax llm_tokenizer_bpe assert to allow HYBRIDDNA * vocab : drop llm_tokenizer_bpe vocab-type assert * vocab : write tokenizer.ggml.pre for HYBRIDDNA, share BPE dispatch * vocab : assert BPE or HYBRIDDNA in llm_tokenizer_bpe * vocab : annotate #endif with PRETOKENIZERDEBUG * vocab : drop local hybriddna fixture (moves to ggml-org/vocabs) * deduplicate * simplify * simplify --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* upstream/HEAD: (38 commits) vocab : add Carbon-3B (HybridDNATokenizer) support (ggml-org#23410) doc: fix spec mtp typo (ggml-org#23435) ui: Improve Git Hooks for UI development (ggml-org#23403) ggml : Check the right iface method before using the fallback 2d get (ggml-org#23306) llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models (ggml-org#23131) hexagon: ssm-conv fix for large prompts (ggml-org#23307) app : show version (ggml-org#23426) mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (ggml-org#23329) ui: Add max image size option (ggml-org#22849) Move to backend sampling for MTP draft path (ggml-org#23287) opencl: refactor backend initilization (ggml-org#23318) common/speculative : fix nullptr crash in get_devices_str (ggml-org#23386) mtmd : DeepSeek-OCR image processing fixes, img_tool::resize padding refactor (ggml-org#23345) vulkan: optimize operations in the IM2COL shader (ggml-org#22685) feat: Add WAV MIME type variants and improve audio format detection (ggml-org#23396) hexagon: HMX quantized matmul rework (ggml-org#23368) Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (ggml-org#22522) app : introduce the llama unified executable (ggml-org#23296) refactor: Move text attachments up before the message content in chat completions payload (ggml-org#23406) mtmd: fit_params now take into account mmproj (ggml-org#21489) ...

* vocab : add Carbon-3B (HybridDNATokenizer) support Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}. The base BPE is Qwen3-4B-Base's; what differs is that text inside <dna>...</dna> regions is chunked into fixed 6-mers (right-padded with 'A' on the trailing partial), and any base outside ACGT maps to <oov>. * src/llama-vocab.{h,cpp}: new pre-type, dispatched from llm_tokenizer_bpe_session::tokenize. * src/llama-vocab-carbon.h: pure helpers (tokenize_carbon, emit_dna_kmers) factored out for unit testing — no llama_vocab dependency, vocab access goes through a std::function. * conversion/base.py: detect HybridDNATokenizer by class name in get_vocab_base_pre (chktxt collides with Qwen3 base since it has no <dna>), and pass trust_remote_code=True in get_vocab_base so the custom tokenizer class can load. * tests/test-tokenizer-carbon.cpp: 12 cases covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial k-mer right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions, vocab miss. * vocab : align Carbon-3B changes with llama.cpp conventions * Fold tokenize_carbon + emit_dna_kmers inline into llm_tokenizer_bpe_session (drop src/llama-vocab-carbon.h), matching how every other tokenizer keeps its helpers inside llama-vocab.cpp. * Replace the standalone unit test with the conventional test-tokenizer-0 row backed by models/ggml-vocab-carbon.gguf (vocab-only conversion) + .inp/.out fixtures covering single 6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>, two regions. * Register "carbon" in convert_hf_to_gguf_update.py's model list (pointing at HuggingFaceBio/Carbon-3B) and teach both AutoTokenizer call sites in the updater to pass trust_remote_code=True for it, matching how t5 is special-cased. * vocab : move Carbon dispatch to _set_vocab_carbon + LlamaModel branch Refactor the conversion-side changes to follow the per-tokenizer-family convention used by _set_vocab_qwen, _set_vocab_interns1, _set_vocab_glm, etc. instead of conditionalising the shared get_vocab_base / get_vocab_base_pre paths. * conversion/base.py: add _set_vocab_carbon — self-contained, loads with trust_remote_code=True so HybridDNATokenizer's merged Qwen3 + DNA vocab is visible, writes tokenizer.ggml.pre = "carbon" directly. * conversion/llama.py: branch in LlamaModel.set_vocab on tokenizer_config.json["tokenizer_class"] == "HybridDNATokenizer" and dispatch to _set_vocab_carbon. Same precedent as conversion/bert.py (tokenizer_class branch between BertTokenizer / RobertaTokenizer) and conversion/phi.py. * conversion/base.py: revert the conditional in get_vocab_base and the class-name short-circuit in the auto-generated get_vocab_base_pre. * tests : expand ggml-vocab-carbon.gguf fixtures with model-card examples Add 6 cases from the Carbon-3B model card on top of the existing edge coverage: the unterminated basic-completion prompt, the closed 33-bp example, the metadata-conditioned prompt (with <vertebrate_mammalian> and <protein_coding_region> which BPE-decompose since they are not in the vocab), the documented anti-pattern of raw DNA without <dna> tags, and the two likelihood-scoring examples. Brings the suite to 19 cases. * vocab : promote HybridDNATokenizer to its own LLAMA_VOCAB_TYPE Refactor per upstream review: > This should be its own tokenizer model, ie. carbonhybriddna instead > of gpt2 and not carbon pre-tokenizer. That way you can keep the > correct pre-tokenizer, in case that ever changes. Previously the tokenizer was modelled as LLAMA_VOCAB_TYPE_BPE plus a new LLAMA_VOCAB_PRE_TYPE_CARBON, which (a) put a CARBON-specific branch inside llm_tokenizer_bpe_session::tokenize (only existing pre-types differ in regex, not dispatch logic), and (b) conflated "hybrid DNA tokenization" with "Qwen3 BPE pre-tokenizer". This change moves it to its own vocab type, peer to PLAMO2, with the GGUF model name matching the HF tokenizer class (HybridDNATokenizer): * include/llama.h: new LLAMA_VOCAB_TYPE_HYBRIDDNA = 7. * src/llama-vocab.cpp: new llm_tokenizer_hybriddna + session that owns std::unique_ptr<llm_tokenizer_bpe> for non-<dna> text and routes raw text through a DNA-aware splitter; wired into init_tokenizer, tokenize, type_name, byte_to_token, and the BPE-style token_to_piece case (DNA k-mers + <dna>/</dna>/<oov> are pure ASCII, so byte-level BPE decoding handles them). LLAMA_VOCAB_TYPE_HYBRIDDNA gets its own branch in the vocab-type config block alongside SPM/WPM/UGM/RWKV, where pre_type is set to QWEN2 and the matching add_space_prefix / escape_whitespaces / clean_spaces flags are applied — mirroring qwen2's BPE path so byte-level BPE merging stays bit-identical to the Python reference for non-DNA text. * src/llama-vocab.h: drop the short-lived LLAMA_VOCAB_PRE_TYPE_CARBON. * conversion/base.py: _set_vocab_hybriddna writes tokenizer.ggml.model = "hybriddna" (no separate pre). * conversion/llama.py: dispatch on tokenizer_class == "HybridDNATokenizer" same as bert.py / phi.py do. * models/ggml-vocab-hybriddna.gguf{,.inp,.out}: renamed fixture + regenerated metadata. * convert_hf_to_gguf_update.py: drop the stale chkhsh entry and trust_remote_code special-case (no longer needed since dispatch is now class-name driven, not chkhsh). Verified end-to-end against HuggingFaceBio/Carbon-{500M,3B,8B}: tokenization is bit-identical to the Python HybridDNATokenizer for all 19 test fixtures plus the model-card metadata-conditioned prompt; greedy completion produces the same DNA continuation as the Python reference; spec-dec with 500M as draft for 8B still works. * vocab : relax llm_tokenizer_bpe assert to allow HYBRIDDNA * vocab : drop llm_tokenizer_bpe vocab-type assert * vocab : write tokenizer.ggml.pre for HYBRIDDNA, share BPE dispatch * vocab : assert BPE or HYBRIDDNA in llm_tokenizer_bpe * vocab : annotate #endif with PRETOKENIZERDEBUG * vocab : drop local hybriddna fixture (moves to ggml-org/vocabs) * deduplicate * simplify * simplify --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

kashif requested review from CISC and ggerganov as code owners May 20, 2026 11:12

This comment was marked as resolved.

Sign in to view

kashif added 3 commits May 20, 2026 13:31

kashif added 3 commits May 20, 2026 14:53

vocab : relax llm_tokenizer_bpe assert to allow HYBRIDDNA

87d7334

vocab : drop llm_tokenizer_bpe vocab-type assert

9d8e1ff

kashif added 3 commits May 20, 2026 15:30

vocab : write tokenizer.ggml.pre for HYBRIDDNA, share BPE dispatch

6499d76

vocab : assert BPE or HYBRIDDNA in llm_tokenizer_bpe

c1c3d42

vocab : annotate #endif with PRETOKENIZERDEBUG

dde3870

vocab : drop local hybriddna fixture (moves to ggml-org/vocabs)

16c4bf6

github-actions Bot added the python python script changes label May 20, 2026

CISC added 3 commits May 21, 2026 00:49

deduplicate

5025b1d

simplify

568c948

simplify

71aba53

CISC added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label May 21, 2026

ggerganov approved these changes May 21, 2026

View reviewed changes

CISC approved these changes May 21, 2026

View reviewed changes

CISC merged commit 7ea23dd into ggml-org:master May 21, 2026
51 checks passed

kashif deleted the carbon-3b-tokenizer branch May 21, 2026 07:00

kashif mentioned this pull request May 21, 2026

vocab : keep DNA k-mer ids distinct from colliding BPE tokens #23466

Merged

nyo16 mentioned this pull request May 21, 2026

Bump llama.cpp to 52fb93a2b (30 commits) nyo16/llama_cpp_ex#42

Merged

4 tasks

CISC mentioned this pull request May 25, 2026

vocab: add tokenizer support for jina-embeddings-v2-base-zh #18756

Merged

a-ghorbani mentioned this pull request May 25, 2026

chore(deps): upgrade llama.rn to 0.12.4 a-ghorbani/pocketpal-ai#743

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vocab : add Carbon-3B (HybridDNATokenizer) support#23410

vocab : add Carbon-3B (HybridDNATokenizer) support#23410
CISC merged 14 commits into
ggml-org:masterfrom
kashif:carbon-3b-tokenizer

kashif commented May 20, 2026 •

edited

Loading

Uh oh!

This comment was marked as resolved.

CISC commented May 20, 2026 •

edited

Loading

Uh oh!

CISC commented May 20, 2026

Uh oh!

kashif commented May 20, 2026

Uh oh!

kashif commented May 20, 2026

Uh oh!

CISC commented May 20, 2026

Uh oh!

kashif commented May 20, 2026

Uh oh!

CISC commented May 20, 2026 •

edited

Loading

Uh oh!

CISC commented May 20, 2026

Uh oh!

kashif commented May 21, 2026

Uh oh!

Uh oh!

kashif commented May 21, 2026

Uh oh!

CISC commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

kashif commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Additional information

Requirements

Uh oh!

This comment was marked as resolved.

CISC commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CISC commented May 20, 2026

Uh oh!

kashif commented May 20, 2026

Uh oh!

kashif commented May 20, 2026

Uh oh!

CISC commented May 20, 2026

Uh oh!

kashif commented May 20, 2026

Uh oh!

CISC commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CISC commented May 20, 2026

Uh oh!

kashif commented May 21, 2026

Uh oh!

Uh oh!

kashif commented May 21, 2026

Uh oh!

CISC commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kashif commented May 20, 2026 •

edited

Loading

CISC commented May 20, 2026 •

edited

Loading

CISC commented May 20, 2026 •

edited

Loading