Kimi-Linear support (backend agnostic + MLA KV cache) #18755

Merged
pwilkin merged 84 commits into ggml-org:master from ymcki:Kimi-Linear on Feb 6, 2026

Conversation

ymcki (Contributor) commented Jan 11, 2026

@CISC

I have implemented backend-agnostic Kimi-Linear support with MLA KV cache support. I also followed CISC's comments to minimize changes and put the code in the right place.

This PR only touches 18 files, compared to 51 files in the cacaview PR:
#17592
I believe it should be quite easy to review and merge. I created this PR so that it is easier for reviewers to review.

It is also sync'd to b7738, so it is ready to merge at any time.

Please let me know what else I need to do. Thanks a lot in advance.

CISC (Member) left a comment

There are probably still way too many conts and transpose/permute/reshape roundtrips here, but they can be dealt with later.

ymcki (Contributor, Author) commented Feb 1, 2026

Added new names for n_experts, n_experts_used and score_func in TextModel, removed their handling from KimiLinear, and added the yield code in KimiLinear in convert_hf_to_gguf.py.

Also, removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp.
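
For readers following along, here is a minimal hedged sketch of the pattern being described, assuming the converter's usual TextModel helpers (find_hparam, gguf_writer.add_expert_count / add_expert_used_count, map_tensor_name); the class name and the hparam key names below are illustrative, not the exact ones used for Kimi-Linear:

```python
# A hedged sketch (not the PR's exact code) of the convert_hf_to_gguf.py pattern:
# read MoE hparams through TextModel's find_hparam() and emit them via gguf_writer,
# and yield renamed tensors from modify_tensors(). Key names are assumptions.
class KimiLinearLikeModel(TextModel):  # hypothetical name for illustration
    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        n_experts      = self.find_hparam(["num_experts", "n_routed_experts"])
        n_experts_used = self.find_hparam(["num_experts_per_tok"])
        self.gguf_writer.add_expert_count(n_experts)
        self.gguf_writer.add_expert_used_count(n_experts_used)

    def modify_tensors(self, data_torch, name, bid):
        # rename via the shared tensor map and yield, instead of per-model handling
        yield self.map_tensor_name(name), data_torch
```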

ymcki (Contributor, Author) commented Feb 3, 2026

@CISC, the latest committed version should have fixed the logical bugs you just pointed out.

pwilkin mentioned this pull request Feb 5, 2026
CISC (Member) commented Feb 6, 2026

@pwilkin I think we're good. Can you take a final look and merge if it looks ready?

pwilkin (Contributor) left a comment

Took another look, seems OK.

pwilkin merged commit 3688c4f into ggml-org:master on Feb 6, 2026 (75 of 82 checks passed)
arch-btw (Contributor) commented Feb 6, 2026

Unrelated to this PR, but in order to convert the GGUF you will need to downgrade transformers from 5.0.0 (or higher) to 4.57.6. Otherwise conversion will throw this error:

cannot import name 'bytes_to_unicode'

See:

huggingface/transformers#43726
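
A small hedged sketch of a pre-flight check for this, assuming the packaging module is available (only the 4.57.6 / 5.0.0 bounds come from the comment above):

```python
# Hedged sketch: abort before conversion if the installed transformers is 5.x,
# which no longer exposes bytes_to_unicode where convert_hf_to_gguf.py expects it.
from importlib.metadata import version
from packaging.version import Version

if Version(version("transformers")) >= Version("5.0.0"):
    raise SystemExit(
        "transformers >= 5.0.0 detected; downgrade first, e.g. "
        "pip install 'transformers==4.57.6'"
    )
```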

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Feb 6, 2026
ggerganov (Member) commented

@pwilkin @ymcki I think this PR broke something. The PPL for Qwen3 Next and Qwen3 Coder Next is completely different before and after. With this PR, the PPL is significantly higher. Have you run validation that the results are consistent with these changes?

ggerganov (Member) commented

Sorry, false alarm. I was comparing the wrong branch. This change is good.

liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* kimi linear model implementation

* kimi linear convert_hf_to_gguf

* kimi linear constants.py tensor_mapping.py

* Kimi Linear ggml.h

* kimi linear ggml-cpu

* Kimi Linear ggml-cuda

* Kimi Linear ggml.c

* kimi linear src/llama

* remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning

* remove type mismatch warning

* read MoE params

* removed some hard coded code

* removed all hard code

* use DeepseekV2 tokenizer

* removed unnecessary internal methods called by the old set_vocab of KimiLinear

* rewrite get_vocab for KimiLinear. Removed all kda_scan code

* removed all traces of kda_scan

* reduce OP count by 1 due to removal of kda_scan

* Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache

* set n_embd_head_k/v to ensure kv cache works

* don't quantize conv1d of Kimi Linear

* Kimi Linear backend agnostic

* removed LOG_INFO

* naive chunking form implemented

* fixed some comments

* add Kimi-K2 specific tokens to be recognized as EOG

* build_kda_autoregressive is implemented to replace build_kda_recurrent for faster inference. sync'd to b7682

* replaced Akk and Aqk with mul_mat and clamp

* no clamp version

* Moved Aqk computation out of the loop

* fixed typo and split wkv_b into wk_b and wv_b

* MLA KV cache support

* fix trailing spaces

* moved const llama_model & model; around to follow qwen3next format and see if it can pass the -Wunused-private-field error

* fix trailing whitespace

* removed trailing whitespace in empty lines + make sure indentation is a multiple of 4

* try to make lint happy

* remove blank lines to make lint happy

* removed at least blank line containing white space

* fixed flake8 complaints locally

* return ggml_tensor * pair in kda_autoregressive and kda_chunking as in ngxson's Qwen3Next improvement

* removed Kimi-Linear specific change that causes failure at server-windows

* removed private: from kimi_linear to make build checks happy

* removed unnecessary ggml_cont before ggml_reshape

* created static function causal_conv1d to abstract similar code for q/k/v

* merged dt_bias to SSM_DT. Do -exp(log_A) in convert_hf_to_gguf.py.

* reverted to original

* fixed find_hparam calls. Fixed e_score_correction_bias to use bias instead of weight. Removed all ssm_conv bias terms.

* remove DT_B from constants.py. remove one comment line in llama-model.cpp

* new class llm_graph_input_mem_hybrid_k to get around the new MLA change. switch the concat order of ggml_concat calls in kimi-linear.cpp to accommodate MLA changes. Removed support for exp_probs_b.weight

* remove ssm_o_norm_b

* remove ssm_o_norm_b

* changed hparams.kda_head_dim to hparams.n_embd_head_kda. added TODO comment for class llama_graph_mem_hybrid_k

* removed all ggml_cont before ggml_reshape_4d

* Whitespace

* replaced all hparams.get with find_hparams

* added new names for n_experts, n_experts_used and score_func in TextModel and removed their code in KimiLinear in convert_hf_to_gguf.py. Removed unnecessary ggml_cont and GGML_ASSERT in kimi-linear.cpp

* use is_mla to switch between different mem_hybrid types

* fixed logical errors in convert_hf_to_gguf.py pointed out by CISC

* removed if else for required parameters kv_lora_rank and qk_rope_head_dim

* add back ggml_cont for Vcur

* minor changes

* removed extra line in llama-vocab.cpp. Added back the comment in llama-graph.cpp

* f16 gguf cannot run without context length

* made a mistake of adding back n_ctx parsing

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026

Labels

ggml (changes relating to the ggml tensor library for machine learning), model (Model specific), python (python script changes)
