
Add Kimi Linear #577

Merged
awni merged 16 commits into ml-explore:main from Blaizzy:main
Nov 6, 2025

Conversation

@Blaizzy (Contributor) commented Oct 30, 2025

No description provided.

@ivanfioravanti (Contributor)

Top! Thanks @Blaizzy

@kernelpool (Contributor)

I wanted to see how quickly I could do this and got something working. Please feel free to reuse anything: kernelpool/mlx-lm@kimi-linear

mlx_lm.generate --model /Volumes/WD_EXTRA/models/catalyst/Kimi-Linear-48B-A3B-Instruct-4bit --prompt "hello" -m 1024 --trust-remote-code
Calling super().encode with {'add_special_tokens': False}
==========
Hello! How can I help you today?
==========
Prompt: 8 tokens, 29.121 tokens-per-sec
Generation: 10 tokens, 47.495 tokens-per-sec
Peak memory: 28.387 GB

@awni (Member) commented Oct 31, 2025

@kernelpool very nice! What about sending a patch to this PR so we can merge it in? Or is it simpler to send a separate PR?

@Blaizzy (Contributor, Author) commented Oct 31, 2025

Thank you very much @kernelpool!🚀

@awni the fixes are merged here 👌🏽

@kernelpool mentioned this pull request Nov 1, 2025
@leisc commented Nov 1, 2025

uv run python -m mlx_lm generate --model mlx-community/Kimi-Linear-48B-A3B-Instruct-4bit --prompt "hello" -m 1024 --trust-remote-code

Fetching 16 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 54251.30it/s]
Calling super().encode with {'add_special_tokens': False}

[~1024 tokens of garbled, incoherent mixed English/Chinese output omitted]

Prompt: 8 tokens, 3.565 tokens-per-sec
Generation: 1024 tokens, 36.677 tokens-per-sec
Peak memory: 28.680 GB

@wyc55069407
[image] How can I use kimi_linear?

@Blaizzy (Contributor, Author) commented Nov 4, 2025

Clone my fork and install from source.

Comment on lines +98 to +100
def _make_gated_delta_kernel_vec(has_mask: bool = False):
if not mx.metal.is_available():
return None
Member:

This looks like a duplicate of the above kernel. Why do we need this? Shouldn't we just reuse the above kernel?

Contributor:

This was based on the FLA implementation, where there's a separate kernel to handle vectorized gating. But yeah, this can be simplified.
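For context on the scalar-vs-vectorized distinction mentioned above, here is a minimal numpy sketch of one gated delta-rule step. The shapes, names, and semantics are assumptions for illustration, not the PR's actual kernel: the only difference between the two variants is whether the log-decay gate `g` is a scalar per head or a per-channel vector.

```python
import numpy as np

def gated_delta_step(S, k, v, beta, g):
    # One step of a gated delta-rule recurrence (illustrative only).
    # S: (d_k, d_v) state; k: (d_k,); v: (d_v,); beta: scalar in [0, 1];
    # g: log-decay, either a scalar (one gate per head) or a (d_k,)
    # vector (per-channel "vectorized" gating, as in the FLA variant).
    decay = np.exp(g)
    S = S * decay if np.ndim(decay) == 0 else S * decay[:, None]
    err = v - S.T @ k                       # delta-rule prediction error
    return S + beta * np.outer(k, err)      # rank-1 state update
```

With scalar gating the state decays uniformly; with vector gating each key channel decays at its own rate, which is why a separate kernel exists in FLA but can be folded into one here.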

Comment on lines +412 to +422
if q.shape[1] > chunk_size:
return chunked_gated_delta_kernel(
q,
k,
v,
g,
beta,
state,
mask,
chunk_size,
)
Member:

What's the purpose of that over just using the gated_delta_kernel directly?
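For readers following the question above: a chunked wrapper's usual job is to split a long prompt into fixed-size chunks while threading the recurrent state between them, trading one big launch for several bounded ones. A generic sketch of that pattern (hypothetical names, not the PR's API):

```python
import numpy as np

def chunked_apply(step_fn, q, k, v, state, chunk_size):
    # Generic chunked-scan pattern (illustrative, hypothetical signature):
    # split the sequence axis into fixed-size chunks and carry the
    # recurrent state from one chunk into the next.
    outs = []
    for start in range(0, q.shape[1], chunk_size):
        sl = slice(start, start + chunk_size)
        out, state = step_fn(q[:, sl], k[:, sl], v[:, sl], state)
        outs.append(out)
    return np.concatenate(outs, axis=1), state
```

Because the state is carried across chunks, the result matches processing the whole sequence at once; chunking changes memory use and scheduling, not values, which is why the wrapper can be dropped if the base kernel already handles long sequences.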

Comment on lines +248 to +410
if not use_kernel or mx.default_device() != mx.gpu or not mx.metal.is_available():
return gated_delta_ops(q, k, v, g, beta, state, mask)
else:
return gated_delta_kernel(q, k, v, g, beta, state, mask)
from . import fused_recurrent_kda as frkda

if q.shape[1] > chunk_size:
return frkda.chunked_kda_ops(q, k, v, g, beta, state, mask, chunk_size)
return frkda.fused_recurrent_kda_ops(q, k, v, g, beta, state, mask)
Member:

It looks like we switched to a different function here (gated_delta_ops replaced by the frkda function). Why? As far as I can see there should be no difference.
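One quick way to settle a "there should be no difference" question like this is a small numerical equivalence harness that feeds both implementations identical random inputs. A sketch, with shapes and the op signature assumed for illustration (not the PR's exact interface):

```python
import numpy as np

def assert_ops_equivalent(op_a, op_b, T=17, d_k=8, d_v=8, atol=1e-5, seed=0):
    # Check two implementations of the same recurrence agree numerically.
    # Signature assumed: op(q, k, v, g, beta, S0) -> (outputs, final_state).
    rng = np.random.default_rng(seed)
    q, k = rng.normal(size=(2, T, d_k))
    v = rng.normal(size=(T, d_v))
    g = -np.abs(rng.normal(size=T))       # log-decay gates <= 0
    beta = rng.uniform(0.0, 1.0, size=T)  # delta-rule step sizes
    S0 = np.zeros((d_k, d_v))
    out_a, S_a = op_a(q, k, v, g, beta, S0.copy())
    out_b, S_b = op_b(q, k, v, g, beta, S0.copy())
    assert np.allclose(out_a, out_b, atol=atol)
    assert np.allclose(S_a, S_b, atol=atol)
```

A mismatch here would justify the new function; agreement would support folding it back into the existing gated delta ops.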

@awni (Member) commented Nov 5, 2025

The changes to the gated_delta file need some work. In particular, there are new functions and kernels and it's not clear why. The circular dependencies between the two files are not ideal either.

It would be much cleaner to reuse the existing operations (which should be doable). If there is an efficiency implication I'd love to know more.

Not sure who worked on that @Blaizzy or @kernelpool, would one of you be up for improving that?

@kernelpool (Contributor)
Sure, I'll take a look!

@Blaizzy (Contributor, Author) commented Nov 5, 2025

Hey @awni

Yes, the kernels are yet to be optimized. I personally believe we can simplify them, and we should keep them in the same file as the model until we see more models using them.

So far I have optimized the overall model code (from 2 tok/s to 70 tok/s in bf16), but my plate is full this week, so I will only be able to pick this up over the weekend.

@awni (Member) commented Nov 5, 2025

Regardless of whether they can be optimized, I prefer not to use new kernels and ops but rather the existing ones we already have.

should have it in the same file as the model

It looks like these are the same operations as what we have for Qwen3 Next so I would keep them in the gated delta file.

@Blaizzy (Contributor, Author) commented Nov 5, 2025

Regardless of whether they can be optimized, I prefer not to use new kernels and ops but rather the existing ones we already have.
It looks like these are the same operations as what we have for Qwen3 Next so I would keep them in the gated delta file.

Yes, I prefer that too. Unfortunately, I didn't work on the kernels, which is why I wanted to use the weekend to dive deep into the codebase.

@kernelpool (Contributor)

I pushed a PR that simplifies and unifies the kernels and removes the chunking. I also used @ivanfioravanti's benchmark script to measure the differences between the commits (3874bc6 is the head of the PR):

[comparison_chart: benchmark comparison across commits]

@awni (Member) left a review:

Looks great, thanks for the contributions @kernelpool and @Blaizzy

awni merged commit 3833c20 into ml-explore:main on Nov 6, 2025
4 checks passed
Blaizzy changed the title from [WIP] Add Kimi Linear to Add Kimi Linear on Nov 24, 2025

6 participants