Conversation

Top! Thanks @Blaizzy

I wanted to see how quickly I could do this and got something working. Please feel free to reuse anything: kernelpool/mlx-lm@kimi-linear

@kernelpool very nice! What about sending a patch to this and we can merge it in? Or is it simpler to send a separate PR?

Thank you very much @kernelpool! 🚀 @awni the fixes are merged here 👌🏽

```shell
uv run python -m mlx_lm generate --model mlx-community/Kimi-Linear-48B-A3B-Instruct-4bit --prompt "hello" -m 1024 --trust-remote-code
```

Fetching 16 files: 100% 16/16 [00:00<00:00, 54251.30it/s]

Clone my fork and install from source.
mlx_lm/models/gated_delta.py
Outdated
```python
def _make_gated_delta_kernel_vec(has_mask: bool = False):
    if not mx.metal.is_available():
        return None
```
This looks like a duplicate of the above kernel. Why do we need this? Shouldn't we just reuse the above kernel?
This was based on the FLA implementation, where there's a separate kernel to handle vectorized gating. But yeah, this can be simplified.
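For context, the recurrence these kernels fuse can be written as an unfused, per-token reference in plain NumPy. This is an illustrative sketch of the gated delta rule (scalar decay `g`, delta-rule write strength `beta`); the function name and layout are not mlx-lm's actual API:

```python
import numpy as np

def gated_delta_ref(q, k, v, g, beta, state=None):
    """Unfused gated delta rule for a single head (illustrative sketch).

    q, k: (T, d_k); v: (T, d_v); g, beta: (T,) per-step decay and write
    strength; state: optional (d_k, d_v) fast-weight matrix carried across
    calls. A fused kernel would compute the same recurrence without
    materializing per-token intermediates.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v)) if state is None else state.copy()
    out = np.empty((T, d_v))
    for t in range(T):
        S = g[t] * S                                         # decay (gating)
        S = S + beta[t] * np.outer(k[t], v[t] - S.T @ k[t])  # delta-rule write
        out[t] = S.T @ q[t]                                  # query readout
    return out, S
```

With `g[t] = 1` and `beta[t] = 1` this reduces to the plain delta rule, so it doubles as a cheap reference to check any fused variant against.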
mlx_lm/models/gated_delta.py
Outdated
```python
if q.shape[1] > chunk_size:
    return chunked_gated_delta_kernel(
        q,
        k,
        v,
        g,
        beta,
        state,
        mask,
        chunk_size,
    )
```
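For what it's worth, chunked processing with the recurrent state carried across chunk boundaries is exactly equivalent to a single full pass, so the choice is purely about kernel efficiency. A small NumPy sketch (illustrative names and simplified scalar-gated math, not the repo's kernels) demonstrates the equivalence:

```python
import numpy as np

def step(S, q, k, v, g, beta):
    # One gated-delta step (illustrative math, not the repo's exact kernel).
    S = g * S                                # decay the fast-weight state
    S = S + beta * np.outer(k, v - S.T @ k)  # delta-rule correction
    return S, S.T @ q                        # updated state, readout

def run(qs, ks, vs, gs, betas, S):
    outs = []
    for t in range(len(qs)):
        S, o = step(S, qs[t], ks[t], vs[t], gs[t], betas[t])
        outs.append(o)
    return np.stack(outs), S

rng = np.random.default_rng(0)
T, C, dk, dv = 8, 3, 4, 5  # sequence length, chunk size, key/value dims
qs, ks = rng.normal(size=(T, dk)), rng.normal(size=(T, dk))
vs = rng.normal(size=(T, dv))
gs, betas = rng.uniform(0.8, 1.0, T), rng.uniform(0.1, 0.9, T)

full, _ = run(qs, ks, vs, gs, betas, np.zeros((dk, dv)))

# Chunked pass: process C tokens at a time, carrying the state forward.
pieces, S = [], np.zeros((dk, dv))
for i in range(0, T, C):
    o, S = run(qs[i:i+C], ks[i:i+C], vs[i:i+C], gs[i:i+C], betas[i:i+C], S)
    pieces.append(o)
chunked = np.concatenate(pieces)
assert np.allclose(full, chunked)  # identical to the single full pass
```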
What's the purpose of that over just using the gated_delta_kernel directly?
mlx_lm/models/gated_delta.py
Outdated
```diff
-if not use_kernel or mx.default_device() != mx.gpu or not mx.metal.is_available():
-    return gated_delta_ops(q, k, v, g, beta, state, mask)
-else:
-    return gated_delta_kernel(q, k, v, g, beta, state, mask)
+from . import fused_recurrent_kda as frkda
+
+if q.shape[1] > chunk_size:
+    return frkda.chunked_kda_ops(q, k, v, g, beta, state, mask, chunk_size)
+return frkda.fused_recurrent_kda_ops(q, k, v, g, beta, state, mask)
```
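The routing under discussion follows a common three-way dispatch pattern: plain array ops as the portable fallback, a chunked kernel for long sequences, and a fused recurrent kernel for short ones. Schematically (pure Python with hypothetical names, not mlx-lm's actual predicates):

```python
def route(seq_len, chunk_size, kernel_ok):
    """Schematic dispatch, not mlx-lm's actual code.

    kernel_ok stands in for predicates like "device is GPU and Metal is
    available"; the strings stand in for the three implementations.
    """
    if not kernel_ok:          # no usable fused kernel: plain-ops fallback
        return "ops"
    if seq_len > chunk_size:   # long prompt: process in chunks
        return "chunked"
    return "fused"             # short sequence: single fused recurrent pass
```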
It looks like we switched to a different function here (gated_delta_ops replaced by the frkda function). Why? As far as I can see there should be no difference.

It would be much cleaner to reuse the existing operations (which should be doable). If there is an efficiency implication I'd love to know more. Not sure who worked on that, @Blaizzy or @kernelpool; would one of you be up for improving that?

Sure, I'll take a look!

Hey @awni, yes, the kernels are yet to be optimized. I personally believe we can simplify them, and they should stay in the same file as the model until we see more models using them. So far I optimized the overall model code (2 tok/s to 70 tok/s in bf16), but I have my plate full this week, so I will only be able to pick it up during the weekend.

Regardless of whether they can be optimized, I prefer not to add new kernels and ops, but rather to use the existing ones we already have. It looks like these are the same operations as what we have for Qwen3 Next, so I would keep them in the gated delta file.
Yes, I prefer that too. Unfortunately, I didn't work on the kernels, and that's why I wanted to use the weekend to dive deep into the codebase.

I pushed a PR that simplifies and unifies the kernels, with the chunking removed. I also used @ivanfioravanti's benchmark script to measure differences between the commits (3874bc6 is the head of the PR).

awni left a comment:
Looks great, thanks for the contributions @kernelpool and @Blaizzy
# Conflicts:
#	mlx_lm/models/kimi_linear.py

