
Feature/kimi linear support#17592

Closed
cacaview wants to merge 21 commits into ggml-org:master from cacaview:feature/kimi-linear-support

Conversation

@cacaview

Make sure to read the contributing guidelines before submitting a PR
This is the current work progress:
#16930 (comment)

cacaview and others added 5 commits November 28, 2025 23:42
- Implement KDA layer (linear attention with gates and decay)
- Implement MLA layer (multi-head latent attention with KV compression)
- Support MoE FFN with shared experts
- Add TikToken tokenizer support for Kimi models
- Fix vocab loading for large vocabularies
- Model loads and runs inference (27 layers, 603 tensors)
- Add missing MoE metadata to GGUF conversion:
  - moe_intermediate_size (1024)
  - num_shared_experts (1)
  - first_k_dense_replace (1)
  - routed_scaling_factor (2.446)
  - expert_gating_func (sigmoid)

- Fix MoE gating function default to SIGMOID (was SOFTMAX)
- Add expert_weights_scale loading with default 2.446
- Enable moe_renormalize (norm_w=true) in build_moe_ffn
- Add fallback for exp_probs_b tensor suffix compatibility
- Add KDA (Kimi Delta Attention) CUDA kernel (kda-scan.cu)
- Fix recurrence order: decay first, then retrieval
- Verify CPU/CUDA implementation consistency
- Support head_dim=128, L2 normalization for Q/K
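
The commit notes above mention the fixed recurrence order (decay first, then retrieval) and L2-normalized Q/K. As a rough illustration of that ordering, here is a minimal NumPy sketch of a gated delta-rule scan; the shapes, gate semantics, and the `beta` write strength are illustrative assumptions, not the actual kernel layout:

```python
import numpy as np

def kda_scan_naive(q, k, v, decay, beta):
    """Illustrative per-token delta-rule recurrence: decay first, then retrieval.

    Shapes (single head, for illustration only):
      q, k:  (T, d_k)  queries/keys, assumed already L2-normalized
      v:     (T, d_v)  values
      decay: (T, d_k)  per-channel forget gate in (0, 1)
      beta:  (T,)      write strength
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))          # recurrent state
    out = np.zeros((T, d_v))
    for t in range(T):
        S = decay[t][:, None] * S     # 1) apply decay to the state first
        pred = S.T @ k[t]             # 2) then retrieve the current prediction
        S = S + beta[t] * np.outer(k[t], v[t] - pred)  # delta-rule write
        out[t] = S.T @ q[t]           # read the state with the query
    return out
```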
@github-actions github-actions bot added labels on Nov 29, 2025: model (Model specific), Nvidia GPU (Issues specific to Nvidia GPUs), python (python script changes), ggml (changes relating to the ggml tensor library for machine learning)
Comment on lines +2729 to 2733
# KimiLinearModel is defined later in this file (line ~5140) as a TextModel subclass
# This old definition has been removed to avoid conflicts


@ModelBase.register(
Member


Suggested change:
-# KimiLinearModel is defined later in this file (line ~5140) as a TextModel subclass
-# This old definition has been removed to avoid conflicts
-@ModelBase.register(
+@ModelBase.register(

@@ -5108,8 +5116,298 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
(self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_K, bid), k),
(self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_V, bid), v),
]
Member


Suggested change:
-]
+]
+else:
+    return [(self.map_tensor_name(name), data_torch)]

@ModelBase.register("KimiLinearModel", "KimiLinearForCausalLM")
class KimiLinearModel(TextModel):
"""Kimi-Linear model with hybrid MLA+KDA architecture"""
model_arch = gguf.MODEL_ARCH.KIMI
Member


Suggested change:
-model_arch = gguf.MODEL_ARCH.KIMI
+model_arch = gguf.MODEL_ARCH.KIMI_LINEAR

_experts: list[dict[str, Tensor]] | None = None

def set_gguf_parameters(self):
self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])
Member


Suggested change:
-self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])
+super().set_gguf_parameters()
+self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])

Comment on lines +5131 to +5139
# Use find_hparam for context length
# Kimi uses model_max_length
n_ctx = self.find_hparam(["max_position_embeddings", "model_max_length", "n_ctx", "n_positions"], optional=True)
if n_ctx is not None:
    self.gguf_writer.add_context_length(n_ctx)
else:
    # Default to 4096 if not found
    logger.warning("No context length found in config, defaulting to 4096")
    self.gguf_writer.add_context_length(4096)
Member


Add model_max_length to TextModel.set_gguf_parameters instead, the fallback is not necessary.

@cacaview
Author

I have fixed these errors in the commit at cacaview@780dd78

@CISC
Member

CISC commented Nov 30, 2025

> I have fixed these errors in the commit at cacaview@780dd78

Please address the remaining unresolved ones as well.

@cacaview
Author

cacaview commented Dec 1, 2025

I conducted some simple tests and encountered some issues. The root causes are still unclear.

Test Environment

  • Model: E:\llama\Kimi-Linear-48B-A3B-Instruct\Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf

  • Quantization: Q4_K_M (27.65 GiB, 4.83 BPW)

  • GPU: NVIDIA GeForce RTX 4070 (12GB VRAM)

  • Parameters: --temp 0 -ngl 99 -no-cnv --no-warmup

  • Prompt Format: user: <question>\nassistant:

Test Logs

Test 1: Simple Greeting


user: Hello!

assistant: Hello! How can I help you today? [end of text]
common_perf_print: prompt eval time =   10517.48 ms /    23 tokens (  457.28 ms per token,     2.19 tokens per second)
common_perf_print:        eval time =    8790.42 ms /     9 runs   (  976.71 ms per token,     1.02 tokens per second)
common_perf_print:        total time =   19359.92 ms /    32 tokens

Test 2: Simple Math

user: What is 25 + 37?

assistant: Let's add the numbers step by step:

- 25 + 25 = 50

So the answer is **50**. [end of text]

common_perf_print: prompt eval time =    1644.97 ms /    33 tokens (   49.85 ms per token,    20.06 tokens per second)
common_perf_print:        eval time =    4486.75 ms /    25 runs   (  179.47 ms per token,     5.57 tokens per second)
common_perf_print:        total time =    6195.25 ms /    58 tokens

Incorrect calculation: 25 + 37 = 62, not 50.

Test 3: Knowledge Q&A

user: What is the capital of France?

assistant: The capital of France is Paris. [end of text]

common_perf_print: prompt eval time =    1561.97 ms /    47 tokens (   33.23 ms per token,    30.09 tokens per second)
common_perf_print:        eval time =    1179.26 ms /     7 runs   (  168.47 ms per token,     5.94 tokens per second)
common_perf_print:        total time =    2761.82 ms /    54 tokens

Test 4: Chinese Test

user: 请用中文介绍一下你自己

assistant: I see that you've copied the entire text. This is a very important message. I will now...

Chinese input encountered encoding issues in PowerShell, and the model failed to process Chinese correctly.

Test 5: Code Generation


user: Write a Python function to calculate factorial

assistant: Here's a simple and efficient implementation using recursion:

```python
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)

# Example usage
print(factorial(5))  # Output: 120
```

---

### Explanation:

- **Base Case**: The function checks if `n` is 0, in which case it returns 1.

- **Recursive Case**: It calls itself with `n-1` to compute the factorial.

This is a classic example of a recursive function. [end of text]

common_perf_print: prompt eval time =    2516.11 ms /    63 tokens (   39.94 ms per token,    25.04 tokens per second)
common_perf_print:        eval time =   20525.99 ms /   117 runs   (  175.44 ms per token,     5.70 tokens per second)
common_perf_print:        total time =   23294.02 ms /   180 tokens

Test 6: Concept Explanation (Repetitive Output)


user: Explain the concept of recursion in programming with an example.

assistant:# The user is asking for an explanation of recursion in programming. The user is asking for a simple explanation of recursion.

The user is asking: "What is recursion?" or "What is recursion?" or "What is recursion?" or "What is recursion?" or " " " " " " " " " " " " " " " " " " " "

A severe repetitive-output issue occurred.

Test 7: Logical Reasoning

user: If all roses are flowers and some flowers fade quickly, can we conclude that some roses fade quickly?

assistant: Yes, we can conclude that some roses fade quickly. Let's break it down step by step.

1. **Understanding the Problem**:  
   - **Premise**: All roses are flowers.  
   - **Conclusion**: Some roses are flowers.  

2. **Logical Form**:  
   - **Premise**: All roses are flowers.  
   - **Conclusion**: Therefore, roses are flowers.  

3. **Alternative Approach**:  

   - **Logical Form**:  

     - **

common_perf_print: prompt eval time =    4170.31 ms /   118 tokens (   35.34 ms per token,    28.30 tokens per second)
common_perf_print:        eval time =   16314.01 ms /    99 runs   (  164.79 ms per token,     6.07 tokens per second)
common_perf_print:        total time =   20693.48 ms /   217 tokens

Incorrect logical reasoning. The correct answer should be "Cannot be determined".

@engrtipusultan

engrtipusultan commented Dec 2, 2025

@CISC is the earlier post a valid method to check the correctness of a model's implementation?
Also, for llama.cpp, even if the following is used, don't the default values of top-p and top-k add variance to the response?

#16938 (reply in thread)

Parameters used in the earlier post: --temp 0 -ngl 99 -no-cnv --no-warmup

Edit: quote removed. Reference added for proper greedy decoding.

@pwilkin
Contributor

pwilkin commented Dec 2, 2025

@engrtipusultan please don't quote huge posts like that; it makes the thread super hard to read.

The steps to verify model conversion faithfulness are:

  • basic: run make causal-verify-logits from examples/model-conversion to check logits for single token generation on a simple prompt - this is the absolute baseline, no use making further tests until this is correct
  • intermediate: test coherence on long prompts and long generation, output identity with greedy decoding on simple prompts
  • advanced: compare hard metrics such as IFEval
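
For the basic step, the idea is simply that the converted model's logits for the same prompt should match the reference implementation's. A hypothetical sketch of such a check follows; the function name, tolerance, and top-k comparison are illustrative assumptions, not what the causal-verify-logits target actually implements:

```python
import numpy as np

def logits_agree(ref_logits, test_logits, topk=10, atol=1e-2):
    """Check that two logit vectors for the same next-token prediction agree.

    ref_logits:  logits from the reference run (e.g. the original model)
    test_logits: logits from the converted model
    """
    ref = np.asarray(ref_logits, dtype=np.float64)
    test = np.asarray(test_logits, dtype=np.float64)
    # Greedy decoding only depends on the argmax, so check that first.
    if ref.argmax() != test.argmax():
        return False
    # The top-k token sets should also match under small numeric drift.
    if set(np.argsort(ref)[-topk:]) != set(np.argsort(test)[-topk:]):
        return False
    # Compare values after subtracting the max (logits are shift-invariant
    # under softmax).
    return bool(np.allclose(ref - ref.max(), test - test.max(), atol=atol))
```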

@cacaview
Author

cacaview commented Dec 3, 2025

It would be great if someone has a high-end server or workstation to look into this issue. The 48B model is extremely large, making it very difficult to debug on my computer.

@pwilkin
Contributor

pwilkin commented Dec 3, 2025

@cacaview https://gist.github.com/pwilkin/2b917bed6bbabe9fcefa14f7fe7a4bd2 <= you can use this to create a small mock Kimi model, which you can then convert and compare tensor dumps.

@engrtipusultan

> It would be great if someone has a high-end server or workstation to look into this issue. The 48B model is extremely large, making it very difficult to debug on my computer.

Are you still working on it?

cacaview added 3 commits December 8, 2025 23:34
Add debug dump points throughout the KDA and MLA layers to enable
tensor inspection during inference:

KDA Layer:
- Conv states (q, k, v) before processing
- Q, K, V after conv1d + SiLU
- SSM state before and after KDA scan
- Output gate (g2)

MLA Layer:
- Added detailed comments mapping tensor names to vLLM equivalents
- Q projection, KV compression, attention output

These callbacks help verify correctness against reference implementations.
Copilot AI review requested due to automatic review settings December 8, 2025 15:38
@cacaview
Author

cacaview commented Dec 8, 2025

It's a bit odd: after recompiling and testing, the previous garbled-output issue can no longer be reproduced.

Tests Passed:

  • Simple math (1+1= → 2, 25+37 → 62)
  • Dialogue and concept explanation
  • Logits are normal, no NaN/Inf values

Known Issues:

  • Flash Attention for the MLA layer is unavailable; it automatically falls back to the standard implementation.
  • For long generations, it is recommended to use --repeat-penalty 1.1 to prevent repetition.

Currently, tests have only been conducted on CPU and CUDA.


Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@pwilkin
Contributor

pwilkin commented Jan 5, 2026

Just a heads up though: without chunking, tri_solve will be very slow on batch size 512 and impossibly slow on batch size 1024 (check with a longer prompt and -b 1024).

@IIIIIllllIIIIIlllll

I uploaded a 20,000-token file, and it crashed and exited :(
I will provide a more detailed log tomorrow.

@ymcki
Contributor

ymcki commented Jan 6, 2026

I just committed an implementation of the chunked form. It is a naive implementation based on naive_chunk_kda from
https://github.com/fla-org/flash-linear-attention/blob/main/fla/ops/kda/naive.py
Based on my understanding, this is an extended version of the chunked form currently implemented in qwen3next.cpp. There is a supposedly more advanced version implemented in chunk_kda_fwd in
https://github.com/fla-org/flash-linear-attention/blob/main/fla/ops/kda/chunk.py
but I presume that is better left for other people to play with in another PR.
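
For intuition, the relation between the recurrent and chunked forms can be sketched as follows. This toy version only shows the state hand-off between chunks and keeps a per-token inner loop; the real chunked kernel replaces that inner loop with batched matrix operations per chunk. All shapes and gate semantics here are illustrative assumptions:

```python
import numpy as np

def delta_step(S, k, v, decay, beta):
    """One gated delta-rule step: decay the state, then write the delta."""
    S = decay[:, None] * S
    S = S + beta * np.outer(k, v - S.T @ k)
    return S

def scan_sequential(q, k, v, decay, beta):
    """Token-by-token recurrence over the whole sequence."""
    S = np.zeros((k.shape[1], v.shape[1]))
    out = np.empty_like(v)
    for t in range(len(q)):
        S = delta_step(S, k[t], v[t], decay[t], beta[t])
        out[t] = S.T @ q[t]
    return out

def scan_chunked(q, k, v, decay, beta, chunk=4):
    """Same recurrence, processed chunk by chunk with the state carried over.

    Mathematically equivalent to scan_sequential; the payoff of real chunking
    is turning the inner loop into dense matmuls over each chunk.
    """
    S = np.zeros((k.shape[1], v.shape[1]))
    out = np.empty_like(v)
    for c0 in range(0, len(q), chunk):
        for t in range(c0, min(c0 + chunk, len(q))):
            S = delta_step(S, k[t], v[t], decay[t], beta[t])
            out[t] = S.T @ q[t]
    return out
```

Checking scan_chunked against scan_sequential on random inputs is a quick way to sanity-check a chunked kernel against the plain recurrence.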

Here are two examples that trigger chunked mode: one with a 74-token prompt (>= 65 tokens are needed to trigger chunking) and one with a >10k-token prompt.
./build/bin/llama-cli -m ~/Kimi-Linear-48B-A3B-Instruct-GGUF/Kimi-Linear-48B-A3B-Instruct.Q4_K_M.gguf -c 16384 -cmoe -ngl 100

Testing much longer prompts is beyond the capability of my machine.

Prompt (74 tokens):

Why could Philip IV of France install Pope of his liking that resulted in Avignon Papacy? He then used the power of the Pope to destroy Knights Templar. How come other Christian countries could not prevent Avignon Papacy? How come they didn't resist and save Knights Templar from prosecution? Please write a 6,000 words essay about this topic.

Response:

**The Avignon Papacy and the Fall of the Knights Templar: A Comprehensive Analysis of Political Realpolitik in 14th-Century Christendom**

Introduction

The period from 1305 to 1377, known as the Avignon Papacy, represents one of the most significant ruptures in the history of the Catholic Church. During this time, the papacy was relocated from Rome to Avignon, France, under the influence of Philip IV of France, known as Philip the Fair. The papacy's move to Avignon was not merely a symbolic shift; it was a calculated political maneuver that fundamentally altered the balance of power between the Papacy, secular monarchs, and European Christendom. The Avignon Papacy was marked by the appointment of popes favorable to the French crown, the manipulation of papal authority to suppress dissenting orders such as the Knights Templar, and the erosion of papal independence from secular influence.

This essay explores the question: Why could Philip IV of France install a pope of his liking, leading to the Avignon Papacy, and how did this enable the destruction of the Knights Templar? Additionally, we will examine why other Christian monarchies did not resist this consolidation of papal authority under French control, and why the Templars, despite their wealth, military prowess, and widespread influence, could not withstand the coordinated attack launched against them.


I. The Context: Papal Authority in the Late 13th Century

To understand the Avignon Papacy, we must first consider the political and institutional context of the late 13th century. The Papacy had long claimed spiritual supremacy over Christendom, but its actual influence was often contested by secular rulers. The Investiture Controversy of the 11th and 12th centuries had already demonstrated the tensions between papal authority and monarchical power. By the 13th century, the Papacy had regained some prestige through the efforts of Pope Innocent III (1198–1216), who asserted papal supremacy over kings and emperors. However, the Papacy remained vulnerable to political pressure, especially from powerful monarchs like the Capetian kings of France.


II. Philip IV and the Control of the Papacy: 1285–1314

Philip IV, born in 1268, came to the throne in 1285 and quickly established himself as one of the most powerful monarchs in Europe. His reign marked a turning point in the relationship between the French monarchy and the Papacy. Unlike his predecessors, Philip was not content to be a vassal of the Pope; he sought to dominate the Papacy, not merely influence it.

A. The Election of 1305 and the Selection of Clement V

Philip’s first major move was to influence the papal election of 1305. When Pope Boniface VIII died in 1303, Philip had been instrumental in his downfall and eventual death. Boniface’s successor, Benedict XI, lasted only eight months before dying under suspicious circumstances. Philip then pressured the College of Cardinals to elect a Frenchman, Bertrand de Got, as pope. Clement V was crowned on November 14, 1305, in Lyon, and immediately after his election, he issued a series of bulls that placated Philip, including the bull Pastoralis praeeminentiae (1302), which asserted papal authority over all Christendom.

Clement V’s reign was marked by political expediency. He avoided Rome, never returned to the city, and instead established a papal court in Avignon. This move was not merely logistical; it was a symbolic assertion of papal subordination to French authority. Avignon, a papal enclave surrounded by French territory, became a gilded cage for the papacy.

B. Financial and Legal Control

Philip IV did not stop at influencing papal elections. He also sought to control the financial and legal operations of the Papacy. Through the Papal Taxation of 1306, Philip imposed a new tax system on the clergy of France, effectively subjecting the Papacy to French financial oversight. The Papacy, dependent on French resources, became increasingly beholden to Philip’s will.


III. The Avignon Papacy: Political and Religious Implications

The Avignon Papacy lasted from 1305 to 1377, during which time seven French popes ruled from Avignon. This period was characterized by a marked decline in papal prestige and independence.

A. Centralization and Bureaucratization

The Avignon Papacy centralized papal administration and bureaucratized its operations. This made the Papacy more efficient but also more susceptible to political manipulation. The popes of Avignon were more concerned with temporal power than spiritual leadership, leading to widespread criticism from reformers and laypeople alike.

B. Weakening of Papal Authority

The Avignon Papacy weakened the Papacy’s moral authority. The Church became associated with French national interests, and papal decisions were increasingly seen as politically motivated. This erosion of spiritual credibility made it easier for monarchs like Philip IV to justify suppressing religious orders that threatened their authority.


IV. The Knights Templar: A Threat to Royal Power?

The Knights Templar, founded in 1119, had grown into a powerful military and financial institution. They owned vast estates across Europe, acted as bankers for kings, and wielded significant influence. Their wealth and independence made them a target for monarchs seeking to consolidate power.

A. Financial and Political Ambitions

By the early 14th century, the Templars had amassed enormous wealth, including land, gold, and international banking connections. Philip IV, deeply in debt to the Templars, saw them as a threat to his financial and political control. He had already issued the Ordinance on the Mint in 1296, which restricted the Templars’ banking activities, and in 1306, he attempted to seize their assets.

B. The Arrest and Suppression

On October 13, 1307, Philip ordered the arrest of all Templars in France. They were accused of heresy, blasphemy, and corruption. Under pressure from Philip, Pope Clement V issued the papal bull Pastoralis cautelis (1312), which dissolved the order and transferred its assets to the Knights Hospitaller.

The Templars were tried by the papal tribunal at Avignon, but Philip’s influence ensured their eventual suppression. The Templars were disbanded, and their property was seized, further enriching the French crown.


V. Why Did Other Christian Monarchs Not Resist?

Despite the consolidation of papal authority under French control, other Christian monarchs did not mount a significant resistance. Several factors explain this.

A. Lack of Unity Among Monarchs

Europe was divided among numerous kingdoms, each with its own interests. There was no unified European resistance to French dominance. England, Germany, and Italy were politically fragmented and often at war with each other. The Holy Roman Empire, in particular, was weak and divided.

B. Dependence on French Support

Many monarchs relied on French support or were indebted to the French crown. Philip IV’s influence extended beyond France, and his diplomatic and military power deterred opposition.

C. Papal Legitimacy and Fear of Schism

The Papacy still held significant spiritual authority. Even if monarchs resented papal interference, they feared the consequences of challenging papal legitimacy. A break with Rome could lead to schism and religious instability.
D. Economic and Political Calculations

Some monarchs may have benefited from the suppression of the Templars, as it redistributed wealth and power in ways that aligned with their interests. Others may have calculated that resistance would be costly and ineffective.


VI. The Long-Term Consequences of the Avignon Papacy

The Avignon Papacy had profound and lasting effects on the Catholic Church and European politics.

A. The Great Schism (1378–1417)

The Avignon Papacy ended with the election of Pope Gregory XI, who returned to Rome in 1377. However, this did not resolve the crisis. The election of rival popes in Rome and Avignon led to the Great Schism, which further weakened the Papacy’s authority.

B. Rise of National Churches

The Avignon Papacy contributed to the rise of national churches and the decline of papal supremacy. Monarchs increasingly asserted control over religious affairs within their territories, laying the groundwork for the Reformation.

C. Reform Movements

The corruption and political entanglements of the Avignon Papacy sparked calls for reform. These movements eventually led to the Conciliar movement and the eventual decentralization of authority within the Church.


VII. The Knights Templar: Legacy and Myth

Despite their suppression, the Knights Templar have remained a subject of fascination and myth. They were portrayed as martyrs, guardians of hidden secrets, and victims of conspiracy. While many of these narratives are exaggerated, the Templars did represent a unique blend of military, spiritual, and financial power that challenged the medieval order.


VIII. Conclusion: The Politics of Power in Medieval Christendom

The Avignon Papacy and the fall of the Knights Templar were not isolated events but part of a broader struggle for power in medieval Christendom. Philip IV of France’s ability to install popes of his liking and dismantle the Templars was not merely a triumph of royal ambition; it was the result of a complex web of political, financial, and institutional factors.

Other Christian monarchs did not resist these developments because they were fragmented, indebted, or complicit. The Papacy, though still spiritually revered, had become a tool of French policy. The suppression of the Templars, while justified by heresy, was also a strategic move to consolidate wealth and power.

The Avignon Papacy marked the beginning of the end for the medieval Papacy’s universal authority. It set the stage for the Reformation, the rise of national churches, and the eventual secularization of European politics. In this context, the story of Philip IV, the Avignon Papacy, and the fall of the Knights Templar is not just a tale of ecclesiastical politics, but a reflection of the broader transformations that shaped the modern world.


Bibliography

Primary Sources:

  • Pastoralis praeeminentiae (1302), Pope Clement V
  • Pastoralis cautelis (1312), Pope Clement V
  • Regesta (Registers of the Papacy)
  • Chronicles of Jean de Joinville, Matthew Paris, and other medieval historians

Secondary Sources:

  • Barber, Malcolm. The New Knighthood: A History of the Order of the Temple. Cambridge University Press, 1994.
  • Burgtorf, Jochen. The Sovereign Military Order of Malta and the Knights Templar: A Comparative Study of Their Political and Social Structures in the Middle Ages. Brill, 2013.
  • Christiansen, Karl. The Northern Crusades. Princeton University Press, 1997.
  • Cohn, Samuel K. Popular Protests in Late Medieval Europe. Manchester University Press, 2004.
  • Ford, Peter. The Knights Templar: The History and Legacy of the Most Famous Military Order. Charles River Editors, 2014.
  • Housley, Norman. The Italian Crusades: The Papal-Australian Alliance and the Crusade in the Fifteenth Century. Oxford University Press, 1982.
  • Jotischky, Andrew. Crusading and the Crusader States. Pearson Education, 2004.
  • Partner, Peter. The Murdered Magicians: The Templars and Their Myth. Blackwell Publishers, 1982.
  • Pernoud, Regine. The Knights Templar. Carroll & Graf Publishers, 2001.
  • Riley-Smith, Jonathan. The Crusades: A Short History. Yale University Press, 1987.
  • Tillier, Michel. Les Templiers: Un ordre en questions. Publications de l’Université de Toulouse, 2003.
  • Upton-Ward, Julie. The Rule of the Templars: The French Tradition, 1120–1292. Boydell Press, 1992.
  • Wagner, John A. Encyclopedia of the Hundred Years’ War. Greenwood Press, 2006.

If you would like a shorter version or a version formatted for academic submission, I can adjust accordingly.

>10k-token prompt to summarize Wikipedia's Nuclear option article:

user: Please summarize the following wikipedia article:

In the United States Senate, the nuclear option is a legislative procedure that allows the Senate to override a standing rule by a simple majority, avoiding the three-fifths supermajority normally required to invoke cloture on a measure. The term "nuclear option" is an analogy to nuclear weapons being the most extreme option in warfare.

The nuclear option can be invoked by a senator raising a point of order that contravenes a standing rule. The presiding officer would then overrule the point of order based on Senate rules and precedents; this ruling would then be appealed and overturned by a simple majority vote (or a tie vote), establishing a new precedent. The nuclear option is made possible by the principle in Senate procedure that appeals from rulings of the chair on points of order relating to nondebatable questions are themselves nondebatable. The nuclear option is most often discussed in connection with the filibuster. Since cloture is a nondebatable question, an appeal in relation to cloture is decided without debate. This obviates the usual requirement for a two-thirds majority to invoke cloture on a resolution amending the Standing Rules.

The nuclear option was invoked on November 21, 2013, when a Democratic majority led by Harry Reid used the procedure to reduce the cloture threshold for nominations, other than nominations to the Supreme Court, to a simple majority. On April 6, 2017, the nuclear option was used again, this time by a Republican majority led by Mitch McConnell, to extend that precedent to Supreme Court nominations, in order to enable cloture to be invoked on the nomination of Neil Gorsuch by a simple majority.

The use of the nuclear option to abolish the 60-vote threshold for cloture on legislation has been proposed, but not successfully effected.
Procedure to invoke the nuclear option

On November 21, 2013, following a failed cloture vote on a nomination, the nuclear option was used, as follows:

Mr. REID. I raise a point of order that the vote on cloture under Rule XXII for all nominations other than for the Supreme Court of the United States is by majority vote.
The PRESIDENT pro tempore. Under the rules, the point of order is not sustained.
Mr. REID. I appeal the ruling of the Chair and ask for the yeas and nays.
(48–52 vote on sustaining the decision of the chair)
The PRESIDENT pro tempore. The decision of the Chair is not sustained.
The PRESIDENT pro tempore. *** Under the precedent set by the Senate today, November 21, 2013, the threshold for cloture on nominations, not including those to the Supreme Court of the United States, is now a majority. That is the ruling of the Chair.

Once the presiding officer rules on the point of order, if the underlying question is nondebatable, any appeal is decided without debate. A simple majority is needed to sustain a decision of the chair. As the appeal is nondebatable, there is no supermajority requirement for cloture, as would be necessary for a proposition amending the rules. The presiding officer and the standing rule can therefore be overruled by a simple majority. This procedure establishes a new precedent that supersedes the plain text of the Standing Rules. These precedents will then be relied upon by future presiding officers in determining questions of procedure.

The procedure may, for example, override requirements of Rule XXII, the cloture rule, in order to allow a filibuster to be broken without the usual 60-vote requirement.
Background
The 60-vote rule

Originally, the Senate's rules did not provide for a procedure for the Senate to vote to end debate on a question so that it could be voted on, which opened the door to filibusters. In 1917, the Senate introduced a procedure to allow for ending debate (invoking cloture) with a two-thirds majority, later reduced in 1975 to three-fifths of the senators duly chosen and sworn (60 if there is no more than one vacancy). Thus, although a measure might have majority support, opposition from or absence by at least 41 senators can effectively defeat a bill by preventing debate on it from ending, in a tactic known as a filibuster.

Since the 1970s, the Senate has also used a "two-track" procedure whereby Senate business may continue on other topics while one item is being filibustered. Since filibusters no longer require the minority to actually hold the floor and bring all other business to a halt, the mere threat of a filibuster has gradually become normalized. In the modern Senate, this means that most measures now typically requires 60 votes to advance, unless a specific exception limiting the time for debate applies.

Changing Rule XXII to eliminate the 60-vote threshold is made difficult by the rules themselves. Rule XXII, paragraph 2, states that to end debate on any proposition "to amend the Senate rules [...] the necessary affirmative vote shall be two-thirds of the Senators present and voting". If all senators vote, 67 votes are required to invoke cloture on a proposition to amend a rule.
Terminology

Republican Senator Ted Stevens suggested using a ruling of the chair to defeat a filibuster of judicial nominees in February 2003. The code word for the plan was "Hulk". Weeks later, Senator Trent Lott coined the term nuclear option in March 2003 because the maneuver was seen as a last resort with possibly major consequences for both sides. The metaphor of a nuclear strike refers to the majority party unilaterally imposing a change to the filibuster rule, which might provoke retaliation by the minority party.

The alternative term "constitutional option" is often used with particular regard to confirmation of executive and judicial nominations, on the theory that the United States Constitution requires these nominations to receive the "advice and consent" of the Senate. Proponents of this term argue that the Constitution implies that the Senate can act by a majority vote unless the Constitution itself requires a supermajority, as it does for certain measures such as the ratification of treaties. By effectively requiring a supermajority of the Senate to fulfil this function, proponents believed that (before the changes -- [such as the change made in 2013] -- to require only a simple majority) the previous Senate practice prevented the Senate from exercising its constitutional mandate. The remedy was therefore called the "constitutional option".
2005 debate on judicial nominations

The maneuver was brought to prominence in 2005 when Majority Leader Bill Frist threatened its use to end Democratic-led filibusters of judicial nominees submitted by President George W. Bush. In response to this threat, Democrats threatened to obstruct all routine Senate business. The ultimate confrontation was prevented by the Gang of 14, a group of seven Democratic and seven Republican Senators, all of whom agreed to oppose the nuclear option and oppose filibusters of judicial nominees, except in extraordinary circumstances. Several of the blocked nominees were brought to the floor, voted upon and approved as specified in the agreement, and others were dropped and did not come up for a vote, as implied by the agreement.
Rules reforms, 2011 and 2013

In 2011, with a Democratic majority in the Senate (but not a 60-vote majority), Senators Jeff Merkley and Tom Udall proposed "a sweeping filibuster reform package" to be implemented by the nuclear option, but Majority Leader Harry Reid dissuaded them from pushing it forward.

The nuclear option was raised again following the congressional elections of 2012, with Senate Democrats still in the majority (but short of a supermajority). The Democrats had been the majority party in the Senate since 2007, but only briefly did they have the 60 votes necessary to halt a filibuster. The Hill reported that Democrats would "likely" use the nuclear option in January 2013 to effect filibuster reform, but the two parties managed to negotiate two packages of amendments to Senate rules concerning filibusters that were agreed to on January 24, 2013, thus avoiding the need for the nuclear option.

In July 2013, the nuclear option was raised as nominations were being blocked by Senate Republicans as Senate Democrats prepared to push through a change to the chamber's filibuster rule. On July 16, the Senate Democratic majority came within hours of using the nuclear option to win confirmation of seven of President Obama's long-delayed executive branch appointments. The confrontation was avoided when the White House withdrew two of the nominations in exchange for the other five being brought to the floor for a vote, where they were confirmed.
Recent usage
1995: Hutchison precedent

Rule XVI of the Standing Rules of the Senate prohibits legislative material from being included in general appropriations bills.

In 1995, during consideration of the Emergency Supplemental Appropriations and Rescissions for the Department of Defense to Preserve and Enhance Military Readiness Act of 1995, Senator Kay Bailey Hutchison offered an amendment that would have changed existing law regarding endangered species, therefore violating Rule XVI. Senator Harry Reid raised a point of order against the amendment, which the chair sustained. Hutchison appealed the ruling of the chair. The Senate voted against sustaining the decision of the chair by a vote of 42–57. The Senate thus set a precedent nullifying the provision of Rule XVI.

In 1999, the Hutchison precedent was overturned (and the original effect of Rule XVI restored) when the Senate agreed to S.Res. 160, which states:

Resolved, That the presiding officer of the Senate should apply all precedents of the Senate under rule 16, in effect at the conclusion of the 103d Congress.

1996: FedEx precedent

Rule XXVIII, paragraph 3, of the Standing Rules of the Senate prohibits any matter outside the scope of a conference from being included in a conference report.

In 1996, during consideration of the conference report on the Federal Aviation Reauthorization Act of 1996, Majority Leader Trent Lott raised a point of order that the conference report exceeded the scope of the conference with respect to provisions relating to FedEx. After the point of order was sustained by the chair, Lott appealed the ruling of the chair. The Senate voted against sustaining the decision of the chair by a vote of 39–56. The Senate thus set a precedent nullifying the provision of Rule XXVIII.

In 2000, the FedEx precedent was overturned (and the original effect of Rule XXVIII restored) when Congress passed the Legislative Branch Appropriations Act for fiscal year 2001, which states, in relevant part:

SEC. 903. Beginning on the first day of the 107th Congress, the Presiding Officer of the Senate shall apply all of the precedents of the Senate under Rule XXVIII in effect at the conclusion of the 103d Congress.

2013: Cloture on nominations

On November 21, 2013, Majority Leader Harry Reid raised a point of order that "the vote on cloture under Rule XXII for all nominations other than for the Supreme Court of the United States is by majority vote." The presiding officer overruled the point of order, and the Senate voted 48–52 against sustaining the decision of the chair. The Senate therefore set a precedent that cloture can be invoked on nominations (except to the Supreme Court) by a simple majority, even though the plain text of the rule requires "three-fifths of the senators duly chosen and sworn" to invoke cloture. Three Democrats (Carl Levin, Joe Manchin and Mark Pryor) voted with all Republicans in favor of sustaining the decision of the chair. The text of Rule XXII was never changed.

Although the 60-vote threshold was eliminated for most nominations, nominations are still susceptible to being delayed by filibusters, and 60 votes were still required to invoke cloture on other questions such as legislation and Supreme Court nominations.
Rationale for change

The Democrats' stated motivation for this change was the perceived expansion of filibustering by Republicans during the Obama administration, in particular blocking three nominations to the United States Court of Appeals for the District of Columbia Circuit. Republicans had asserted that the D.C. Circuit was underworked, and also cited the need for cost reduction by reducing the number of judges in that circuit. At the time of the vote, 59 executive branch nominees and 17 judicial nominees were awaiting confirmation.

Prior to November 21, 2013, there had been only 168 cloture motions filed (or reconsidered) with regard to nominations. Nearly half of them (82) had been during the Obama administration. However, those cloture motions were often filed merely to speed things along, rather than in response to any filibuster. In contrast, there were just 38 cloture motions on nominations during the preceding eight years under President George W. Bush. Most of those cloture votes were successful. Obama won Senate confirmation for 30 out of 42 (71%) federal appeals court nominations, compared with Bush's 35 out of 52 (67%).

Regarding Obama's federal district court nominations, the Senate approved 143 out of 173 (83%) as of November 2013, compared to George W. Bush's first term 170 of 179 (95%), Bill Clinton's first term 170 of 198 (86%), and George H.W. Bush's 150 of 195 (77%). Filibusters were used on 20 of Obama's nominations to district court positions, but Republicans had allowed confirmation of 19 out of the 20 before the nuclear option was invoked.
2017: Cloture on Supreme Court nominations

On April 6, 2017, the Republican-majority Senate invoked the nuclear option and voted 48–52 along party lines against sustaining the decision of the chair on a point of order raised by Majority Leader Mitch McConnell, thus removing the Supreme Court exception created in 2013. This established a new precedent which allowed cloture to be invoked on Supreme Court nominations by a simple majority. The vote came after Senate Democrats filibustered the nomination of Neil Gorsuch to the Supreme Court of the United States.
2019: Postcloture time on low-level nominations

On April 3, 2019, in response to a perceived increase in postcloture filibusters by Senate Democrats on President Trump's executive and judicial nominations, the Republican-majority Senate voted 51–49 to overturn a ruling of the chair and thus set a precedent that postcloture debate on nominations—other than those to the Supreme Court of the United States, to the United States courts of appeals and to positions at Level I of the Executive Schedule—is two hours. All Republicans except Senators Susan Collins and Mike Lee voted against sustaining the decision of the chair.
2025 instances called "nuclear"

Senate Democrats accused the Republican majority under Majority Leader John Thune of exercising the nuclear option three times in 2025:

to allow consideration of joint resolutions of disapproval under the Congressional Review Act,
to allow the use of a current policy budget baseline for scoring of the One Big Beautiful Bill Act, and
to allow the consideration in executive session of a resolution allowing the Majority Leader to move to proceed to the en bloc consideration of multiple nominations.

Proposed use for legislation

Following elimination of the 60-vote rule for nominations in 2013, senators expressed concerns that the 60-vote rule will eventually be eliminated for legislation via the nuclear option.

While President, Donald Trump spoke out against the 60-vote requirement for legislation on several occasions. Then-Senate Majority Leader Mitch McConnell opposed abolishing the filibuster despite Trump's demands, and in April 2017, 61 senators (32 Republicans, 28 Democrats, and one independent) signed a letter stating their opposition to abolishing the filibuster for legislation. On January 21, 2018, Trump said on Twitter that if the shutdown stalemate continued, Republicans should consider the "nuclear option" in the Senate. He repeated the call on December 21, 2018, with a fresh shutdown looming.

Concerns about abolishing the filibuster through the nuclear option were reiterated in 2021 as the Democratic-majority Senate could attempt to eliminate the filibuster through the nuclear option. On January 3, 2022, Senate Majority Leader Chuck Schumer announced that the Senate would vote on using the nuclear option to reform the filibuster in order to pass his party's election reform legislation. On January 19, 2022, Schumer made a point of order that would have allowed a 'talking filibuster' on a voting rights bill without any other dilatory measures. The Senate voted 52–48 to sustain the decision of the chair overruling the point of order. Senators Joe Manchin and Kyrsten Sinema voted with all Republicans in favor of sustaining the decision of the chair.

In September 2024, then Vice President and presidential candidate Kamala Harris called for ending the filibuster to enact abortion legislation.
Other uses of "nuclear option"

After the appointment of Amy Coney Barrett to the Supreme Court of the United States, a proposed countermove to increase the size of the Supreme Court past nine seats (and thereby create new vacancies) was also described as a "nuclear option".

Beyond the specific context of the U.S. Senate, the term "nuclear option" has come to be used generically for a procedural maneuver with potentially serious consequences, to be used as a last resort to overcome political opposition. The term has been used in connection with procedural maneuvers in various state senates.

In a 2005 legal ruling on the validity of the Hunting Act 2004 the UK House of Lords, sitting in its judicial capacity, used "nuclear option" to describe the events of 1832, when the then-government threatened to create hundreds of new Whig peers to force the Tory-dominated Lords to accept the Reform Act 1832. (Nuclear weapons were not theorized until the 20th century, so the government's threat was not labeled as "nuclear" at the time.)

assistant:

Summary: The Nuclear Option in the U.S. Senate

The "nuclear option" is a procedural tactic in the U.S. Senate that allows the chamber to bypass a 60-vote filibuster threshold for changing Senate rules by a simple majority vote (51 senators), rather than the usual two-thirds supermajority. It was coined in 2003 and analogized to a last-resort nuclear strike.

Key Points:

  • Origin: The term was popularized in 2003, though the tactic was used earlier (e.g., 1995 Hutchison precedent).
  • Mechanism: A senator raises a point of order; the presiding officer rules; the ruling is appealed; and the Senate votes by a simple majority to overturn the chair’s decision. This sets a new precedent.
  • Use in Nominations:
    • 2013: Democrats used it to lower the cloture threshold for most nominations (excluding Supreme Court) to a simple majority.
    • 2017: Republicans extended it to Supreme Court nominations, overcoming a Democratic filibuster of Neil Gorsuch.
  • Legislative Filibuster: The option has been proposed to abolish the 60-vote rule for legislation, but it has not been successfully invoked for that purpose.
  • Recent Activity:
    • 2019: Republicans used it to limit post-cloture debate time on nominations.
    • 2025: Democrats accused Republicans of using the tactic three times to bypass filibusters.
  • Broader Use: The term is now used more broadly to describe any high-stakes procedural maneuver to break a legislative deadlock.

In essence, the nuclear option is a controversial but established tool in Senate history, used to override long-standing rules by a simple majority, often in response to partisan gridlock.

@ymcki
Contributor

ymcki commented Jan 6, 2026

I uploaded a file of 20,000 tokens, and it crashed and exited :( I will provide a more detailed log tomorrow.

I also found that running a long context with the recurrent form hits a hard limit in llama_graph and crashes. Now that the chunking form is implemented, this issue seems to be gone.

@pwilkin
Contributor

pwilkin commented Jan 6, 2026

I found that the optimal way to go in GGML is not chunking and recurrent, but actually chunking and autoregressive, with the autoregressive form for n_seq_tokens = 1 removing the entire code fragment needed for calculating decay with the SOLVE_TRI calculation. You can see in the newest code for Qwen3Next how this is done.

@ymcki
Contributor

ymcki commented Jan 6, 2026

I found that the optimal way to go in GGML is not chunking and recurrent, but actually chunking and autoregressive, with the autoregressive form for n_seq_tokens = 1 removing the entire code fragment needed for calculating decay with the SOLVE_TRI calculation. You can see in the newest code for Qwen3Next how this is done.

Thanks for your comment.

Do you mean there is no need to change the chunking code?

The autoregressive code is expected to be faster than the recurrent code?

@pwilkin
Contributor

pwilkin commented Jan 6, 2026

Autoregressive code skips all the operations needed to create the decay mask etc., so it's much faster for generation.

The chunking code can be optimized on the GGML level (factorizing the chunk-irrelevant operations before the chunk loop, removing unnecessary cont / transpose operations etc.). You can compare with the history of Qwen3 Next pulls to see how I did it :) but basically for the first PR, a correct version with chunking + autoregressive passes is enough, optimizations can come later.
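As a rough illustration (names, shapes and conventions hypothetical; this is a NumPy sketch, not the actual ggml graph), the autoregressive path for n_seq_tokens = 1 collapses to a single gated delta-rule update per token, with no decay mask and no triangular solve:

```python
import numpy as np

def kda_step(S, q, k, v, g, beta):
    """One autoregressive (n_seq_tokens == 1) gated delta-rule step for one head.

    S    : (d_v, d_k) recurrent state
    q, k : (d_k,) L2-normalized query / key
    v    : (d_v,) value
    g    : (d_k,) per-channel log-decay (<= 0)
    beta : scalar write strength in (0, 1]
    """
    S = S * np.exp(g)[None, :]       # apply per-channel decay first
    err = v - S @ k                  # delta-rule prediction error
    S = S + beta * np.outer(err, k)  # rank-1 state update
    return S, S @ q                  # new state and this token's output
```

With a zero state, no decay, beta = 1 and a unit-norm key, a single step writes v into the state exactly, which is the sanity check the chunked form also has to satisfy on its first token.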

@aarongerber

Thanks to @cacaview for getting this implementation started and to @Aaryan-Kapoor for the fixes, we finally have something that is working. However, looking at @pwilkin's Qwen3-Next implementation, I believe it is better to implement Kimi-Linear similarly, so that the implementation can be backend agnostic without introducing any new ggml functions.

Here is a summary of what I have done:

  1. Implemented KDA's recurrent form with existing ggml functions only by extending @pwilkin's
    Qwen3-Next's delta_net_recurrent function. @cacaview's implementation is only equivalent

    I am hardly an expert here, so hopefully this isn’t the most foolish question, but have we regression tested Qwen3-Next to make sure changes don’t impact it?

@ymcki
Contributor

ymcki commented Jan 6, 2026

Thanks to @cacaview for getting this implementation started and to @Aaryan-Kapoor for the fixes, we finally have something that is working. However, looking at @pwilkin's Qwen3-Next implementation, I believe it is better to implement Kimi-Linear similarly, so that the implementation can be backend agnostic without introducing any new ggml functions.

Here is a summary of what I have done:

  1. Implemented KDA's recurrent form with existing ggml functions only by extending @pwilkin's
    Qwen3-Next's delta_net_recurrent function. @cacaview's implementation is only equivalent

    I am hardly an expert here, so hopefully this isn’t the most foolish question, but have we regression tested Qwen3-Next to make sure changes don’t impact it?

What I said was that I learned from the Qwen3 Next implementation when writing this one, so there should be no changes to Qwen3 Next.

@ymcki
Contributor

ymcki commented Jan 6, 2026

I sync'ed my code from b7240 to b7243 for even easier review and merge. It seems like it still works with the ggufs I made before. Please give it a try and see if there are any new bugs.

Will try to implement autoregressive form as suggested by pwilkin next.

@rhjdvsgsgks
Contributor

rhjdvsgsgks commented Jan 6, 2026

@ymcki I found Kimi Linear has larger memory usage (4 GB) compared to Qwen Next (1 GB) at the same context size. Is that expected behavior?

The response is also gibberish, on commit ymcki@30d883c with the model downloaded from https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF/commit/d32f0993538b01d14e5470d1b2b50f297fd25498 . Is there any info I need to provide in order to help you reproduce?

@ymcki
Contributor

ymcki commented Jan 7, 2026

I sync'ed my code from b7240 to b7243 for even easier review and merge. It seems like it still works with the ggufs I made before. Please give it a try and see if there are any new bugs.

Will try to implement autoregressive form as suggested by pwilkin next.

@ymcki I found Kimi Linear has larger memory usage (4 GB) compared to Qwen Next (1 GB) at the same context size. Is that expected behavior?

It is possible that Kimi Linear under the current implementation uses more VRAM for the new chunking and recurrent code than Qwen3 Next, because I used ggml_repeat and ggml_mul in place of ggml_mul_mat, since ggml only supports up to four dimensions.

But 4x seems too much. Can you tell me how to reproduce?

The response is also gibberish, on commit ymcki@30d883c with the model downloaded from https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF/commit/d32f0993538b01d14e5470d1b2b50f297fd25498 . Is there any info I need to provide in order to help you reproduce?

When you say gibberish, do you mean you get it all the time or only for a specific prompt? If the latter, can you show me the prompt?

@pwilkin
Contributor

pwilkin commented Jan 7, 2026

It is possible that Kimi Linear under the current implementation uses more VRAM for the new chunking and recurrent code than Qwen3 Next, because I used ggml_repeat and ggml_mul in place of ggml_mul_mat, since ggml only supports up to four dimensions.

You definitely don't want to do that. Qwen3 Next also has examples on how to pack the matrices for MUL_MAT and then unpack them again :)

@ymcki
Contributor

ymcki commented Jan 7, 2026

Replaced build_kda_recurrent with build_kda_autoregressive. About 60% gain in inference.

Code is also sync'd to b7682.

recurrent form llama-bench: pp ~450 t/s, tg ~20 t/s

./build/bin/llama-bench -m ~/Kimi-Linear-48B-A3B-Instruct-GGUF/Kimi-Linear-48B-A3B-Instruct.Q2_K.gguf -n 32 -d 8192 -b 64,128,256,512,1024,2048,4096,8192,16384 -ot shexp=CUDA0,"model.layers.0."=CUDA0

ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | n_batch | ot | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------------- | --------------: | -------------------: |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | shexp=CUDA0 | pp512 @ d8192 | 408.36 ± 1.12 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | shexp=CUDA0 | tg32 @ d8192 | 19.82 ± 0.05 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | shexp=CUDA0 | pp512 @ d8192 | 358.01 ± 1.18 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | shexp=CUDA0 | tg32 @ d8192 | 19.64 ± 0.22 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | shexp=CUDA0 | pp512 @ d8192 | 418.22 ± 1.29 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | shexp=CUDA0 | tg32 @ d8192 | 19.54 ± 0.22 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | shexp=CUDA0 | pp512 @ d8192 | 452.92 ± 3.47 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | shexp=CUDA0 | tg32 @ d8192 | 19.75 ± 0.08 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | shexp=CUDA0 | pp512 @ d8192 | 452.00 ± 4.07 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | shexp=CUDA0 | tg32 @ d8192 | 19.67 ± 0.12 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | shexp=CUDA0 | pp512 @ d8192 | 450.41 ± 4.37 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | shexp=CUDA0 | tg32 @ d8192 | 19.61 ± 0.08 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | shexp=CUDA0 | pp512 @ d8192 | 450.61 ± 3.60 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | shexp=CUDA0 | tg32 @ d8192 | 19.92 ± 0.05 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | shexp=CUDA0 | pp512 @ d8192 | 449.15 ± 3.80 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | shexp=CUDA0 | tg32 @ d8192 | 19.54 ± 0.15 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | shexp=CUDA0 | pp512 @ d8192 | 449.35 ± 3.07 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | shexp=CUDA0 | tg32 @ d8192 | 19.86 ± 0.12 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | model.layers.0.=CUDA0 | pp512 @ d8192 | 399.53 ± 0.74 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.85 ± 0.06 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | model.layers.0.=CUDA0 | pp512 @ d8192 | 351.87 ± 1.73 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.85 ± 0.15 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | model.layers.0.=CUDA0 | pp512 @ d8192 | 413.28 ± 0.88 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.79 ± 0.08 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | model.layers.0.=CUDA0 | pp512 @ d8192 | 446.92 ± 3.52 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.91 ± 0.08 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | model.layers.0.=CUDA0 | pp512 @ d8192 | 446.71 ± 2.79 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.79 ± 0.12 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | model.layers.0.=CUDA0 | pp512 @ d8192 | 446.87 ± 3.15 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.83 ± 0.08 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | model.layers.0.=CUDA0 | pp512 @ d8192 | 446.66 ± 3.05 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.77 ± 0.08 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | model.layers.0.=CUDA0 | pp512 @ d8192 | 445.67 ± 3.94 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.49 ± 0.34 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | model.layers.0.=CUDA0 | pp512 @ d8192 | 445.92 ± 4.00 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.44 ± 0.20 |

build: 67bee56 (7243)

autoregressive form llama-bench: pp ~450 t/s, tg ~32 t/s

./build/bin/llama-bench -m ~/Kimi-Linear-48B-A3B-Instruct-GGUF/Kimi-Linear-48B-A3B-Instruct.Q2_K.gguf -n 32 -d 8192 -b 64,128,256,512,1024,2048,4096,8192,16384 -ot shexp=CUDA0,"model.layers.0."=CUDA0

ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | n_batch | ot | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------------- | --------------: | -------------------: |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | shexp=CUDA0 | pp512 @ d8192 | 349.81 ± 1.44 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | shexp=CUDA0 | tg32 @ d8192 | 30.83 ± 2.42 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | shexp=CUDA0 | pp512 @ d8192 | 353.50 ± 1.19 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | shexp=CUDA0 | tg32 @ d8192 | 30.67 ± 0.46 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | shexp=CUDA0 | pp512 @ d8192 | 414.68 ± 1.34 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | shexp=CUDA0 | tg32 @ d8192 | 31.82 ± 0.18 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | shexp=CUDA0 | pp512 @ d8192 | 449.73 ± 3.58 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | shexp=CUDA0 | tg32 @ d8192 | 31.80 ± 0.40 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | shexp=CUDA0 | pp512 @ d8192 | 451.11 ± 2.85 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | shexp=CUDA0 | tg32 @ d8192 | 31.67 ± 0.50 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | shexp=CUDA0 | pp512 @ d8192 | 451.03 ± 4.71 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | shexp=CUDA0 | tg32 @ d8192 | 30.58 ± 1.91 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | shexp=CUDA0 | pp512 @ d8192 | 450.53 ± 3.25 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | shexp=CUDA0 | tg32 @ d8192 | 31.60 ± 0.45 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | shexp=CUDA0 | pp512 @ d8192 | 448.78 ± 3.66 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | shexp=CUDA0 | tg32 @ d8192 | 31.29 ± 1.03 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | shexp=CUDA0 | pp512 @ d8192 | 449.71 ± 3.19 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | shexp=CUDA0 | tg32 @ d8192 | 31.51 ± 0.48 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | model.layers.0.=CUDA0 | pp512 @ d8192 | 348.25 ± 0.75 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | model.layers.0.=CUDA0 | tg32 @ d8192 | 31.92 ± 0.10 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | model.layers.0.=CUDA0 | pp512 @ d8192 | 350.55 ± 2.45 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | model.layers.0.=CUDA0 | tg32 @ d8192 | 30.88 ± 2.09 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | model.layers.0.=CUDA0 | pp512 @ d8192 | 413.78 ± 0.99 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | model.layers.0.=CUDA0 | tg32 @ d8192 | 30.85 ± 2.32 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | model.layers.0.=CUDA0 | pp512 @ d8192 | 448.86 ± 3.44 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | model.layers.0.=CUDA0 | tg32 @ d8192 | 31.93 ± 0.11 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | model.layers.0.=CUDA0 | pp512 @ d8192 | 449.72 ± 4.05 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | model.layers.0.=CUDA0 | tg32 @ d8192 | 32.14 ± 0.14 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | model.layers.0.=CUDA0 | pp512 @ d8192 | 448.00 ± 5.42 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | model.layers.0.=CUDA0 | tg32 @ d8192 | 31.83 ± 0.39 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | model.layers.0.=CUDA0 | pp512 @ d8192 | 448.81 ± 3.38 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | model.layers.0.=CUDA0 | tg32 @ d8192 | 31.90 ± 0.40 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | model.layers.0.=CUDA0 | pp512 @ d8192 | 447.99 ± 3.44 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | model.layers.0.=CUDA0 | tg32 @ d8192 | 31.91 ± 0.14 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | model.layers.0.=CUDA0 | pp512 @ d8192 | 448.73 ± 2.95 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | model.layers.0.=CUDA0 | tg32 @ d8192 | 31.39 ± 1.15 |

build: 40f6118 (7682)

@pwilkin
Contributor

pwilkin commented Jan 7, 2026

Replaced build_kda_recurrent with build_kda_autoregressive. About 60% gain in inference.

Code is also sync'd to b7682.
recurrent form llama-bench pp ~450t/s ~tg 20t/s

autoregressive form llama-bench pp ~450t/s ~tg 32t/s

Yeah, now you do the same thing that's done in Qwen3Next: vary which form is used according to n_seq_tokens.
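Sketched in Python (function names hypothetical; the real code builds ggml graphs), the dispatch amounts to picking the form per graph build based on the tokens-per-sequence count:

```python
def build_kda_autoregressive(ubatch):
    # hypothetical cheap single-token path: no decay mask, no triangular solve
    return "autoregressive"

def build_kda_chunked(ubatch):
    # hypothetical chunked path used for prompt processing
    return "chunked"

def build_kda(ubatch, n_seq_tokens):
    # dispatch on the number of tokens per sequence in this ubatch
    if n_seq_tokens == 1:
        return build_kda_autoregressive(ubatch)
    return build_kda_chunked(ubatch)
```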

@ymcki
Contributor

ymcki commented Jan 7, 2026

It is possible that Kimi Linear under the current implementation uses more VRAM for the new chunking and recurrent code than Qwen3 Next, because I used ggml_repeat and ggml_mul in place of ggml_mul_mat, since ggml only supports up to four dimensions.

You definitely don't want to do that. Qwen3 Next also has examples on how to pack the matrices for MUL_MAT and then unpack them again :)

The problem is that the cumsum term of KDA has an extra dimension of S_k. It doesn't seem to me that it is possible to mul_mat two [chunk_size,chunk_size,S_k,CHB] tensors, as mul_mat only works on the first two dimensions.

    const int64_t CHB = n_chunks * H_v * n_seqs;  // collapse chunks/heads/seqs into one batch dim
    // broadcast the cumulative log-decay along both chunk axes: decay_mask[i,j] = g[j] - g[i] per channel
    ggml_tensor * g_i = ggml_reshape_4d(ctx0, gk_cumsum, chunk_size, 1, S_k, CHB);
    ggml_tensor * g_j = ggml_reshape_4d(ctx0, gk_cumsum, 1, chunk_size, S_k, CHB);
    ggml_tensor * g_j_bc = ggml_repeat_4d(ctx0, g_j, chunk_size, chunk_size, S_k, CHB);
    ggml_tensor * decay_mask = ggml_sub(ctx0, g_j_bc, g_i);

    // mask before exp (masked entries become exp(0) = 1), then mask again to zero them out
    decay_mask = ggml_mul(ctx0, decay_mask, diag_mask);
    decay_mask = ggml_exp(ctx0, decay_mask);
    decay_mask = ggml_mul(ctx0, decay_mask, diag_mask);

    // broadcast the keys the same way and take the decay-weighted elementwise product
    ggml_tensor * k_per = ggml_cont(ctx0, ggml_permute(ctx0, k, 1, 0, 2, 3));
    ggml_tensor * k_i = ggml_reshape_4d(ctx0, k_per, chunk_size, 1, S_k, CHB);
    ggml_tensor * k_i_bc = ggml_repeat_4d(ctx0, k_i, chunk_size, chunk_size, S_k, CHB);
    ggml_tensor * k_j = ggml_reshape_4d(ctx0, k_per, 1, chunk_size, S_k, CHB);
    ggml_tensor * k_j_bc = ggml_repeat_4d(ctx0, k_j, chunk_size, chunk_size, S_k, CHB);

    ggml_tensor * Akk = ggml_mul(ctx0, decay_mask, k_j_bc);
    Akk = ggml_mul(ctx0, Akk, k_i_bc);

    // reduce over the S_k channel dim to get one [chunk_size, chunk_size] Akk matrix per head
    Akk = ggml_cont(ctx0, ggml_permute(ctx0, Akk, 1, 2, 0, 3));
    Akk = ggml_sum_rows(ctx0, Akk);
    Akk = ggml_reshape_4d(ctx0, Akk, chunk_size, chunk_size, n_chunks, H_k * n_seqs);

If I use clamp, it is possible to use mul_mat but then the solution is an approximation and not exact.

Would it work if I somehow pack them into [chunk_size*chunk_size,S_k,n_chunks,HB]???
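For reference, here is a NumPy sketch (illustrative small sizes, not the real chunk/head dims) checking the broadcast computation above against a factored mul_mat form; the exp(-g) factor is exactly where the clamp concern comes from:

```python
import numpy as np

# Illustrative sizes only; real ones are chunk_size, S_k, and CHB = n_chunks*H*n_seqs
chunk_size, S_k, CHB = 4, 3, 2
rng = np.random.default_rng(0)
g = rng.normal(size=(CHB, S_k, chunk_size))   # per-channel cumulative log-decay (gk_cumsum)
k = rng.normal(size=(CHB, S_k, chunk_size))   # keys

# Broadcast form (what the ggml_repeat/ggml_mul graph computes):
# decay[b,s,j,i] = exp(g[b,s,j] - g[b,s,i]) on the lower triangle j >= i
tril = np.tril(np.ones((chunk_size, chunk_size)))
diff = g[:, :, :, None] - g[:, :, None, :]
decay = np.exp(diff * tril) * tril            # mask before and after exp, as in the graph
Akk_broadcast = np.einsum('bsji,bsj,bsi->bji', decay, k, k)

# Factored mul_mat form: exp(g_j - g_i)*k_j*k_i = (k_j*exp(g_j)) * (k_i*exp(-g_i)).
# exp(-g_i) can overflow for strongly negative cumulative decays, hence the clamp.
kf = k * np.exp(g)
kb = k * np.exp(-g)
Akk_matmul = tril * np.einsum('bsj,bsi->bji', kf, kb)

assert np.allclose(Akk_broadcast, Akk_matmul)
```

In exact arithmetic the two forms agree; in float the factored form only needs a clamp when the cumulative decays get large in magnitude.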

@pwilkin
Copy link
Contributor

pwilkin commented Jan 7, 2026

The problem is that the cumsum term of KDA has an extra dimension of S_k. It doesn't seem to me it is possible to mul_mat two [chunk_size,chunk_size,S_k,CHB] tensors as mul_mat only works on the first two dimensions.

Nah, that's not how it works.

MUL_MAT works in all dimensions, but the 3rd and 4th are batch dimensions, so their role is technically a "foreach" loop.

That's why you can generally squish the batch dimensions without any problems. As a general rule, you can always (w.r.t. correctness) do ggml_reshape_3d(ctx, tensor, dim_A, dim_B, all_batch_dims_in_one), but that might not always be efficient.
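A NumPy analogue of this batch-dim flattening (shapes are illustrative only):

```python
import numpy as np

# ggml's MUL_MAT treats the 3rd and 4th dims as batch ("foreach") dims, so they
# can be collapsed into one batch dim without changing the result.
rng = np.random.default_rng(1)
A = rng.normal(size=(2, 3, 4, 5))            # two batch dims (2, 3)
B = rng.normal(size=(2, 3, 5, 6))
out4d = A @ B                                 # batched matmul over (2, 3)

out3d = (A.reshape(6, 4, 5) @ B.reshape(6, 5, 6)).reshape(2, 3, 4, 6)
assert np.allclose(out4d, out3d)
```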

@ymcki
Copy link
Contributor

ymcki commented Jan 8, 2026

Committed a mul_mat with clamp version. VRAM usage reduced from 4GB to 3.6GB with the command below. Is this good enough?

./build/bin/llama-cli -m ~/Kimi-Linear-48B-A3B-Instruct-GGUF/Kimi-Linear-48B-A3B-Instruct.Q4_K_M.gguf -c 8192 -cmoe -ngl 100

@ymcki
Copy link
Contributor

ymcki commented Jan 9, 2026

Committed a mul_mat without clamp version that should give exact solutions. Code sync'd to b7712.

@ymcki
Copy link
Contributor

ymcki commented Jan 11, 2026

I noticed that the version of @cacaview fixed by @Aaryan-Kapoor does not yet have
an MLA KV cache and runs with an MHA KV cache instead. So I followed the original MLA PR to
add MLA KV cache support and committed the code to my GitHub branch. This
reduces 1M-token VRAM usage from 140GB to 14.875GB.
#12801
https://github.com/ymcki/llama.cpp/tree/Kimi-Linear

GGUFs need to be regenerated to enable the MLA KV cache. I updated the ggufs in the HF repo
accordingly.
https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF/

The old ggufs won't get an MLA KV cache under the new implementation, but my code is
backward compatible, so they can still run with an MHA KV cache.

I believe my implementation is now complete for an initial version. It would
be great if reviewers could take a look and tell me what else I can do.

This is a Q4_K_M reply to a query over the 180K Wikipedia article:
https://ja.wikipedia.org/wiki/%E4%B8%89%E5%B3%B6%E7%94%B1%E7%B4%80%E5%A4%AB
I asked the LLM to generate a timeline of the whole life of the Japanese novelist
Mishima Yukio.

次は三島由紀夫のウィキペディア記事です。これに基づいて、彼の生涯の出来事を年表でまとめてください。 三島由紀夫の生涯は、戦前から戦後に至る日本の激動の時代と深く結びついており、文学的業績だけでなく政治思想や行動にも顕著な影響を与えた人物です。以下に、彼の生涯の主な出来事を年表形式でまとめます。

1925年(大正14年)

  • 1月14日:東京市四谷区永住町(現・新宿区四谷)で誕生。本名は平岡公威。
  • 祖父・平岡定太郎(樺太庁長官)と父・平岡梓(農林省官僚)の家族に生まれる。
  • 幼少期は祖母・夏子の影響を強く受ける。夏子は strict な教育方針を取り、歌舞伎や文芸に触れさせる
    ことで三島の美学の基礎を築く。

1938年(昭和13年)

  • 1月:初めての短編小説『酸模――秋彦の幼き思ひ出』を発表(筆名「三島由紀夫」)。
  • 4月:学習院中等科に進学。文芸部に入り、同級生・坊城俊民と出会い文学の道を歩み始める。

1941年(昭和16年)

  • 『花ざかりの森』を発表し、文壇にデビュー。筆名の由来は「三島(富士山)」と「由紀(新嘗祭の神事
    に使われる米)」。
  • 川端康成に影響を受け、日本浪曼派の文学的系譜に属していく。

1945年(昭和20年)

  • 終戦:日本の敗戦を知り、戦争の虚無感や死の観念を強く意識する。
  • 10月23日:妹・美津子が腸チフスで17歳で死去。彼にとって大きな打撃となる。
  • 11月:三谷邦子との恋愛が勃發し、後の自伝的小説『仮面の告白』に影響を与える。

1946年(昭和21年)

  • 1月:川端康成と出会い、師弟関係を築く。川端は三島の文学的才能を高く評価し、彼の出発点となる。
  • 5月5日:恋人・三谷邦子が銀行員・永井邦夫と結婚。三島は深い打撃を受ける。

1947年(昭和22年)

  • 11月28日:東京大学法学部を卒業。
  • 12月13日:高等文官試験に合格し、大蔵省(財務省)に勤務開始。
  • 12月24日:『岬にての物語』を発表。作家としての地位が確立される。

1948年(昭和23年)

  • 9月2日:大蔵省を辞職し、作家としての生活を始める。
  • 11月:初の長編『盗賊』を刊行。
  • 12月:短編集『夜の仕度』を刊行。

1949年(昭和24年)

  • 7月5日:『仮面の告白』を刊行。自己分析的な私小説として大きな話題を呼ぶ。
  • この頃から、戯曲家としても活動を広げていく。

1951年(昭和26年)

  • 1月:『禁色(第一部)』を『群像』に連載開始。同性愛をテーマにした作品で大きな議論を呼ぶ。
  • 5月:初の世界旅行(アメリカ・欧州・アジア)に出発。ギリシャ・アテネで古典美に感銘を受ける。

1954年(昭和29年)

  • 6月:『潮騒』を刊行。ベストセラーとなり、三船敏郎が出演する映画化も行われる。
  • この作品で初の文学賞(新潮社文学賞)を受賞。

1956年(昭和31年)

  • 1月:『金閣寺』を『新潮』に連載開始。
  • 10月:刊行。三島文学の金字塔ともされる代表作となる。
  • この年、読売文学賞を受賞。

1961年(昭和36年)

  • 1月:『憂国』を発表。二・二六事件を題材にした小説で、政治的な作品群の一つ。
  • 4月:剣道初段に合格。以後、剣道やボディビルに没頭する。

1965年(昭和40年)

  • 11月:『サド侯爵夫人』を劇団NLTで上演。フランスでも人気を博す。
  • **この年、ノーベル文学賞候補に名を連ねるも、最終的には受賞できず。

1968年(昭和43年)

  • 10月5日:「楯の会」を結成。皇国主義的な政治活動に参加し、学生との交流を行う。
  • この年、『文化防衛論』などの政治評論を発表。

1970年(昭和45年)

  • 11月25日:楯の会メンバー4名と共に自衛隊市ヶ谷駐屯地に突入し、総監益田兼利を人質に取り、演説後に自衛隊員に向けてクーデターを呼びかける。
  • **その後、割腹自殺を遂げる(45歳没)。**この事件(「三島事件」)は社会に大きな衝撃を与え、日本
    の政治・文学界に大きな影響を及ぼした。

死後の評価

  • 三島の死後、彼の作品は日本国内外で高く評価され続け、ノーベル文学賞候補にもなった。
  • 彼の思想や行動は「右翼」とも「保守思想家」とも言われるが、実際には古典主義的な美学と近代国家
    論を掲げる異端の思想家
    として語られる。

このように、三島由紀夫の生涯は、**文学と政治・行動を組み合わせた、日本の戦後史に深く刻まれた人物
**として語られています。彼の死は「昭和の終焉」を象徴する出来事とされ、後世に多大な影響を与え続け
ています。

@IIIIIllllIIIIIlllll
Copy link

I tried to test it, but it gave me an error. @ymcki

mark@MarkPC:~/llama.cpp/llama.cpp-Kimi-Linear/build/bin$ ./llama-cli --version
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
version: 0 (unknown)
built with GNU 15.2.0 for Linux x86_64
Launch command: /home/mark/llama.cpp/llama.cpp-Kimi-Linear/build/bin/llama-server -m /home/mark/Models/Q4/Kimi-Linear-48B-A3B-Instruct.Q4_K_M/Kimi-Linear-48B-A3B-Instruct.Q4_K_M.gguf --port 8084 --ctx-size 8192 --flash-attn on --no-mmap --temp 0.8 --top-p 0.9 --top-k 40 --min-p 0.1 --presence-penalty 0.0 --repeat-penalty 1.0 --frequency-penalty 0.0 --batch-size 2048 --ubatch-size 512 --parallel -1 --cache-ram 8192 --cache-type-k f16 --cache-type-v f16 --threads -1 --seed -1 --no-direct-io --no-webui --metrics --slot-save-path /home/mark/App/llama.cpp/cache
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 0 (unknown) with GNU 15.2.0 for Linux x86_64
system info: n_threads = 32, n_threads_batch = 32, total_threads = 32

system_info: n_threads = 32 (n_threads_batch = 32) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

init: using 31 threads for HTTP server
Web UI is disabled
start: binding port with default address family
main: loading model
srv    load_model: loading model '/home/mark/Models/Q4/Kimi-Linear-48B-A3B-Instruct.Q4_K_M/Kimi-Linear-48B-A3B-Instruct.Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
/home/mark/llama.cpp/llama.cpp-Kimi-Linear/ggml/src/ggml.c:3829: GGML_ASSERT(a->ne[0] == b->ne[0]) failed
[New LWP 13984]
[New LWP 13981]
[New LWP 13980]
[New LWP 13979]
[New LWP 13978]
[New LWP 13977]
[New LWP 13976]
[New LWP 13975]
[New LWP 13974]
[New LWP 13973]
[New LWP 13972]
[New LWP 13971]
[New LWP 13970]
[New LWP 13969]
[New LWP 13968]
[New LWP 13967]
[New LWP 13966]
[New LWP 13965]
[New LWP 13964]
[New LWP 13963]
[New LWP 13962]
[New LWP 13961]
[New LWP 13960]
[New LWP 13959]
[New LWP 13958]
[New LWP 13957]
[New LWP 13956]
[New LWP 13955]
[New LWP 13954]
[New LWP 13953]
[New LWP 13952]
[New LWP 13951]
[New LWP 13950]
[New LWP 13949]
[New LWP 13946]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56	../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56	in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x0000733f02ca013c in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49	./nptl/cancellation.c: No such file or directory
#2  __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75	in ./nptl/cancellation.c
#3  0x0000733f02d1c98f in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x0000733f0336ca13 in ggml_print_backtrace () from /home/mark/llama.cpp/llama.cpp-Kimi-Linear/build/bin/libggml-base.so.0
#5  0x0000733f0336cbc6 in ggml_abort () from /home/mark/llama.cpp/llama.cpp-Kimi-Linear/build/bin/libggml-base.so.0
#6  0x0000733f03374223 in ggml_set_rows () from /home/mark/llama.cpp/llama.cpp-Kimi-Linear/build/bin/libggml-base.so.0
#7  0x0000733f034f189f in llm_graph_context::build_attn(llm_graph_input_attn_kv*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, float, int) const () from /home/mark/llama.cpp/llama.cpp-Kimi-Linear/build/bin/libllama.so.0
#8  0x0000733f0360a107 in llm_build_kimi_linear::llm_build_kimi_linear(llama_model const&, llm_graph_params const&) () from /home/mark/llama.cpp/llama.cpp-Kimi-Linear/build/bin/libllama.so.0
#9  0x0000733f03530b53 in llama_model::build_graph(llm_graph_params const&) const () from /home/mark/llama.cpp/llama.cpp-Kimi-Linear/build/bin/libllama.so.0
#10 0x0000733f034b4ff6 in llama_context::graph_reserve(unsigned int, unsigned int, unsigned int, llama_memory_context_i const*, bool, unsigned long*) () from /home/mark/llama.cpp/llama.cpp-Kimi-Linear/build/bin/libllama.so.0
#11 0x0000733f034b9ee9 in llama_context::llama_context(llama_model const&, llama_context_params) () from /home/mark/llama.cpp/llama.cpp-Kimi-Linear/build/bin/libllama.so.0
#12 0x0000733f034bab27 in llama_init_from_model () from /home/mark/llama.cpp/llama.cpp-Kimi-Linear/build/bin/libllama.so.0
#13 0x0000733f0348f3d1 in llama_get_device_memory_data(char const*, llama_model_params const*, llama_context_params const*, std::vector<ggml_backend_device*, std::allocator<ggml_backend_device*> >&, unsigned int&, unsigned int&, unsigned int&, ggml_log_level) () from /home/mark/llama.cpp/llama.cpp-Kimi-Linear/build/bin/libllama.so.0
#14 0x0000733f03490554 in llama_params_fit_impl(char const*, llama_model_params*, llama_context_params*, float*, llama_model_tensor_buft_override*, unsigned long*, unsigned int, ggml_log_level) () from /home/mark/llama.cpp/llama.cpp-Kimi-Linear/build/bin/libllama.so.0
#15 0x0000733f03494352 in llama_params_fit () from /home/mark/llama.cpp/llama.cpp-Kimi-Linear/build/bin/libllama.so.0
#16 0x00005dbc3ddc7915 in common_init_result::common_init_result(common_params&) ()
#17 0x00005dbc3ddca32a in common_init_from_params(common_params&) ()
#18 0x00005dbc3dc6a281 in server_context_impl::load_model(common_params const&) ()
#19 0x00005dbc3dbc24fe in main ()

@cacaview
Copy link
Author

I tried to test it, but it gave me an error. @ymcki


Please compile the CPU version. In the current PR, we only discuss the CPU implementation.

@ymcki
Copy link
Contributor

ymcki commented Jan 11, 2026

/home/mark/llama.cpp/llama.cpp-Kimi-Linear/ggml/src/ggml.c:382

Are you using an outdated convert_hf_to_gguf.py to generate your ggufs?

I got this error when I used old ones generated without these five lines:

5125        # note: To enable MLA KV cache, attention needs to be converted into MQA (ie: GQA with 1 group)
5126        self.hparams["num_key_value_heads"] = 1
5181        # To enable MLA KV cache, MLA needs to be converted into MQA with larger heads, then decompresses to MHA
5182        self.gguf_writer.add_key_length(self.hparams["kv_lora_rank"] + self.hparams["qk_rope_head_dim"])
5183        self.gguf_writer.add_value_length(self.hparams["kv_lora_rank"])

They are essential to trigger the use of MLA KV cache. Otherwise, you will get this set_rows assertion in cpy_k.
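The derived head sizes follow directly from those hparams; a sketch with hypothetical values (the real kv_lora_rank / qk_rope_head_dim come from the model's config.json):

```python
# Hypothetical Kimi-Linear hparams for illustration; verify against config.json.
hparams = {"kv_lora_rank": 512, "qk_rope_head_dim": 64}

# MLA stored as MQA: K holds the compressed KV plus the RoPE part, V only the compressed KV
key_length   = hparams["kv_lora_rank"] + hparams["qk_rope_head_dim"]
value_length = hparams["kv_lora_rank"]

assert key_length == 576
assert value_length == 512
```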

@IIIIIllllIIIIIlllll
Copy link

Are you using an outdated convert_hf_to_gguf.py to generate your ggufs?

I download it from https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF
There might be a problem with the downloaded file. I'll try again.

@ymcki
Copy link
Contributor

ymcki commented Jan 11, 2026

Are you using an outdated convert_hf_to_gguf.py to generate your ggufs?

I download it from https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF There might be a problem with the downloaded file. I'll try again.

I am sorry the Q4_K_M gguf I uploaded is outdated. I am re-uploading it now. Meanwhile, you can try Q2_K

@jacekpoplawski
Copy link
Contributor

Shouldn't this be closed now?

@pwilkin pwilkin closed this Feb 8, 2026

Labels

examples ggml changes relating to the ggml tensor library for machine learning model Model specific Nvidia GPU Issues specific to Nvidia GPUs python python script changes
