Conversation
- Implement KDA layer (linear attention with gates and decay)
- Implement MLA layer (multi-head latent attention with KV compression)
- Support MoE FFN with shared experts
- Add TikToken tokenizer support for Kimi models
- Fix vocab loading for large vocabularies
- Model loads and runs inference (27 layers, 603 tensors)
- Add missing MoE metadata to GGUF conversion:
  - moe_intermediate_size (1024)
  - num_shared_experts (1)
  - first_k_dense_replace (1)
  - routed_scaling_factor (2.446)
  - expert_gating_func (sigmoid)
- Fix MoE gating function default to SIGMOID (was SOFTMAX)
- Add expert_weights_scale loading with default 2.446
- Enable moe_renormalize (norm_w=true) in build_moe_ffn
- Add fallback for exp_probs_b tensor suffix compatibility
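To make the gating change in this commit concrete, here is a minimal sketch of sigmoid routing with renormalization and a routed scaling factor. It is not the `build_moe_ffn` code from the PR: the function name `route_tokens`, the plain-list inputs, and the order "renormalize, then scale" are assumptions for illustration, and the selection bias (`exp_probs_b`) is left out.

```python
import math

def route_tokens(logits, top_k, scale=2.446, renormalize=True):
    """Hypothetical helper: pick top_k experts from router logits.

    Scores use a sigmoid (not a softmax), the selected weights are
    renormalized to sum to 1, then multiplied by the routed scaling factor.
    """
    # Sigmoid gating: each expert is scored independently of the others.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Indices of the top_k highest-scoring experts.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = [scores[i] for i in top]
    if renormalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    # Apply the routed scaling factor (2.446 is the default named in the commit).
    return top, [w * scale for w in weights]

experts, weights = route_tokens([0.0, 2.0, -1.0, 1.0], top_k=2)
```

With renormalization enabled, the selected weights always sum to the scaling factor, whatever the raw sigmoid scores were. With softmax gating and no renormalization the sum would depend on the unselected experts as well.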
- Add KDA (Kimi Delta Attention) CUDA kernel (kda-scan.cu)
- Fix recurrence order: decay first, then retrieval
- Verify CPU/CUDA implementation consistency
- Support head_dim=128, L2 normalization for Q/K
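The "decay first, then retrieval" ordering can be illustrated with a single-token step of a gated delta rule. This is only a sketch under stated assumptions, not the `kda-scan.cu` kernel: the name `kda_step`, the per-key-channel log-decay vector, the scalar `beta`, and the list-of-rows state layout are all hypothetical, and `k` is assumed to be L2-normalized already.

```python
import math

def kda_step(S, q, k, v, log_decay, beta):
    """One recurrent step on a d_k x d_v state S (list of rows).

    Order of operations: decay the state, retrieve the old value for k
    from the decayed state, apply the delta-rule update, then read out with q.
    """
    d_k, d_v = len(S), len(S[0])
    # 1) Decay first: scale row i of the state by exp(log_decay[i]).
    S = [[math.exp(log_decay[i]) * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Retrieval: what the decayed state currently stores for key k.
    v_old = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Delta rule: move the stored value toward v by a factor beta.
    delta = [beta * (v[j] - v_old[j]) for j in range(d_v)]
    S = [[S[i][j] + k[i] * delta[j] for j in range(d_v)] for i in range(d_k)]
    # 4) Output: query the updated state.
    out = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, out

state = [[0.0, 0.0], [0.0, 0.0]]
state, out1 = kda_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0],
                       log_decay=[0.0, 0.0], beta=1.0)
state, out2 = kda_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0],
                       log_decay=[math.log(0.5), 0.0], beta=1.0)
```

If retrieval ran before the decay, step 3 would compute the correction against the undecayed state and the stored value would end up too small after the decay was applied. With the order above, a unit-norm key and `beta = 1` store `v` exactly.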
| # KimiLinearModel is defined later in this file (line ~5140) as a TextModel subclass | ||
| # This old definition has been removed to avoid conflicts | ||
|
|
||
|
|
||
| @ModelBase.register( |
| # KimiLinearModel is defined later in this file (line ~5140) as a TextModel subclass | |
| # This old definition has been removed to avoid conflicts | |
| @ModelBase.register( | |
| @ModelBase.register( |
| @@ -5108,8 +5116,298 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter | |||
| (self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_K, bid), k), | |||
| (self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_V, bid), v), | |||
| ] | |||
| ] | |
| ] | |
| else: | |
| return [(self.map_tensor_name(name), data_torch)] |
convert_hf_to_gguf.py
| @ModelBase.register("KimiLinearModel", "KimiLinearForCausalLM") | ||
| class KimiLinearModel(TextModel): | ||
| """Kimi-Linear model with hybrid MLA+KDA architecture""" | ||
| model_arch = gguf.MODEL_ARCH.KIMI |
| model_arch = gguf.MODEL_ARCH.KIMI | |
| model_arch = gguf.MODEL_ARCH.KIMI_LINEAR |
| _experts: list[dict[str, Tensor]] | None = None | ||
|
|
||
| def set_gguf_parameters(self): | ||
| self.gguf_writer.add_vocab_size(self.hparams["vocab_size"]) |
| self.gguf_writer.add_vocab_size(self.hparams["vocab_size"]) | |
| super().set_gguf_parameters() | |
| self.gguf_writer.add_vocab_size(self.hparams["vocab_size"]) |
| # Use find_hparam for context length | ||
| # Kimi uses model_max_length | ||
| n_ctx = self.find_hparam(["max_position_embeddings", "model_max_length", "n_ctx", "n_positions"], optional=True) | ||
| if n_ctx is not None: | ||
| self.gguf_writer.add_context_length(n_ctx) | ||
| else: | ||
| return [(self.map_tensor_name(name), data_torch)] | ||
| # Default to 4096 if not found | ||
| logger.warning("No context length found in config, defaulting to 4096") | ||
| self.gguf_writer.add_context_length(4096) |
Add model_max_length to TextModel.set_gguf_parameters instead, the fallback is not necessary.
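As a rough illustration of the lookup this comment refers to, the sketch below shows a first-match search over candidate config keys. It is a standalone stand-in, not the converter's actual `find_hparam` method: the free function, its signature, and the config values are assumptions that merely mirror the call shape seen in the diff.

```python
def find_hparam(hparams, keys, optional=False):
    """Hypothetical stand-in: return the value of the first key present."""
    for key in keys:
        if key in hparams:
            return hparams[key]
    if optional:
        return None
    raise KeyError(f"could not find any of: {keys}")

# Made-up config for illustration; only model_max_length is present.
config = {"vocab_size": 1000, "model_max_length": 8192}
n_ctx = find_hparam(
    config,
    ["max_position_embeddings", "model_max_length", "n_ctx", "n_positions"],
    optional=True,
)
```

If the shared base-class lookup already tries `model_max_length`, the subclass needs neither its own search nor the hard-coded 4096 fallback, which is the point of the review comment.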
|
I have fixed these errors in the commit at cacaview@780dd78 |
Please address the remaining unresolved ones as well. |
|
I conducted some simple tests and encountered some issues. The root causes are still unclear.

Test Environment

Test Logs
- Test 1: Simple Greeting
- Test 2: Simple Math
  Incorrect calculation: 25 + 37 = 62, not 50.
- Test 3: Knowledge Q&A
- Test 4: Chinese Test
  Chinese input encountered encoding issues in PowerShell, and the model failed to process Chinese correctly.
- Test 5: Code Generation
- Test 6: Concept Explanation (Repetitive Output)
  Severe repetitive output issue occurred.
- Test 7: Logical Reasoning
  Incorrect logical reasoning. The correct answer should be "Cannot be determined".
|
@CISC is the method used in the earlier post a valid way to check the correctness of a model's implementation? Edit: quote removed. Reference added for proper greedy decoding. |
|
@engrtipusultan please don't quote huge posts like that, it makes the thread super hard to read. The steps to verify model conversion faithfulness are:
|
|
It would be great if someone has a high-end server or workstation to look into this issue. The 48B model is extremely large, making it very difficult to debug on my computer. |
|
@cacaview https://gist.github.com/pwilkin/2b917bed6bbabe9fcefa14f7fe7a4bd2 <= you can use this to create a small mock Kimi model, which you can then convert and compare tensor dumps. |
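Comparing tensor dumps between a reference run and the converted model usually comes down to a couple of summary numbers per tensor. The sketch below is one hedged way to do that on flat lists of floats; the function name `compare_tensors` and the choice of metrics (maximum absolute difference and cosine similarity) are assumptions, not part of the linked gist or of llama.cpp's tooling.

```python
import math

def compare_tensors(ref, test):
    """Return (max_abs_diff, cosine_similarity) for two flat float lists."""
    if len(ref) != len(test):
        raise ValueError(f"shape mismatch: {len(ref)} vs {len(test)}")
    max_abs = max(abs(a - b) for a, b in zip(ref, test))
    dot = sum(a * b for a, b in zip(ref, test))
    norm_ref = math.sqrt(sum(a * a for a in ref))
    norm_test = math.sqrt(sum(b * b for b in test))
    # Cosine similarity is undefined when either tensor is all zeros.
    cos = dot / (norm_ref * norm_test) if norm_ref and norm_test else float("nan")
    return max_abs, cos

max_abs, cos = compare_tensors([1.0, 2.0, 3.0], [1.0, 2.0, 3.5])
```

Walking the dump points in layer order and stopping at the first tensor whose cosine similarity drops noticeably tends to localize a bug to one operation, which is much cheaper on a small mock model than on the 48B checkpoint.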
Are you still working on it?
Add debug dump points throughout the KDA and MLA layers to enable tensor inspection during inference:

KDA Layer:
- Conv states (q, k, v) before processing
- Q, K, V after conv1d + SiLU
- SSM state before and after KDA scan
- Output gate (g2)

MLA Layer:
- Added detailed comments mapping tensor names to vLLM equivalents
- Q projection, KV compression, attention output

These callbacks help verify correctness against reference implementations.
…view/llama.cpp into feature/kimi-linear-support
|
It's a bit odd: after recompiling and testing, the earlier garbled-output issue can no longer be reproduced. Tests Passed:
Known Issues:
Currently, tests have only been conducted on CPU and CUDA. |
|
Just a heads up though: without chunking, tri_solve will be very slow on batch size 512 and impossibly slow on batch size 1024 (check with a longer prompt and -b 1024). |
|
I uploaded a 20,000-token file, and it crashed and exited :( |
|
I just committed an implementation of chunking form. It is the naive implementation based on Here are two examples that trigger the chunking mode. One with 74 tokens (>=65 needed to Testing much longer prompts is beyond the capability of my machine. Why could Philip IV of France install Pope of his liking that resulted in Avignon Papacy? He then used the power of the Pope to destroy Knights Templar. How come other Christian countries could not prevent Avignon Papacy? How come they didn't resist and save Knights Templar from prosecution? Please write a 6,000 words essay about this topic.**The Avignon Papacy and the Fall of the Knights Templar: A Comprehensive Analysis of Political Realpolitik in 14th-Century Christendom**Introduction The period from 1305 to 1377, known as the Avignon Papacy, represents one of the most significant ruptures in the history of the Catholic Church. During this time, the papacy was relocated from Rome to Avignon, France, under the influence of Philip IV of France, known as Philip the Fair. The papacy's move to Avignon was not merely a symbolic shift; it was a calculated political maneuver that fundamentally altered the balance of power between the Papacy, secular monarchs, and European Christendom. The Avignon Papacy was marked by the appointment of popes favorable to the French crown, the manipulation of papal authority to suppress dissenting orders such as the Knights Templar, and the erosion of papal independence from secular influence. This essay explores the question: Why could Philip IV of France install a pope of his liking, leading to the Avignon Papacy, and how did this enable the destruction of the Knights Templar? Additionally, we will examine why other Christian monarchies did not resist this consolidation of papal authority under French control, and why the Templars, despite their wealth, military prowess, and widespread influence, could not withstand the coordinated attack launched against them. I. 
The Context: Papal Authority in the Late 13th Century To understand the Avignon Papacy, we must first consider the political and institutional context of the late 13th century. The Papacy had long claimed spiritual supremacy over Christendom, but its actual influence was often contested by secular rulers. The Investiture Controversy of the 11th and 12th centuries had already demonstrated the tensions between papal authority and monarchical power. By the 13th century, the Papacy had regained some prestige through the efforts of Pope Innocent III (1198–1216), who asserted papal supremacy over kings and emperors. However, the Papacy remained vulnerable to political pressure, especially from powerful monarchs like the Capetian kings of France. II. Philip IV and the Control of the Papacy: 1285–1314 Philip IV, born in 1268, came to the throne in 1285 and quickly established himself as one of the most powerful monarchs in Europe. His reign marked a turning point in the relationship between the French monarchy and the Papacy. Unlike his predecessors, Philip was not content to be a vassal of the Pope; he sought to dominate the Papacy, not merely influence it. A. The Election of 1305 and the Selection of Clement V Philip’s first major move was to influence the papal election of 1305. When Pope Boniface VIII died in 1303, Philip had been instrumental in his downfall and eventual death. Boniface’s successor, Benedict XI, lasted only eight months before dying under suspicious circumstances. Philip then pressured the College of Cardinals to elect a Frenchman, Bertrand de Got, as pope. Clement V was crowned on November 14, 1305, in Lyon, and immediately after his election, he issued a series of bulls that placated Philip, including the bull Pastoralis praeeminentiae (1302), which asserted papal authority over all Christendom. Clement V’s reign was marked by political expediency. He avoided Rome, never returned to the city, and instead established a papal court in Avignon. 
This move was not merely logistical; it was a symbolic assertion of papal subordination to French authority. Avignon, a papal enclave surrounded by French territory, became a gilded cage for the papacy. B. Financial and Legal Control Philip IV did not stop at influencing papal elections. He also sought to control the financial and legal operations of the Papacy. Through the Papal Taxation of 1306, Philip imposed a new tax system on the clergy of France, effectively subjecting the Papacy to French financial oversight. The Papacy, dependent on French resources, became increasingly beholden to Philip’s will. III. The Avignon Papacy: Political and Religious Implications The Avignon Papacy lasted from 1305 to 1377, during which time seven French popes ruled from Avignon. This period was characterized by a marked decline in papal prestige and independence. A. Centralization and Bureaucratization The Avignon Papacy centralized papal administration and bureaucratized its operations. This made the Papacy more efficient but also more susceptible to political manipulation. The popes of Avignon were more concerned with temporal power than spiritual leadership, leading to widespread criticism from reformers and laypeople alike. B. Weakening of Papal Authority The Avignon Papacy weakened the Papacy’s moral authority. The Church became associated with French national interests, and papal decisions were increasingly seen as politically motivated. This erosion of spiritual credibility made it easier for monarchs like Philip IV to justify suppressing religious orders that threatened their authority. IV. The Knights Templar: A Threat to Royal Power? The Knights Templar, founded in 1119, had grown into a powerful military and financial institution. They owned vast estates across Europe, acted as bankers for kings, and wielded significant influence. Their wealth and independence made them a target for monarchs seeking to consolidate power. A. 
Financial and Political Ambitions By the early 14th century, the Templars had amassed enormous wealth, including land, gold, and international banking connections. Philip IV, deeply in debt to the Templars, saw them as a threat to his financial and political control. He had already issued the Ordinance on the Mint in 1296, which restricted the Templars’ banking activities, and in 1306, he attempted to seize their assets. B. The Arrest and Suppression On October 13, 1307, Philip ordered the arrest of all Templars in France. They were accused of heresy, blasphemy, and corruption. Under pressure from Philip, Pope Clement V issued the papal bull Pastoralis cautelis (1312), which dissolved the order and transferred its assets to the Knights Hospitaller. The Templars were tried by the papal tribunal at Avignon, but Philip’s influence ensured their eventual suppression. The Templars were disbanded, and their property was seized, further enriching the French crown. V. Why Did Other Christian Monarchs Not Resist? Despite the consolidation of papal authority under French control, other Christian monarchs did not mount a significant resistance. Several factors explain this. A. Lack of Unity Among Monarchs Europe was divided among numerous kingdoms, each with its own interests. There was no unified European resistance to French dominance. England, Germany, and Italy were politically fragmented and often at war with each other. The Holy Roman Empire, in particular, was weak and divided. B. Dependence on French Support Many monarchs relied on French support or were indebted to the French crown. Philip IV’s influence extended beyond France, and his diplomatic and military power deterred opposition. C. Papal Legitimacy and Fear of Schism The Papacy still held significant spiritual authority. Even if monarchs resented papal interference, they feared the consequences of challenging papal legitimacy. A break with Rome could lead to schism and religious instability. 
Some monarchs may have benefited from the suppression of the Templars, as it redistributed wealth and power in ways that aligned with their interests. Others may have calculated that resistance would be costly and ineffective. VI. The Long-Term Consequences of the Avignon Papacy The Avignon Papacy had profound and lasting effects on the Catholic Church and European politics. A. The Great Schism (1378–1417) The Avignon Papacy ended with the election of Pope Gregory XI, who returned to Rome in 1377. However, this did not resolve the crisis. The election of rival popes in Rome and Avignon led to the Great Schism, which further weakened the Papacy’s authority. B. Rise of National Churches The Avignon Papacy contributed to the rise of national churches and the decline of papal supremacy. Monarchs increasingly asserted control over religious affairs within their territories, laying the groundwork for the Reformation. C. Reform Movements The corruption and political entanglements of the Avignon Papacy sparked calls for reform. These movements eventually led to the Conciliar movement and the eventual decentralization of authority within the Church. VII. The Knights Templar: Legacy and Myth Despite their suppression, the Knights Templar have remained a subject of fascination and myth. They were portrayed as martyrs, guardians of hidden secrets, and victims of conspiracy. While many of these narratives are exaggerated, the Templars did represent a unique blend of military, spiritual, and financial power that challenged the medieval order. VIII. Conclusion: The Politics of Power in Medieval Christendom The Avignon Papacy and the fall of the Knights Templar were not isolated events but part of a broader struggle for power in medieval Christendom. Philip IV of France’s ability to install popes of his liking and dismantle the Templars was not merely a triumph of royal ambition; it was the result of a complex web of political, financial, and institutional factors. 
Other Christian monarchs did not resist these developments because they were fragmented, indebted, or complicit. The Papacy, though still spiritually revered, had become a tool of French policy. The suppression of the Templars, while justified by heresy, was also a strategic move to consolidate wealth and power. The Avignon Papacy marked the beginning of the end for the medieval Papacy’s universal authority. It set the stage for the Reformation, the rise of national churches, and the eventual secularization of European politics. In this context, the story of Philip IV, the Avignon Papacy, and the fall of the Knights Templar is not just a tale of ecclesiastical politics, but a reflection of the broader transformations that shaped the modern world. Bibliography Primary Sources:
Secondary Sources:
If you would like a shorter version or a version formatted for academic submission, I can adjust accordingly. >10k prompt to summarize wiki's Nuclear_Option articleuser:Please summarize the following wikipedia article: In the United States Senate, the nuclear option is a legislative procedure that allows the Senate to override a standing rule by a simple majority, avoiding the three-fifths supermajority normally required to invoke cloture on a measure. The term "nuclear option" is an analogy to nuclear weapons being the most extreme option in warfare. The nuclear option can be invoked by a senator raising a point of order that contravenes a standing rule. The presiding officer would then overrule the point of order based on Senate rules and precedents; this ruling would then be appealed and overturned by a simple majority vote (or a tie vote), establishing a new precedent. The nuclear option is made possible by the principle in Senate procedure that appeals from rulings of the chair on points of order relating to nondebatable questions are themselves nondebatable. The nuclear option is most often discussed in connection with the filibuster. Since cloture is a nondebatable question, an appeal in relation to cloture is decided without debate. This obviates the usual requirement for a two-thirds majority to invoke cloture on a resolution amending the Standing Rules. The nuclear option was invoked on November 21, 2013, when a Democratic majority led by Harry Reid used the procedure to reduce the cloture threshold for nominations, other than nominations to the Supreme Court, to a simple majority. On April 6, 2017, the nuclear option was used again, this time by a Republican majority led by Mitch McConnell, to extend that precedent to Supreme Court nominations, in order to enable cloture to be invoked on the nomination of Neil Gorsuch by a simple majority. 
The use of the nuclear option to abolish the 60-vote threshold for cloture on legislation has been proposed, but not successfully effected. On November 21, 2013, following a failed cloture vote on a nomination, the nuclear option was used, as follows: Once the presiding officer rules on the point of order, if the underlying question is nondebatable, any appeal is decided without debate. A simple majority is needed to sustain a decision of the chair. As the appeal is nondebatable, there is no supermajority requirement for cloture, as would be necessary for a proposition amending the rules. The presiding officer and the standing rule can therefore be overruled by a simple majority. This procedure establishes a new precedent that supersedes the plain text of the Standing Rules. These precedents will then be relied upon by future presiding officers in determining questions of procedure. The procedure may, for example, override requirements of Rule XXII, the cloture rule, in order to allow a filibuster to be broken without the usual 60-vote requirement. Originally, the Senate's rules did not provide for a procedure for the Senate to vote to end debate on a question so that it could be voted on, which opened the door to filibusters. In 1917, the Senate introduced a procedure to allow for ending debate (invoking cloture) with a two-thirds majority, later reduced in 1975 to three-fifths of the senators duly chosen and sworn (60 if there is no more than one vacancy). Thus, although a measure might have majority support, opposition from or absence by at least 41 senators can effectively defeat a bill by preventing debate on it from ending, in a tactic known as a filibuster. Since the 1970s, the Senate has also used a "two-track" procedure whereby Senate business may continue on other topics while one item is being filibustered. 
Since filibusters no longer require the minority to actually hold the floor and bring all other business to a halt, the mere threat of a filibuster has gradually become normalized. In the modern Senate, this means that most measures now typically requires 60 votes to advance, unless a specific exception limiting the time for debate applies. Changing Rule XXII to eliminate the 60-vote threshold is made difficult by the rules themselves. Rule XXII, paragraph 2, states that to end debate on any proposition "to amend the Senate rules [...] the necessary affirmative vote shall be two-thirds of the Senators present and voting". If all senators vote, 67 votes are required to invoke cloture on a proposition to amend a rule. Republican Senator Ted Stevens suggested using a ruling of the chair to defeat a filibuster of judicial nominees in February 2003. The code word for the plan was "Hulk". Weeks later, Senator Trent Lott coined the term nuclear option in March 2003 because the maneuver was seen as a last resort with possibly major consequences for both sides. The metaphor of a nuclear strike refers to the majority party unilaterally imposing a change to the filibuster rule, which might provoke retaliation by the minority party. The alternative term "constitutional option" is often used with particular regard to confirmation of executive and judicial nominations, on the theory that the United States Constitution requires these nominations to receive the "advice and consent" of the Senate. Proponents of this term argue that the Constitution implies that the Senate can act by a majority vote unless the Constitution itself requires a supermajority, as it does for certain measures such as the ratification of treaties. 
By effectively requiring a supermajority of the Senate to fulfil this function, proponents believed that (before the changes -- [such as the change made in 2013] -- to require only a simple majority) the previous Senate practice prevented the Senate from exercising its constitutional mandate. The remedy was therefore called the "constitutional option". The maneuver was brought to prominence in 2005 when Majority Leader Bill Frist threatened its use to end Democratic-led filibusters of judicial nominees submitted by President George W. Bush. In response to this threat, Democrats threatened to obstruct all routine Senate business. The ultimate confrontation was prevented by the Gang of 14, a group of seven Democratic and seven Republican Senators, all of whom agreed to oppose the nuclear option and oppose filibusters of judicial nominees, except in extraordinary circumstances. Several of the blocked nominees were brought to the floor, voted upon and approved as specified in the agreement, and others were dropped and did not come up for a vote, as implied by the agreement. In 2011, with a Democratic majority in the Senate (but not a 60-vote majority), Senators Jeff Merkley and Tom Udall proposed "a sweeping filibuster reform package" to be implemented by the nuclear option, but Majority Leader Harry Reid dissuaded them from pushing it forward. The nuclear option was raised again following the congressional elections of 2012, with Senate Democrats still in the majority (but short of a supermajority). The Democrats had been the majority party in the Senate since 2007, but only briefly did they have the 60 votes necessary to halt a filibuster. The Hill reported that Democrats would "likely" use the nuclear option in January 2013 to effect filibuster reform, but the two parties managed to negotiate two packages of amendments to Senate rules concerning filibusters that were agreed to on January 24, 2013, thus avoiding the need for the nuclear option. 
In July 2013, the nuclear option was raised as nominations were being blocked by Senate Republicans as Senate Democrats prepared to push through a change to the chamber's filibuster rule. On July 16, the Senate Democratic majority came within hours of using the nuclear option to win confirmation of seven of President Obama's long-delayed executive branch appointments. The confrontation was avoided when the White House withdrew two of the nominations in exchange for the other five being brought to the floor for a vote, where they were confirmed. Rule XVI of the Standing Rules of the Senate prohibits legislative material from being included in general appropriations bills. In 1995, during consideration of the Emergency Supplemental Appropriations and Rescissions for the Department of Defense to Preserve and Enhance Military Readiness Act of 1995, Senator Kay Bailey Hutchison offered an amendment that would have changed existing law regarding endangered species, therefore violating Rule XVI. Senator Harry Reid raised a point of order against the amendment, which the chair sustained. Hutchison appealed the ruling of the chair. The Senate voted against sustaining the decision of the chair by a vote of 42–57. The Senate thus set a precedent nullifying the provision of Rule XVI. In 1999, the Hutchison precedent was overturned (and the original effect of Rule XVI restored) when the Senate agreed to S.Res. 160, which states: 1996: FedEx precedent Rule XXVIII, paragraph 3, of the Standing Rules of the Senate prohibits any matter outside the scope of a conference from being included in a conference report. In 1996, during consideration of the conference report on the Federal Aviation Reauthorization Act of 1996, Majority Leader Trent Lott raised a point of order that the conference report exceeded the scope of the conference with respect to provisions relating to FedEx. After the point of order was sustained by the chair, Lott appealed the ruling of the chair. 
The Senate voted against sustaining the decision of the chair by a vote of 39–56. The Senate thus set a precedent nullifying the provision of Rule XXVIII. In 2000, the FedEx precedent was overturned (and the original effect of Rule XXVIII restored) when Congress passed the Legislative Branch Appropriations Act for fiscal year 2001, which states, in relevant part: 2013: Cloture on nominations On November 21, 2013, Majority Leader Harry Reid raised a point of order that "the vote on cloture under Rule XXII for all nominations other than for the Supreme Court of the United States is by majority vote." The presiding officer overruled the point of order, and the Senate voted 48–52 against sustaining the decision of the chair. The Senate therefore set a precedent that cloture can be invoked on nominations (except to the Supreme Court) by a simple majority, even though the plain text of the rule requires "three-fifths of the senators duly chosen and sworn" to invoke cloture. Three Democrats (Carl Levin, Joe Manchin and Mark Pryor) voted with all Republicans in favor of sustaining the decision of the chair. The text of Rule XXII was never changed. Although the 60-vote threshold was eliminated for most nominations, nominations are still susceptible to being delayed by filibusters, and 60 votes were still required to invoke cloture on other questions such as legislation and Supreme Court nominations. The Democrats' stated motivation for this change was the perceived expansion of filibustering by Republicans during the Obama administration, in particular blocking three nominations to the United States Court of Appeals for the District of Columbia Circuit. Republicans had asserted that the D.C. Circuit was underworked, and also cited the need for cost reduction by reducing the number of judges in that circuit. At the time of the vote, 59 executive branch nominees and 17 judicial nominees were awaiting confirmation. 
Prior to November 21, 2013, there had been only 168 cloture motions filed (or reconsidered) with regard to nominations. Nearly half of them (82) had been during the Obama administration. However, those cloture motions were often filed merely to speed things along, rather than in response to any filibuster. In contrast, there were just 38 cloture motions on nominations during the preceding eight years under President George W. Bush. Most of those cloture votes were successful. Obama won Senate confirmation for 30 out of 42 (71%) federal appeals court nominations, compared with Bush's 35 out of 52 (67%). Regarding Obama's federal district court nominations, the Senate approved 143 out of 173 (83%) as of November 2013, compared to George W. Bush's first term 170 of 179 (95%), Bill Clinton's first term 170 of 198 (86%), and George H.W. Bush's 150 of 195 (77%). Filibusters were used on 20 of Obama's nominations to district court positions, but Republicans had allowed confirmation of 19 out of the 20 before the nuclear option was invoked. On April 6, 2017, the Republican-majority Senate invoked the nuclear option and voted 48–52 along party lines against sustaining the decision of the chair on a point of order raised by Majority Leader Mitch McConnell, thus removing the Supreme Court exception created in 2013. This established a new precedent which allowed cloture to be invoked on Supreme Court nominations by a simple majority. The vote came after Senate Democrats filibustered the nomination of Neil Gorsuch to the Supreme Court of the United States. 
On April 3, 2019, in response to a perceived increase in postcloture filibusters by Senate Democrats on President Trump's executive and judicial nominations, the Republican-majority Senate voted 51-49 to overturn a ruling of the chair and thus set a precedent that postcloture debate on nominations—other than those to the Supreme Court of the United States, to the United States courts of appeals and to positions at Level I of the Executive Schedule—is two hours. All Republicans except Senators Susan Collins and Mike Lee voted against sustaining the decision of the chair. Senate Democrats accused the Republican majority under Majority Leader John Thune of exercising the nuclear option three times in 2025: Proposed use for legislation Following elimination of the 60-vote rule for nominations in 2013, senators expressed concerns that the 60-vote rule will eventually be eliminated for legislation via the nuclear option. While President, Donald Trump spoke out against the 60-vote requirement for legislation on several occasions. Then-Senate Majority Leader Mitch McConnell opposed abolishing the filibuster despite Trump's demands, and in April 2017, 61 senators (32 Republicans, 28 Democrats, and one independent) signed a letter stating their opposition to abolishing the filibuster for legislation. On January 21, 2018, Trump said on Twitter that if the shutdown stalemate continued, Republicans should consider the "nuclear option" in the Senate. He repeated the call on December 21, 2018, with a fresh shutdown looming. Concerns about abolishing the filibuster through the nuclear option were reiterated in 2021 as the Democratic-majority Senate could attempt to eliminate the filibuster through the nuclear option. On January 3, 2022, Senate Majority Leader Chuck Schumer announced that the Senate would vote on using the nuclear option to reform the filibuster in order to pass his party's election reform legislation. 
On January 19, 2022, Schumer made a point of order that would have allowed a 'talking filibuster' on a voting rights bill without any other dilatory measures. The Senate voted 52–48 to sustain the decision of the chair overruling the point of order. Senators Joe Manchin and Kyrsten Sinema voted with all Republicans in favor of sustaining the decision of the chair. In September 2024, then Vice President and presidential candidate Kamala Harris called for ending the filibuster to enact abortion legislation. After the appointment of Amy Coney Barrett to the Supreme Court of the United States, a proposed countermove to increase the size of the Supreme Court past nine seats (and thereby create new vacancies) was also described as a "nuclear option". Beyond the specific context of the U.S. Senate, the term "nuclear option" has come to be used generically for a procedural maneuver with potentially serious consequences, to be used as a last resort to overcome political opposition. The term has been used in connection with procedural maneuvers in various state senates. In a 2005 legal ruling on the validity of the Hunting Act 2004 the UK House of Lords, sitting in its judicial capacity, used "nuclear option" to describe the events of 1832, when the then-government threatened to create hundreds of new Whig peers to force the Tory-dominated Lords to accept the Reform Act 1832. (Nuclear weapons were not theorized until the 20th century, so the government's threat was not labeled as "nuclear" at the time.) assistant: Summary: The Nuclear Option in the U.S. Senate The "nuclear option" is a procedural tactic in the U.S. Senate that allows the chamber to bypass a 60-vote filibuster threshold for changing Senate rules by a simple majority vote (51 senators), rather than the usual two-thirds supermajority. It was coined in 2003 and analogized to a last-resort nuclear strike. Key Points:
In essence, the nuclear option is a controversial but established tool in Senate history, used to override long-standing rules by a simple majority, often in response to partisan gridlock. |
I also found that running long contexts with the recurrent form seems to hit a hard limit in llama_graph and crash. Now that the chunking form is implemented, this issue seems to be gone. |
|
I found that the optimal way to go in GGML is not chunking and recurrent, but actually chunking and autoregressive, with the autoregressive form for |
Thanks for your comment. Do you mean there is no need to change the chunking code? The autoregressive code is expected to be faster than the recurrent code? |
|
Autoregressive code skips all the operations needed to create the decay mask etc., so it's much faster for generation. The chunking code can be optimized on the GGML level (factorizing the chunk-irrelevant operations before the chunk loop, removing unnecessary cont / transpose operations etc.). You can compare with the history of Qwen3 Next pulls to see how I did it :) but basically for the first PR, a correct version with chunking + autoregressive passes is enough, optimizations can come later. |
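To make the point about the autoregressive path concrete, here is a minimal single-head, single-token sketch in plain Python. It is not the code from this PR: the function name `kda_step`, the nested-list layout, and the exact update rule (per-channel decay, then a delta-rule write, then readout) are assumptions based on the "decay first, then retrieval" order noted in the commit log. Query scaling and gating of the output are left out.

```python
import math

def kda_step(S, q, k, v, g, beta):
    """One autoregressive step for a single head (illustrative only).

    S    : d_k x d_v state as a list of rows, updated in place
    q, k : length d_k query / key (assumed already L2-normalized)
    v    : length d_v value
    g    : length d_k log-decay gates (<= 0), one per key channel
    beta : scalar write strength in [0, 1]
    """
    d_k, d_v = len(k), len(v)
    # 1) Decay first: scale each key-channel row of the state.
    for i in range(d_k):
        a = math.exp(g[i])
        for j in range(d_v):
            S[i][j] *= a
    # 2) Retrieval: what the decayed state currently predicts for key k.
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Delta-rule write: move the state toward v along direction k.
    for i in range(d_k):
        for j in range(d_v):
            S[i][j] += beta * k[i] * (v[j] - pred[j])
    # 4) Read out with the query.
    return [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]

# Hypothetical toy usage: a 2x2 state and one token.
state = [[0.0, 0.0], [0.0, 0.0]]
out = kda_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0],
               g=[0.0, 0.0], beta=1.0)
```

With one token per step there is no intra-chunk decay mask or cumulative sum to build, which is why this path has far fewer graph nodes than the chunked form during generation.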
|
What I meant was that I learned from the Qwen3 Next implementation when writing this one, so there should be no changes to Qwen3 Next. |
|
I synced my code from b7240 to b7243 for easier review and merge. It still seems to work with the GGUFs I made before. Please give it a try and see if there are any new bugs. I will try to implement the autoregressive form as suggested by pwilkin next. |
|
@ymcki I found that Kimi Linear has larger memory usage (4 GB) compared to Qwen3 Next (1 GB) at the same context size. Is that expected behavior? The response is also gibberish. This is on commit ymcki@30d883c with the model downloaded from https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF/commit/d32f0993538b01d14e5470d1b2b50f297fd25498 . Is there any info I need to provide to help you reproduce it? |
|
It is possible that Kimi Linear, under the current implementation, uses more VRAM than Qwen3 Next for the new chunking and recurrent code, because I used ggml_repeat and ggml_mul in place of ggml_mul_mat, since ggml only supports up to four dimensions. But 4x seems too much. Can you tell me how to reproduce it?
When you say gibberish, do you mean you get it all the time or only for a specific prompt? If the latter, can you show me the prompt? |
You definitely don't want to do that. Qwen3 Next also has examples on how to pack the matrices for MUL_MAT and then unpack them again :) |
|
Replaced build_kda_recurrent with build_kda_autoregressive. About a 60% gain in token generation speed. Code is also synced to b7682.

recurrent form llama-bench: pp ~450 t/s, tg ~20 t/s

./build/bin/llama-bench -m ~/Kimi-Linear-48B-A3B-Instruct-GGUF/Kimi-Linear-48B-A3B-Instruct.Q2_K.gguf -n 32 -d 8192 -b 64,128,256,512,1024,2048,4096,8192,16384 -ot shexp=CUDA0,"model.layers.0."=CUDA0
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | n_batch | ot | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------------- | --------------: | -------------------: |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | shexp=CUDA0 | pp512 @ d8192 | 408.36 ± 1.12 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | shexp=CUDA0 | tg32 @ d8192 | 19.82 ± 0.05 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | shexp=CUDA0 | pp512 @ d8192 | 358.01 ± 1.18 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | shexp=CUDA0 | tg32 @ d8192 | 19.64 ± 0.22 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | shexp=CUDA0 | pp512 @ d8192 | 418.22 ± 1.29 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | shexp=CUDA0 | tg32 @ d8192 | 19.54 ± 0.22 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | shexp=CUDA0 | pp512 @ d8192 | 452.92 ± 3.47 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | shexp=CUDA0 | tg32 @ d8192 | 19.75 ± 0.08 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | shexp=CUDA0 | pp512 @ d8192 | 452.00 ± 4.07 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | shexp=CUDA0 | tg32 @ d8192 | 19.67 ± 0.12 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | shexp=CUDA0 | pp512 @ d8192 | 450.41 ± 4.37 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | shexp=CUDA0 | tg32 @ d8192 | 19.61 ± 0.08 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | shexp=CUDA0 | pp512 @ d8192 | 450.61 ± 3.60 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | shexp=CUDA0 | tg32 @ d8192 | 19.92 ± 0.05 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | shexp=CUDA0 | pp512 @ d8192 | 449.15 ± 3.80 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | shexp=CUDA0 | tg32 @ d8192 | 19.54 ± 0.15 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | shexp=CUDA0 | pp512 @ d8192 | 449.35 ± 3.07 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | shexp=CUDA0 | tg32 @ d8192 | 19.86 ± 0.12 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | model.layers.0.=CUDA0 | pp512 @ d8192 | 399.53 ± 0.74 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.85 ± 0.06 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | model.layers.0.=CUDA0 | pp512 @ d8192 | 351.87 ± 1.73 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.85 ± 0.15 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | model.layers.0.=CUDA0 | pp512 @ d8192 | 413.28 ± 0.88 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.79 ± 0.08 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | model.layers.0.=CUDA0 | pp512 @ d8192 | 446.92 ± 3.52 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.91 ± 0.08 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | model.layers.0.=CUDA0 | pp512 @ d8192 | 446.71 ± 2.79 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.79 ± 0.12 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | model.layers.0.=CUDA0 | pp512 @ d8192 | 446.87 ± 3.15 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.83 ± 0.08 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | model.layers.0.=CUDA0 | pp512 @ d8192 | 446.66 ± 3.05 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.77 ± 0.08 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | model.layers.0.=CUDA0 | pp512 @ d8192 | 445.67 ± 3.94 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.49 ± 0.34 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | model.layers.0.=CUDA0 | pp512 @ d8192 | 445.92 ± 4.00 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | model.layers.0.=CUDA0 | tg32 @ d8192 | 19.44 ± 0.20 |

build: 67bee56 (7243)

autoregressive form llama-bench: pp ~450 t/s, tg ~32 t/s

./build/bin/llama-bench -m ~/Kimi-Linear-48B-A3B-Instruct-GGUF/Kimi-Linear-48B-A3B-Instruct.Q2_K.gguf -n 32 -d 8192 -b 64,128,256,512,1024,2048,4096,8192,16384 -ot shexp=CUDA0,"model.layers.0."=CUDA0
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | n_batch | ot | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------------- | --------------: | -------------------: |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | shexp=CUDA0 | pp512 @ d8192 | 349.81 ± 1.44 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | shexp=CUDA0 | tg32 @ d8192 | 30.83 ± 2.42 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | shexp=CUDA0 | pp512 @ d8192 | 353.50 ± 1.19 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | shexp=CUDA0 | tg32 @ d8192 | 30.67 ± 0.46 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | shexp=CUDA0 | pp512 @ d8192 | 414.68 ± 1.34 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | shexp=CUDA0 | tg32 @ d8192 | 31.82 ± 0.18 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | shexp=CUDA0 | pp512 @ d8192 | 449.73 ± 3.58 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | shexp=CUDA0 | tg32 @ d8192 | 31.80 ± 0.40 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | shexp=CUDA0 | pp512 @ d8192 | 451.11 ± 2.85 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | shexp=CUDA0 | tg32 @ d8192 | 31.67 ± 0.50 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | shexp=CUDA0 | pp512 @ d8192 | 451.03 ± 4.71 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | shexp=CUDA0 | tg32 @ d8192 | 30.58 ± 1.91 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | shexp=CUDA0 | pp512 @ d8192 | 450.53 ± 3.25 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | shexp=CUDA0 | tg32 @ d8192 | 31.60 ± 0.45 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | shexp=CUDA0 | pp512 @ d8192 | 448.78 ± 3.66 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | shexp=CUDA0 | tg32 @ d8192 | 31.29 ± 1.03 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | shexp=CUDA0 | pp512 @ d8192 | 449.71 ± 3.19 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | shexp=CUDA0 | tg32 @ d8192 | 31.51 ± 0.48 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | model.layers.0.=CUDA0 | pp512 @ d8192 | 348.25 ± 0.75 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 64 | model.layers.0.=CUDA0 | tg32 @ d8192 | 31.92 ± 0.10 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | model.layers.0.=CUDA0 | pp512 @ d8192 | 350.55 ± 2.45 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 128 | model.layers.0.=CUDA0 | tg32 @ d8192 | 30.88 ± 2.09 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | model.layers.0.=CUDA0 | pp512 @ d8192 | 413.78 ± 0.99 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 256 | model.layers.0.=CUDA0 | tg32 @ d8192 | 30.85 ± 2.32 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | model.layers.0.=CUDA0 | pp512 @ d8192 | 448.86 ± 3.44 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 512 | model.layers.0.=CUDA0 | tg32 @ d8192 | 31.93 ± 0.11 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | model.layers.0.=CUDA0 | pp512 @ d8192 | 449.72 ± 4.05 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 1024 | model.layers.0.=CUDA0 | tg32 @ d8192 | 32.14 ± 0.14 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | model.layers.0.=CUDA0 | pp512 @ d8192 | 448.00 ± 5.42 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 2048 | model.layers.0.=CUDA0 | tg32 @ d8192 | 31.83 ± 0.39 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | model.layers.0.=CUDA0 | pp512 @ d8192 | 448.81 ± 3.38 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 4096 | model.layers.0.=CUDA0 | tg32 @ d8192 | 31.90 ± 0.40 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | model.layers.0.=CUDA0 | pp512 @ d8192 | 447.99 ± 3.44 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 8192 | model.layers.0.=CUDA0 | tg32 @ d8192 | 31.91 ± 0.14 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | model.layers.0.=CUDA0 | pp512 @ d8192 | 448.73 ± 2.95 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 99 | 16384 | model.layers.0.=CUDA0 | tg32 @ d8192 | 31.39 ± 1.15 |

build: 40f6118 (7682) |
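As a quick sanity check on the "about 60%" figure, the relative token generation gain can be computed from a few of the tg32 @ d8192 rows in the llama-bench output above. The snippet only copies three shexp=CUDA0 rows per run; the helper name `gain_percent` is made up for illustration.

```python
# tg32 @ d8192 throughput in t/s, copied from the shexp=CUDA0 rows above,
# keyed by n_batch.
recurrent = {64: 19.82, 512: 19.75, 16384: 19.86}
autoregressive = {64: 30.83, 512: 31.80, 16384: 31.51}

def gain_percent(old, new):
    """Relative speedup of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0

gains = {nb: gain_percent(recurrent[nb], autoregressive[nb]) for nb in recurrent}
for nb, g in gains.items():
    print(f"n_batch={nb}: +{g:.1f}% tg")
```

Prompt processing (pp512) is essentially unchanged between the two runs, which is consistent with the chunked path still handling multi-token batches.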
Yeah, now you do the same thing that's done in Qwen3Next: vary which form is used according to |
The problem is that the cumsum term of KDA has an extra dimension of S_k. It doesn't seem possible to me to mul_mat two [chunk_size,chunk_size,S_k,CHB] tensors, as mul_mat only works on the first two dimensions. If I use clamp, it is possible to use mul_mat, but then the solution is an approximation and not exact. Would it work if I somehow packed them into [chunk_size*chunk_size,S_k,n_chunks,HB]? |
Nah, that's not how it works. MUL_MAT works in all dimensions, but the 3rd and 4th are batch dimensions, so their role is technically a "foreach" loop. That's why you can generally squish the batch dimensions without any problems. As a general rule, you can always (w.r.t. correctness) do |
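The "foreach" reading of the batch dimensions can be checked with a small plain-Python sketch. It uses nested lists and ordinary row-major matrix products rather than ggml tensors or ggml's dimension ordering, and all function names here are illustrative. The point it demonstrates is that merging the two batch dimensions into one before the product gives the same matrices as doing the product over both batch dimensions and merging afterwards.

```python
def matmul(a, b):
    """Plain 2-D matrix product of nested lists: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def batched_matmul_4d(a, b):
    """Multiply a[i3][i2] @ b[i3][i2] for every pair of batch indices."""
    return [[matmul(a2, b2) for a2, b2 in zip(a3, b3)]
            for a3, b3 in zip(a, b)]

def batched_matmul_3d(a, b):
    """Same thing with a single batch dimension."""
    return [matmul(a2, b2) for a2, b2 in zip(a, b)]

def flatten_batches(t):
    """Merge the two outer batch dimensions into one."""
    return [m for outer in t for m in outer]

# Toy data: batch dims 2 x 3, each entry a 2 x 2 integer matrix.
a = [[[[i3 + i2 + r + c for c in range(2)] for r in range(2)]
      for i2 in range(3)] for i3 in range(2)]
b = [[[[i3 * i2 + r - c for c in range(2)] for r in range(2)]
      for i2 in range(3)] for i3 in range(2)]

# Squishing the batch dims before or after the product is equivalent.
same = (flatten_batches(batched_matmul_4d(a, b))
        == batched_matmul_3d(flatten_batches(a), flatten_batches(b)))
```

The same reasoning is what makes it safe to reshape a 5-D problem so that everything except the two matrix dimensions is folded into the batch dimensions, as long as both operands are folded the same way.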
|
Committed a mul_mat with clamp version. VRAM usage reduced from 4GB to 3.6GB with the command below. Is this good enough? ./build/bin/llama-cli -m ~/Kimi-Linear-48B-A3B-Instruct-GGUF/Kimi-Linear-48B-A3B-Instruct.Q4_K_M.gguf -c 8192 -cmoe -ngl 100 |
|
Committed a mul_mat without clamp version that should give exact solutions. Code sync'd to b7712. |
|
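I noticed that the version of @cacaview fixed by @Aaryan-Kapoor does not yet have GGUFs needs to be regenerated to enable MLA KV cache. I updated the GGUFs in the HF repo. The old GGUFs won't have MLA KV cache in the new implementation but my code is I believe my implementation is now complete for an initial version. It would This is a Q4_K_M reply (translated here from Japanese) to a query over the 180K Wikipedia article:
The following is the Wikipedia article on Yukio Mishima. Based on it, please summarize the events of his life as a timeline.
Yukio Mishima's life was deeply intertwined with Japan's turbulent era from the prewar through the postwar period, and he was a figure who had a notable influence not only through his literary achievements but also on political thought and action. Below, the main events of his life are summarized in timeline form.
1925 (Taishō 14)
1938 (Shōwa 13)
1941 (Shōwa 16)
1945 (Shōwa 20)
1946 (Shōwa 21)
1947 (Shōwa 22)
1948 (Shōwa 23)
1949 (Shōwa 24)
1951 (Shōwa 26)
1954 (Shōwa 29)
1956 (Shōwa 31)
1961 (Shōwa 36)
1965 (Shōwa 40)
1968 (Shōwa 43)
1970 (Shōwa 45)
Posthumous reputation
Thus, Yukio Mishima's life was that of **a figure who combined literature with politics and action and is deeply engraved in Japan's postwar history |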
|
I tried to test it, but it gave me an error. @ymcki |
Please compile the CPU version. In the current PR, we will only discuss the implementation related to the CPU version. |
Are you using an outdated convert_hf_to_gguf.py to generate your GGUFs? I got this error when I used the old ones without these five lines: They are essential to trigger the use of MLA KV cache. Otherwise, you will get this set_rows assertion in cpy_k. |
I downloaded it from https://huggingface.co/ymcki/Kimi-Linear-48B-A3B-Instruct-GGUF |
I am sorry, the Q4_K_M GGUF I uploaded is outdated. I am re-uploading it now. Meanwhile, you can try Q2_K. |
|
Shouldn't this be closed now? |
This is the current work progress:
#16930 (comment)