Skip to content

common : add common_speculative_is_compat()#19270

Merged
ggerganov merged 3 commits into
masterfrom
gg/spec-disable-for-recurrent
Feb 6, 2026
Merged

common : add common_speculative_is_compat()#19270
ggerganov merged 3 commits into
masterfrom
gg/spec-disable-for-recurrent

Conversation

@ggerganov

@ggerganov ggerganov commented Feb 2, 2026

Copy link
Copy Markdown
Member

fix #19267

Memory modules that do not support removing the last tokens from the context (such as recurrent modules) cannot perform speculative decoding. Add new common_speculative_is_compat() to query this functionality and use it in llama-server to disable speculative decoding for those contexts.

@ngxson ngxson left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering if we can do this inside common_speculative_init instead.

For example, common_speculative_init can try to evaluate 2 tokens, then remove the first one. If llama_memory_seq_rm returns error, then we throw an error saying the model is not compatible.

Btw, I think it's better to throw an error and exit, rather than a warning.

@coder543

coder543 commented Feb 3, 2026

Copy link
Copy Markdown

I just ran into #19267, and it would be cool if there were a way to make this compatible rather than just disabling it, but disabling it is better than crashing. With Qwen3-Coder-Next, ngram-mod could provide large speedups during coding workflows.

@ggerganov

Copy link
Copy Markdown
Member Author

@ngxson Implemented this idea in a new common_speculative_is_compat() helper function.

Btw, I think it's better to throw an error and exit, rather than a warning.

Do you have something specific in mind? In my server config, I want to set a default ngram-based spec decoding and have it applied for all routed models. When a routed model does not support it, it still continues to work. So I think a warning is better.

@ggerganov ggerganov changed the title llama : add llama_memory_can_rm_suffix() common : add common_speculative_is_compat() Feb 4, 2026
@ggerganov ggerganov requested a review from ngxson February 5, 2026 08:08
@ggerganov

Copy link
Copy Markdown
Member Author

@ngxson Gentle ping

@ngxson ngxson left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sorry I missed the notif. LGTM!

Just wondering if we should also do the same check for draft model.

@ggerganov

Copy link
Copy Markdown
Member Author

Yes, I think we can do that. Will follow up in next PR.

@ggerganov ggerganov merged commit dfde599 into master Feb 6, 2026
70 of 78 checks passed
@ggerganov ggerganov deleted the gg/spec-disable-for-recurrent branch February 6, 2026 14:47
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* llama : add llama_memory_can_rm_suffix()

* Revert "llama : add llama_memory_can_rm_suffix()"

This reverts commit d30e59b.

* spec : check if the target context is compatible for spec decoding
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
* llama : add llama_memory_can_rm_suffix()

* Revert "llama : add llama_memory_can_rm_suffix()"

This reverts commit d30e59b.

* spec : check if the target context is compatible for spec decoding
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* llama : add llama_memory_can_rm_suffix()

* Revert "llama : add llama_memory_can_rm_suffix()"

This reverts commit d30e59b.

* spec : check if the target context is compatible for spec decoding
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
* llama : add llama_memory_can_rm_suffix()

* Revert "llama : add llama_memory_can_rm_suffix()"

This reverts commit d30e59b.

* spec : check if the target context is compatible for spec decoding
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
* llama : add llama_memory_can_rm_suffix()

* Revert "llama : add llama_memory_can_rm_suffix()"

This reverts commit d30e59b.

* spec : check if the target context is compatible for spec decoding
my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026
* llama : add llama_memory_can_rm_suffix()

* Revert "llama : add llama_memory_can_rm_suffix()"

This reverts commit d30e59b.

* spec : check if the target context is compatible for spec decoding
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
* llama : add llama_memory_can_rm_suffix()

* Revert "llama : add llama_memory_can_rm_suffix()"

This reverts commit d30e59b.

* spec : check if the target context is compatible for spec decoding
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Eval bug: spec-type ngram-mod crash with Qwen3Next

4 participants