
ngram-mod: Reset i_last when low acceptance streak occurs #22168

Merged
ggerganov merged 1 commit into ggml-org:master from treo:master on Apr 21, 2026

treo (Contributor) commented Apr 20, 2026

Overview

By resetting i_last to zero, we will include the current context when rebuilding the speculative map.

The existing behavior would skip the current context, thereby often losing the benefit of speculative decoding during later parts of the generation.

The effect of this seems to depend on both the model and the speculation parameters.
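
To illustrate the difference, here is a minimal sketch under a simplified model of the map. Only `i_last` comes from this PR; `toy_ngram_map`, `update`, `reset`, `predict`, and `demo` are hypothetical names made up for the illustration and are not the actual ngram-mod code. The sketch assumes the map is filled incrementally from position `i_last` onward and cleared on a low acceptance streak.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy stand-in for the speculative n-gram map (hypothetical, NOT the llama.cpp implementation).
struct toy_ngram_map {
    static constexpr size_t N = 2;                 // tiny n-gram size, chosen for the demo only
    std::map<std::vector<int32_t>, int32_t> next;  // n-gram -> token that followed it
    size_t i_last = 0;                             // first context position not yet indexed

    // Index every n-gram in ctx that starts at or after i_last.
    void update(const std::vector<int32_t> & ctx) {
        for (; i_last + N < ctx.size(); ++i_last) {
            const std::vector<int32_t> key(ctx.begin() + i_last, ctx.begin() + i_last + N);
            next[key] = ctx[i_last + N];
        }
    }

    // Assumed to be called on a low acceptance streak: drop all entries.
    // reset_i_last == true models the behavior proposed in this PR.
    void reset(bool reset_i_last) {
        next.clear();
        if (reset_i_last) {
            i_last = 0;
        }
    }

    // Predicted token after the n-gram (a, b), or -1 if the map has no entry.
    int32_t predict(int32_t a, int32_t b) const {
        const std::vector<int32_t> key = {a, b};
        const auto it = next.find(key);
        return it == next.end() ? -1 : it->second;
    }
};

// Build the map, hit a low acceptance streak, generate one more token,
// then ask for a continuation that only exists in the earlier context.
int32_t demo(bool reset_i_last) {
    toy_ngram_map m;
    std::vector<int32_t> ctx = {1, 2, 3, 4, 5};
    m.update(ctx);
    m.reset(reset_i_last);
    ctx.push_back(6);
    m.update(ctx);
    return m.predict(1, 2);
}
```

In this toy model, `demo(false)` returns -1 because only the n-gram added after the reset is indexed, while `demo(true)` returns 3 because the whole context is re-indexed. The cost is the extra pass over the context in `update` after each reset.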

Benchmark:
vllm bench serve --model google/gemma-4-26B-A4B-it --host 127.0.0.1 --port 9876 --num-prompts 96 --dataset-name hf --dataset-path vdaita/edit_5k_char --backend openai-chat --endpoint '/v1/chat/completions' --max-concurrency 1

Gemma 4 26B-A4B:

  • Baseline: 97.57 t/s (peak 108)
  • without i_last = 0: 141.08 t/s (peak 1032)
  • with i_last = 0: 155.10 t/s (peak 1032)

Qwen 3.6 35B-A3B:

  • Baseline: 112.54 t/s (peak 127)
  • without i_last = 0: 148.56 t/s (peak 774)
  • with i_last = 0: 153.65 t/s (peak 768)

As we can see, the effect isn't huge, but at roughly 3% to 10% it is still measurable.
The price we pay for it is 1 line of code and a moment longer for speculative map repopulation whenever a low acceptance streak occurs.
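
For reference, the relative gain can be recomputed from the throughput numbers listed above. This is only a small arithmetic sketch; `gain_percent` is a hypothetical helper name and not part of the change.

```cpp
// Relative throughput gain of `with_reset` over `without_reset`, in percent.
// Hypothetical helper, used only to recompute the figures quoted in this PR.
double gain_percent(double without_reset, double with_reset) {
    return (with_reset / without_reset - 1.0) * 100.0;
}
```

With the reported values, `gain_percent(141.08, 155.10)` comes out at about 9.9 for Gemma 4 26B-A4B and `gain_percent(148.56, 153.65)` at about 3.4 for Qwen 3.6 35B-A3B.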

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. Used to help understand the ngram-mod flow.

treo requested a review from a team as a code owner on April 20, 2026 13:05
ggerganov (Member) commented

What speculative decoding parameters did you use in this test?

treo (Contributor, Author) commented Apr 21, 2026

Oh, sorry, it appears that while cleaning up the layout I deleted the lines with the parameters.
The exact draft parameters were: --draft-max 128 --spec-ngram-size-n 48 --draft-min 2 --spec-type ngram-mod

Additionally, I had to disable reasoning (--reasoning off) to make the results of the benchmark repeatable.

I've been using this change ever since proposing it for agentic coding with slightly different parameters: --spec-type ngram-mod --spec-ngram-size-n 32 --draft-max 64 --draft-min 32. Whenever the model needs to reproduce big chunks of code, I often see 3x to 7x over baseline with context sizes over 100k tokens.

In my build I also added a little debug logging, and it is not uncommon to see a low acceptance streak right before a big chunk of 100% acceptance. Without the i_last reset, the rebuilt speculative map wouldn't have anything to work with in those cases.

ggerganov merged commit 72d693e into ggml-org:master on Apr 21, 2026
1 check passed
TheTom added a commit to TheTom/llama-cpp-turboquant that referenced this pull request Apr 22, 2026
Cherry-picks 4 upstream PRs to enable speculative decoding on hybrid
MoE+SSM architectures (Qwen3.6-35B-A3B):

- ggml-org#19493 — speculative checkpointing (save/restore recurrent state)
- ggml-org#22114 — refactor "use checkpoint" logic
- ggml-org#22168 — reset i_last on low acceptance streak
- ggml-org#22223 — add --spec-default argument

Smoke tested on M5 Max with turbo4 KV — zero regression.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>