ngram-mod: Reset i_last when low acceptance streak occurs by treo · Pull Request #22168 · ggml-org/llama.cpp

treo · 2026-04-20T13:05:51Z

Overview

By resetting i_last to zero, we will include the current context when rebuilding the speculative map.

The existing behavior would skip the current context, thereby often losing the benefit of speculative decoding during later parts of the generation.

The effect of this seems to depend on both the model and the speculation parameters.
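As a rough illustration of the skipped-context behavior described above, here is a toy Python model. It is not the llama.cpp implementation: `ToyNgramMap`, its methods, and the plain dict are assumptions made for the example. Only the name `i_last` and the idea of clearing the map on a low acceptance streak come from this PR.

```python
class ToyNgramMap:
    """Toy stand-in for a speculative n-gram map (hypothetical, not llama.cpp code)."""

    def __init__(self, n):
        self.n = n        # length of the n-gram used as lookup key
        self.table = {}   # n-gram (tuple of tokens) -> token that followed it
        self.i_last = 0   # first context position that has not been indexed yet

    def update(self, tokens):
        """Index every n-gram whose successor token lies in tokens[i_last:]."""
        start = max(self.i_last, self.n)
        for i in range(start, len(tokens)):
            self.table[tuple(tokens[i - self.n:i])] = tokens[i]
        self.i_last = len(tokens)

    def on_low_acceptance_streak(self, reset_i_last):
        """Clear the map; optionally rewind i_last so the context is re-indexed."""
        self.table.clear()
        if reset_i_last:
            self.i_last = 0

    def draft(self, tokens, max_len):
        """Greedily follow the map from the tail of the context."""
        out, ctx = [], list(tokens)
        while len(out) < max_len:
            nxt = self.table.get(tuple(ctx[-self.n:]))
            if nxt is None:
                break
            out.append(nxt)
            ctx.append(nxt)
        return out


def draft_after_streak(context, reset_i_last, n=2, max_len=4):
    m = ToyNgramMap(n)
    m.update(context)
    m.on_low_acceptance_streak(reset_i_last)
    m.update(context)  # first update after the reset, no new tokens yet
    return m.draft(context, max_len)


context = [1, 2, 3, 4, 1, 2]
without_reset = draft_after_streak(context, reset_i_last=False)
with_reset = draft_after_streak(context, reset_i_last=True)
print(without_reset, with_reset)
```

In this toy model, the map that keeps its old `i_last` stays empty until new tokens arrive, so it cannot draft anything from the context it has already seen. The map that rewinds `i_last` to zero is rebuilt from the whole context on the next update.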

Benchmark:
vllm bench serve --model google/gemma-4-26B-A4B-it --host 127.0.0.1 --port 9876 --num-prompts 96 --dataset-name hf --dataset-path vdaita/edit_5k_char --backend openai-chat --endpoint '/v1/chat/completions' --max-concurrency 1

Gemma 4 26B-A4B:

Baseline: 97.57 t/s (peak 108)
without i_last = 0: 141.08 t/s (peak 1032)
with i_last = 0: 155.10 t/s (peak 1032)

Qwen 3.6 35B-A3B:

Baseline: 112.54 t/s (peak 127)
without i_last = 0: 148.56 t/s (peak 774)
with i_last = 0: 153.65 t/s (peak 768)

As we can see, the effect isn't huge, but at roughly 3% to 10% over ngram-mod without the reset it is still measurable.
The price we pay for it is 1 line of code and a moment longer for speculative map repopulation whenever a low acceptance streak occurs.
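For reference, the relative gain can be recomputed from the throughput figures listed above. The short script below is only arithmetic on those numbers. Measured against the run without the reset, it gives about 9.9% for Gemma and 3.4% for Qwen; dividing by the patched throughput instead gives about 9.0% and 3.3%.

```python
# Throughput in tokens per second, copied from the benchmark results above.
results = {
    "Gemma 4 26B-A4B": {"baseline": 97.57, "without": 141.08, "with": 155.10},
    "Qwen 3.6 35B-A3B": {"baseline": 112.54, "without": 148.56, "with": 153.65},
}


def gain_percent(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Gain of "with i_last = 0" over "without i_last = 0" for each model.
gains = {
    model: gain_percent(r["with"], r["without"]) for model, r in results.items()
}

for model, g in gains.items():
    print(f"{model}: {g:.1f}% over ngram-mod without the reset")
```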

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES. Used to help understand the ngram-mod flow.

ggerganov · 2026-04-21T17:24:10Z

What speculative decoding parameters did you use in this test?

treo · 2026-04-21T18:13:48Z

Oh, sorry, it appears that while cleaning up the layout I deleted the lines with the parameters.
The exact draft parameters were: --draft-max 128 --spec-ngram-size-n 48 --draft-min 2 --spec-type ngram-mod

Additionally, I had to disable reasoning (--reasoning off) to make the results of the benchmark repeatable.

I've been using this change ever since proposing it, for agentic coding with slightly different parameters: --spec-type ngram-mod --spec-ngram-size-n 32 --draft-max 64 --draft-min 32. Whenever it needs to reproduce big chunks of code, I often see 3x to 7x over baseline with context sizes over 100k tokens.
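To make the two parameter sets easier to compare, here is a small sketch that assembles them into command lines. The flags are the ones quoted in this comment, written with the double-dash `--spec-type` form used for the benchmark run. The use of `llama-server`, the `-m` option, and the `model.gguf` path are assumptions for illustration, since the thread does not show the exact server invocation.

```python
import shlex

# Draft parameters quoted for the benchmark run.
BENCH_FLAGS = [
    "--spec-type", "ngram-mod",
    "--spec-ngram-size-n", "48",
    "--draft-max", "128",
    "--draft-min", "2",
]

# Draft parameters quoted for agentic coding.
AGENTIC_FLAGS = [
    "--spec-type", "ngram-mod",
    "--spec-ngram-size-n", "32",
    "--draft-max", "64",
    "--draft-min", "32",
]


def server_command(model_path, spec_flags, extra=()):
    """Build an argv list; binary name and -m are assumed, not taken from the thread."""
    return ["llama-server", "-m", model_path, *spec_flags, *extra]


# "model.gguf" is a placeholder path. Reasoning was disabled for the benchmark.
bench_cmd = server_command("model.gguf", BENCH_FLAGS, ["--reasoning", "off"])
agentic_cmd = server_command("model.gguf", AGENTIC_FLAGS)

print(shlex.join(bench_cmd))
print(shlex.join(agentic_cmd))
```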

In my build I also added a little debug logging, and it is not uncommon to see a low acceptance streak right before a big chunk of 100% acceptance. With i_last not being reset, it wouldn't have anything to work with in those cases.
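The pattern described above, a low acceptance streak followed directly by a fully accepted chunk, can be sketched with a toy streak tracker. The class name, the 0.5 acceptance threshold, and the streak length of 3 are hypothetical values chosen for the example; the thread does not state the thresholds llama.cpp uses.

```python
class ToyAcceptanceTracker:
    """Counts consecutive low-acceptance drafts (hypothetical thresholds)."""

    def __init__(self, min_rate=0.5, max_streak=3):
        self.min_rate = min_rate      # assumed: below this a draft counts as "low"
        self.max_streak = max_streak  # assumed: this many lows in a row trigger a reset
        self.streak = 0

    def observe(self, n_drafted, n_accepted):
        """Return True when the map should be cleared and rebuilt."""
        if n_drafted == 0:
            return False
        if n_accepted / n_drafted < self.min_rate:
            self.streak += 1
        else:
            self.streak = 0
        if self.streak >= self.max_streak:
            self.streak = 0
            return True
        return False


# (drafted, accepted) per round: three poor drafts, then a fully accepted one.
rounds = [(8, 8), (8, 1), (8, 0), (8, 2), (8, 8)]
tracker = ToyAcceptanceTracker()
resets = [tracker.observe(d, a) for d, a in rounds]
print(resets)
```

Here the reset fires on the fourth round, right before the round with 100% acceptance. If the map were cleared at that point without `i_last` also being rewound, it would have no indexed context to draft that chunk from.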

Cherry-picks 4 upstream PRs to enable speculative decoding on hybrid MoE+SSM architectures (Qwen3.6-35B-A3B):
- ggml-org#19493: speculative checkpointing (save/restore recurrent state)
- ggml-org#22114: refactor "use checkpoint" logic
- ggml-org#22168: reset i_last on low acceptance streak
- ggml-org#22223: add --spec-default argument

Smoke tested on M5 Max with turbo4 KV: zero regression.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Reset i_last when low acceptance streak occurs

5dc0dea

treo requested a review from a team as a code owner April 20, 2026 13:05

treo mentioned this pull request Apr 21, 2026

server : speculative checkpointing #19493

Merged

ggerganov approved these changes Apr 21, 2026

ggerganov merged commit 72d693e into ggml-org:master Apr 21, 2026
1 check passed

chad-loder mentioned this pull request Apr 21, 2026

Sync upstream: speculative checkpointing for hybrid models TheTom/llama-cpp-turboquant#100

Closed

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026

spec : reset i_last when low acceptance streak occurs (ggml-org#22168)

0290246

treo mentioned this pull request Apr 27, 2026

ngram-mod: Reset i_last when low acceptance streak occurs ikawrakow/ik_llama.cpp#1701

Merged

4 tasks

rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026

spec : reset i_last when low acceptance streak occurs (ggml-org#22168)

6947020

samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026

spec : reset i_last when low acceptance streak occurs (ggml-org#22168)

ad97623

ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026

spec : reset i_last when low acceptance streak occurs (ggml-org#22168)

83589ef

ordokr mentioned this pull request May 13, 2026

Speculative decoding: ~130× decode regression on CUDA + turbo3 KV (RTX 5090, Qwen3.6-27B-Q6_K) despite 100% draft acceptance TheTom/llama-cpp-turboquant#143

Open

TheTom mentioned this pull request May 17, 2026

sync: upstream master b9190 + MTP/spec stack (DO NOT MERGE — tester review) TheTom/llama-cpp-turboquant#146

Merged

8 tasks

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026

spec : reset i_last when low acceptance streak occurs (ggml-org#22168)

6e9ed80

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026

spec : reset i_last when low acceptance streak occurs (ggml-org#22168)

579557d
