lookup, lookahead: fix crash when n_ctx not specified #18729
Merged
ggerganov merged 2 commits into ggml-org:master on Jan 30, 2026
Conversation
Since PR ggml-org#16653 (Dec 15, 2025), the default n_ctx is 0 to enable automatic GPU memory fitting. This causes llama-lookup and llama-lookahead to crash when run without an explicit -c flag:

    GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded")

Root cause: Both examples use params.n_ctx directly for batch initialization, but params.n_ctx remains 0 even after the context is properly initialized to n_ctx_train internally.

Bug history:
- Nov 2023: lookahead.cpp created (PR ggml-org#4207) with the params.n_ctx pattern
- Dec 2023: lookup.cpp created (PR ggml-org#4484) with the same pattern
- Nov 2024: default n_ctx changed to 4096 (PR ggml-org#10136), bug dormant
- Dec 2025: default n_ctx changed to 0 (PR ggml-org#16653), bug activated

The bug was dormant for 2+ years because params.n_ctx defaulted to 512, then 4096. PR ggml-org#16653 changed it to 0 for GPU auto-fitting, triggering the crash.

Fix: Use llama_n_ctx(ctx) to get the actual runtime context size, matching the pattern already used elsewhere in lookup.cpp (line 72) and in speculative.cpp/speculative-simple.cpp.

Tested: llama-lookup now works without the -c flag (12.5% acceptance on Gemma-3-1B).

Note: llama-lookahead has a separate pre-existing issue with sequence initialization (n_seq_max=1 vs the W+G+1 needed) that is unrelated to this fix.
pestopoppa added a commit to pestopoppa/llama.cpp that referenced this pull request on Jan 10, 2026
llama-lookahead has been broken since PR ggml-org#14482 (July 2025), which changed seq_id validation from the LLAMA_MAX_SEQ constant to the context-specific n_seq_max.

Two lookahead-specific issues:
1. n_seq_max: Lookahead needs W + G + 1 = 31 sequences for parallel Jacobi decoding, but params.n_parallel defaulted to 1. Fix: set params.n_parallel = W + G + 1 before context creation.
2. KV unified: Batch splitting with coupled sequences requires unified KV cache mode, but lookahead didn't enable it. Fix: set params.kv_unified = true.

Bug timeline:
- Nov 2023: lookahead.cpp created, worked with the LLAMA_MAX_SEQ constant
- July 2025: PR ggml-org#14482 changed to n_seq_max validation, breaking lookahead

Note: This PR depends on ggml-org#18729 for the batch init fix (params.n_ctx -> llama_n_ctx). Both PRs are needed for lookahead to fully work.

Tested with Qwen2.5-Coder-0.5B: lookahead generates output with n_accept > 0.

Bug history researched with Claude.
JohannesGaessler
approved these changes
Jan 10, 2026
Lookahead decoding requires:
- W + G + 1 = 31 sequences for parallel Jacobi decoding
- unified KV cache for coupled sequences in batch splitting

These requirements were broken after PR ggml-org#14482 changed the validation logic. Consolidates the fix from PR ggml-org#18730 per maintainer request.

Commit message drafted with Claude.
Contributor
Author
@ngxson Per your request, I've consolidated the fix from PR #18730 into this PR. New commit:
This PR now contains all lookahead-related fixes and is ready for review. Comment drafted with Claude.
ngxson
approved these changes
Jan 10, 2026
4b1tQu4ntN3k0 pushed a commit to 4b1tQu4ntN3k0/llama.cpp that referenced this pull request on Feb 2, 2026
* lookup, lookahead: fix crash when n_ctx not specified
* lookahead: fix n_seq_max and kv_unified configuration
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request on Feb 6, 2026
* lookup, lookahead: fix crash when n_ctx not specified
* lookahead: fix n_seq_max and kv_unified configuration
Summary

Fixes a crash in llama-lookup and llama-lookahead when run without an explicit -c flag:

    GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded")

Root Cause
Both examples use params.n_ctx directly for batch initialization. Since #16653 changed the default n_ctx to 0 (for GPU auto-fitting), params.n_ctx remains 0 even after the context is properly initialized. This creates a zero-sized batch that crashes on the first common_batch_add().

Bug History
This bug was dormant for 2+ years:

- Nov 2023: lookahead.cpp created (#4207) with the params.n_ctx pattern
- Dec 2023: lookup.cpp created (#4484) with the same pattern
- Nov 2024: default n_ctx changed to 4096 (#10136), bug dormant
- Dec 2025: default n_ctx changed to 0 (#16653), bug activated

The pattern was always incorrect, but only triggered when the n_ctx default became 0.

Fix
Use llama_n_ctx(ctx) to get the actual runtime context size. This matches:

- lookup.cpp:72 for max_context_size
- speculative.cpp and speculative-simple.cpp

Testing

llama-lookup now works without the -c flag (12.5% acceptance on Gemma-3-1B).
Note
llama-lookahead has a separate pre-existing issue with sequence initialization (n_seq_max=1 when it needs W+G+1) that is unrelated to this batch size fix.