[Speculative decoding] feat: add EAGLE3 speculative decoding support by ruixiang63 · Pull Request #18039 · ggml-org/llama.cpp

ruixiang63 · 2025-12-14T20:00:13Z

Important

The old PR has been backed up in this branch: https://github.com/ruixiang63/llama.cpp/tree/eagle3-v1-backup
The new commits in this PR have been rebased onto the latest master branch, refactored to use the new speculative API, cherry-picked from #22728, and made compatible with MTP.

Tip

New Eagle3 models for Gemma4 are now supported. With reasoning enabled, speedup can exceed 2x. With reasoning disabled, it can reach over 3x. See #18039 (comment)
With Q4_K_M quantization, the speedup still looks good. #18039 (comment)

As discussed in #15902, Eagle3 represents the current SOTA in speculative decoding and is widely adopted across the industry. Integrating it into llama.cpp strengthens llama.cpp's competitiveness among leading inference frameworks and, in the best cases reported below, delivers a 2–3× inference speedup.
This enhancement is the result of close collaboration between the NVIDIA and GGML teams, showcasing a strong technical partnership.

The following provides a brief overview of this PR:

EAGLE3 is an encoder-decoder based speculative decoding method:

Extracts features from target model at specific layers
Uses feature fusion layer to compress target features
Generates draft tokens with single-layer decoder
Maps draft vocabulary to target vocabulary via d2t tensor
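
The steps above feed a standard draft-then-verify loop. The following is a toy sketch of greedy verification, not the code in this PR: `draft_next` and `target_next` are hypothetical stand-ins for the EAGLE3 decoder and the target model, and a real engine verifies all drafts in one batched target pass rather than a Python loop.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # hypothetical: context -> next token id


def speculative_step(ctx: List[int], draft_next: NextToken,
                     target_next: NextToken, n_draft: int) -> List[int]:
    """Run one draft/verify cycle and return the tokens to append to ctx."""
    # 1. The draft model proposes n_draft tokens autoregressively.
    drafts: List[int] = []
    for _ in range(n_draft):
        drafts.append(draft_next(ctx + drafts))

    # 2. The target model checks each position (a real engine does this in
    #    a single batched forward pass; here it is a plain loop).
    accepted: List[int] = []
    for tok in drafts:
        expected = target_next(ctx + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. All drafts accepted: the target contributes one bonus token.
    accepted.append(target_next(ctx + accepted))
    return accepted


# Toy models: the target counts upward; the draft agrees except after token 3.
target = lambda c: c[-1] + 1
draft = lambda c: c[-1] + 1 if c[-1] != 3 else 0

print(speculative_step([0], draft, target, n_draft=8))  # [1, 2, 3, 4]
```

In this toy run three drafts are accepted and the fourth is replaced by the target's own token, so one cycle yields four tokens for a single verification pass.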

Key changes:

Add LLM_ARCH_EAGLE3 architecture
Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
Add feature extraction from target model layers
Add g_embeddings handling for decoder input
Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
Add --eagle3 flag for speculative-simple example
Add EAGLE3 model conversion in convert_hf_to_gguf.py

EAGLE3 Architecture Overview:

┌─────────────────────────────────────────────────────────────────┐
│                    EAGLE3 Overview                              │
└─────────────────────────────────────────────────────────────────┘

  Target Model          EAGLE3 Encoder         EAGLE3 Decoder
  (LLaMA 8B)              (FC Layer)           (1-layer Transformer)
       │                      │                       │
       │                      │                       │
       ▼                      ▼                       ▼
┌─────────────┐        ┌─────────────┐        ┌─────────────────┐
│  Generate   │        │  Compress   │        │  Generate Draft │
│  Features   │───────►│  Features   │───────►│  Tokens Fast    │
│  [12288]    │        │  [4096]     │        │  [k tokens]     │
└─────────────┘        └─────────────┘        └────────┬────────┘
                                                       │
                                                       ▼
                                              ┌─────────────────┐
                                              │  Verify Drafts  │
                                              │  with Target    │
                                              └─────────────────┘
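
The "Compress Features" box can be illustrated with a minimal sketch, assuming the encoder is a single bias-free linear layer applied to the concatenation of three target-layer hidden states (3 × 4096 = 12288 inputs, 4096 outputs in the diagram). The function name and the toy sizes are hypothetical, and any normalization in the real graph is omitted.

```python
from typing import List

Vector = List[float]


def fuse_features(low: Vector, mid: Vector, high: Vector,
                  fc_weight: List[Vector]) -> Vector:
    """Concatenate three target-layer features and project them down."""
    concat = low + mid + high            # shape [3 * n_embd]
    assert all(len(row) == len(concat) for row in fc_weight)
    # One output element per weight row: shape [n_embd]
    return [sum(w * x for w, x in zip(row, concat)) for row in fc_weight]


# Toy sizes: n_embd = 2 instead of 4096, so the concat has 6 elements, not 12288.
low, mid, high = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
fc_weight = [
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],  # sums the first element of each layer
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],  # sums the second element of each layer
]
print(fuse_features(low, mid, high, fc_weight))  # [9.0, 12.0]
```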

How to run EAGLE3 in llama.cpp

Requirements

This PR currently supports the following EAGLE3 models:

RedHatAI/gemma-4-31B-it-speculator.eagle3
RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3
yuhuili/EAGLE3-LLaMA3.1-Instruct-8B
yuhuili/EAGLE3-LLaMA3.3-Instruct-70B
Tengyunw/qwen3_8b_eagle3
Tengyunw/qwen3_30b_moe_eagle3
AngelSlim/Qwen3-8B_eagle3
AngelSlim/Qwen3-14B_eagle3
AngelSlim/Qwen3-32B_eagle3
AngelSlim/Qwen3-30B-A3B_eagle3
lmsys/EAGLE3-gpt-oss-120b-bf16 (with performance issue due to its MoE arch)
nvidia/gpt-oss-120b-Eagle3-long-context (with performance issue due to its MoE arch)
RedHatAI/Qwen3-8B-speculator.eagle3
RedHatAI/gpt-oss-20b-speculator.eagle3 (with performance issue due to its MoE arch)
RedHatAI/Qwen3-30B-A3B-Instruct-2507-speculator.eagle3

The following eagle3 models should also work out of the box, though they haven’t been tested yet:

Step 1: Convert Models to GGUF Format

Convert Target Model

TARGET_MODEL_HF="${MODELS_DIR}/Meta-Llama-3.1-8B-Instruct"
TARGET_MODEL_GGUF="${MODELS_DIR}/Meta-Llama-3.1-8B-Instruct_bf16.gguf"

python convert_hf_to_gguf.py \
    "${TARGET_MODEL_HF}" \
    --outtype bf16 \
    --outfile "${TARGET_MODEL_GGUF}"

Convert EAGLE3 Draft Model

TARGET_MODEL_HF="${MODELS_DIR}/Meta-Llama-3.1-8B-Instruct"
EAGLE3_MODEL_HF="${MODELS_DIR}/EAGLE3-LLaMA3.1-Instruct-8B"
EAGLE3_MODEL_GGUF="${MODELS_DIR}/EAGLE3-LLaMA3.1-Instruct-8B_fp16.gguf"

python convert_hf_to_gguf.py \
    "${EAGLE3_MODEL_HF}" \
    --outtype f16 \
    --target-model-dir "${TARGET_MODEL_HF}" \
    --outfile "${EAGLE3_MODEL_GGUF}"

Step 2: Compile llama.cpp

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

[Optional] Step 3: Quantize the GGUF model

./build/bin/llama-quantize \
  ${TARGET_MODEL_GGUF} \
  ${TARGET_MODEL_GGUF}_Q4_K_M.gguf \
  Q4_K_M
 
./build/bin/llama-quantize \
  ${EAGLE3_MODEL_GGUF} \
  ${EAGLE3_MODEL_GGUF}_Q4_K_M.gguf \
  Q4_K_M

Step 4: Run EAGLE3 Speculative Decoding

./build/bin/llama-server \
    -m  Qwen3-8B.gguf \
    -md qwen3_8b_eagle3.gguf \
    --spec-type draft-eagle3 \
    --spec-draft-n-max 8 \
    --spec-draft-p-min 0.5 \
    -np 1 \
    -c 4096 --port 8080 -ngl 99 -fa on \
    --jinja --fit off

curl -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Write a quicksort algorithm in Python. Write code only."}
        ],
        "max_tokens": 256,
        "temperature": 0
    }'

curl -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Explain the Pythagorean theorem"}
        ],
        "max_tokens": 256,
        "temperature": 0
    }'

Performance Evaluation (RTX A6000 48GB)

Tip

Using the chat_template for each model version can improve acceptance rates. Always apply the model’s corresponding chat_template when constructing prompts.

Note

After refactoring, the performance data below may differ from current results, especially since llama-server now supports Eagle3 as well. However, the data is still useful for getting a general sense of the speedup Eagle3 provides.

LLaMA3.1-Instruct-8B with BF16, its Eagle3 with FP16

Prompt	Baseline (llama-cli)	EAGLE3 (draft_size=8)	Accept Rate	Speedup
Write a quicksort algorithm in Python. Write code only.	44.5 t/s	146.2 t/s	80.6%	3.28x
Explain the Pythagorean theorem	44.5 t/s	127.1 t/s	77.4%	2.85x
Plan a 1 day trip to DC	44.5 t/s	113.8 t/s	80.9%	2.55x

LLaMA3.1-Instruct-8B with Q4_K_M, its Eagle3 with Q4_K_M

Prompt	Baseline (llama-cli)	EAGLE3 (draft_size=8)	Accept Rate	Speedup
Write a quicksort algorithm in Python. Write code only.	121.5 t/s	274.4 t/s	92.5%	2.26x
Explain the Pythagorean theorem	121.4 t/s	238.9 t/s	79.4%	1.97x
Plan a 1 day trip to DC	121.4 t/s	196.5 t/s	77.2%	1.62x

LLaMA3.3-Instruct-70B with Q4_K_M, its Eagle3 with Q4_K_M

Prompt	Baseline (llama-cli)	EAGLE3 (draft_size=8)	Accept Rate	Speedup
Write a quicksort algorithm in Python. Write code only.	15.6 t/s	33.4 t/s	73.6%	2.14x
Explain the Pythagorean theorem	15.6 t/s	37.6 t/s	82.0%	2.41x
Plan a 1 day trip to DC	15.6 t/s	28.8 t/s	69.3%	1.85x

Qwen3-8B with BF16, its Eagle3 with BF16

Prompt	Baseline (llama-cli)	EAGLE3 (draft_size=8)	Accept Rate	Speedup
Write a quicksort algorithm in Python. Write code only.	43.6 t/s	94.8 t/s	69.8%	2.17x
Explain the Pythagorean theorem	43.6 t/s	86.8 t/s	68.3%	1.99x
Plan a 1 day trip to DC	43.6 t/s	70.7 t/s	57.3%	1.62x

Qwen3-14B with BF16, its Eagle3 with BF16

Prompt	Baseline (llama-cli)	EAGLE3 (draft_size=8)	Accept Rate	Speedup
Write a quicksort algorithm in Python. Write code only.	24.4 t/s	35.7 t/s	40.4%	1.46x
Explain the Pythagorean theorem	24.4 t/s	34.5 t/s	41.3%	1.41x
Plan a 1 day trip to DC	24.3 t/s	30.5 t/s	28.0%	1.26x

Qwen3-32B with Q4_K_M, its Eagle3 with Q4_K_M

Prompt	Baseline (llama-cli)	EAGLE3 (draft_size=8)	Accept Rate	Speedup
Write a quicksort algorithm in Python. Write code only.	32.0 t/s	39.7 t/s	39.7%	1.24x
Explain the Pythagorean theorem	32.0 t/s	41.5 t/s	43.3%	1.30x
Plan a 1 day trip to DC	32.0 t/s	37.1 t/s	32.6%	1.16x

Qwen3-30B-A3B with BF16, its Eagle3 with BF16 (tested on NVIDIA DGX Spark 128GB, speedup might be better on other hardware)

Prompt	Baseline (llama-cli)	EAGLE3 (draft_size=8)	Accept Rate	Speedup
Write a quicksort algorithm in Python. Write code only.	31.1 t/s	43.3 t/s	64.4%	1.39x
Explain the Pythagorean theorem	31.2 t/s	41.2 t/s	60.6%	1.32x
Plan a 1 day trip to DC	30.9 t/s	38.6 t/s	58.8%	1.25x

GPT-OSS-20B with BF16, its Eagle3 with BF16 (tested on NVIDIA DGX Spark 128GB, similar performance issue as GPT-OSS-120B Eagle3)

Prompt	Baseline (llama-cli)	EAGLE3 (draft_size=8)	Accept Rate	Speedup
Write a quicksort algorithm in Python. Write code only.	61.3 t/s	65.05 t/s	74.25%	1.06x
Explain the Pythagorean theorem	61.2 t/s	58.13 t/s	69.23%	0.95x
Plan a 1 day trip to DC	61.4 t/s	54.50 t/s	62.96%	0.89x

Details of GGML backend modifications (Fixed, no longer needed)

~~In the Eagle3 decoder, two parallel inputs are processed:~~

input_embeds ──→ RMS_NORM ──┐
                            ├──→ CONCAT ──→ Transformer Decoder
g_embeddings ──→ RMS_NORM ──┘

~~When both RMS_NORM operations run in the same GPU split, a lack of synchronization causes buffer contention and race conditions (CPU execution is fine as it auto‑syncs between subgraphs).~~

~~Solution:~~
~~Use ggml_set_sync() to add a synchronization point after the first RMS_NORM, forcing the scheduler to create a split boundary and synchronize before continuing.~~

input_embeds ──→ RMS_NORM ──→ [SYNC] ──┐
                                       ├──→ CONCAT ──→ Transformer Decoder
g_embeddings ─────────────→ RMS_NORM ──┘
         (split 1)            |         (split 2)
                           barrier

~~This ensures correct execution and can be applied to any parallel path that needs synchronization, not just Eagle3.~~

Example results

Prompt: "Write a quicksort algorithm in Python. Write code only."

Prompt: "Explain the Pythagorean theorem"

Prompt: "Plan a 1 day trip to DC"

Future Steps

Support more Eagle3 models (Qwen, GPT-OSS, and Llama are currently supported)
~~Currently, Eagle3 is integrated only in llama-speculative-simple, support may need to be extended to other APIs if possible~~ It now supports llama-server
Support context-dependent tree sampling (tree attention) as described in the Eagle3 paper to improve accept rate
~~Support batch processing (batch size > 1) with Eagle3 speculative decoding~~

ngxson · 2025-12-15T16:23:42Z

Judging by the description of this PR, I believe many models with multiple-token prediction also have the same strategy of reusing hidden features from the main model.

It can be quite interesting to generalize this feature to support other models. I would expect some kind of sub-llama_context that allows both the main and draft models to share the same cgraph, avoiding the need to explicitly pass the intermediate embeddings through host memory.

ggerganov · 2025-12-15T18:39:27Z

> It can be quite interesting to generalize this feature to support other models.

I will definitely be looking at refactoring the implementation to become more generic before merging it. The initial results in terms of performance are really great, but we'll need to work on cleaning up the code and reducing the special-casing in several places. I'll try to provide insights on how to do that in the next few days.

ruixiang63 · 2025-12-16T17:07:40Z

> > It can be quite interesting to generalize this feature to support other models.
>
> I will definitely be looking at refactoring the implementation to become more generic before merging it. The initial results in terms of performance are really great, but we'll need to work on cleaning up the code and reducing the special-casing in several places. I'll try to provide insights on how to do that in the next few days.

Thanks @ggerganov @ngxson for your inputs. Definitely, looking forward to hearing your feedback and improving this PR.

ggerganov · 2025-12-17T14:06:13Z

+        // EAGLE3: Extract intermediate layer features from target model at layer INPUT
+        if (eagle3 && cparams.eagle3_extract_enabled && !eagle3->extract_layer_indices.empty()) {
+            static const char * eagle3_extract_names[] = {"eagle3_extract_0", "eagle3_extract_1", "eagle3_extract_2"};
+            for (size_t i = 0; i < eagle3->extract_layer_indices.size() && i < 3; ++i) {
+                if (eagle3->extract_layer_indices[i] == il) {
+                    cb(inpL, eagle3_extract_names[i], il);
+                    break;
+                }
+            }
+        }


I will next look to remove this ad hoc logic and generalize it in some way, likely by passing the extraction points in a more generic way during llama_context creation. TBD

ggerganov · 2025-12-17T14:10:25Z

+
+    // EAGLE3 draft model - target model hidden size
+    uint32_t eagle3_target_hidden_size = 0;
+


This can become more generic by renaming it to n_embd_enc and utilizing the n_embd_inp() call.

ggerganov · 2025-12-17T14:13:46Z

+    // Get pointer to target model features extracted for EAGLE3 encoder
+    // Returns NULL if no features are available
+    // Format: [3*n_embd, n_tokens] - use model.hparams.n_embd and batch.n_tokens for dimensions
+    LLAMA_API const float * llama_get_eagle3_target_features(struct llama_context * ctx);


This call should become more generic and not Eagle3 specific. Will be looking how to achieve this in the best way.

ggerganov · 2025-12-17T14:18:51Z

+    // Set g_embeddings from EAGLE3 encoder output for decoder input
+    // g_embd: pointer to encoder output embeddings
+    LLAMA_API void llama_set_eagle3_g_embeddings(
+            struct llama_context * ctx,
+                   const float * g_embd,
+                       int32_t   n_embd,
+                       int32_t   n_tokens);
+


Might be possible to avoid this API if we combine the Eagle encoder and decoder in a single context. TBD

When combining the Eagle3 encoder and decoder into a single context, note that the Eagle3 encoder is used only to fuse the extracted features from the target model, i.e. it is invoked as many times as the target model itself. The Eagle3 decoder, on the other hand, is solely responsible for generating draft tokens in an autoregressive way.
llama_set_eagle3_g_embeddings() sets g_embeddings both from the Eagle3 encoder (used in the first generation step of the Eagle3 decoder) and from the Eagle3 decoder itself (used in subsequent generation steps).

Yup, I noticed this interaction. We don't have a previous use case similar to this, but I think the enc-dec context could be adapted accordingly.

pwilkin · 2026-01-06T01:11:39Z

Bumping, is there any progress on this? It's probably one of the more coveted features to have right now.

ggerganov · 2026-01-06T06:53:20Z

> Bumping, is there any progress on this?

I'm currently side-tracked by some graph reallocation optimizations. Will probably come back to this after that.

ruixiang63 · 2026-01-09T12:24:04Z

Eagle3 checkpoints for the Qwen3 series (including both dense and MoE models) are now supported, see the updated PR description for details.
Although these Eagle3 checkpoints are from third parties, they can still deliver a 1–2× speedup.
Speculative decoding performance for MoE models is not as good as dense models, which is expected, since more experts are invoked during the parallel verification phase than during the target model’s decoding phase.

ruixiang63 · 2026-01-09T18:38:30Z

One question: it seems that CUDA Graph is disabled when the input n_tokens > 1. During the target model verification stage of speculative decoding, CUDA Graph is always disabled for the target model, since it’s only used for verification with multiple draft tokens > 1. However, we can fix the number of draft tokens (e.g., by using padding) to make it constant and thus enable CUDA Graph (may need to remove n_tokens > 1 constraint)? @ggerganov

Context: I’m testing GPT-OSS-120B Eagle3 with llama.cpp, and I found that even with Eagle3 (accept rate 86%), the performance is worse than the naive llama-cli. After profiling, I discovered that CUDA Graph is consistently disabled for the target model during speculative decoding, whereas it remains enabled in llama-cli. As a result, the target model’s verification (prefilling) phase is more than 5× slower than a normal autoregressive decoding step.
After disabling CUDA graphs for llama-cli using GGML_CUDA_DISABLE_GRAPHS=1, the eagle3 achieved roughly a 1.5× speedup.

I’ve only observed this performance issue with GPT-OSS-120B Eagle3. For other models, even without CUDA Graph enabled for target model in Eagle3 speculative decoding, the performance remains great.

ggerganov · 2026-01-12T07:32:19Z

> Speculative decoding performance for MoE models is not as good as dense models, which is expected, since more experts are invoked during the parallel verification phase than during the target model’s decoding phase.

I think the small-batch mul_mat_id could be improved in the CUDA backend. AFAIR the performance there for batch sizes (1, 8] is not optimal atm. Needs double-checking.

> However, we can fix the number of draft tokens (e.g., by using padding) to make it constant and thus enable CUDA Graph (may need to remove n_tokens > 1 constraint)? @ggerganov

Possibly, but to me this sounds like second-order optimization. Optimizing the mul_mat_id for small batches should bring more generic benefits and would likely have larger impact for speculative decoding compared to enabling CUDA graphs.

> After disabling CUDA graphs for llama-cli using GGML_CUDA_DISABLE_GRAPHS=1, the eagle3 achieved roughly a 1.5× speedup.

Hm, this is a bit surprising observation. Can you run a llama-batched-bench test on your system with and without CUDA graphs using the commands from #18308 (comment) and share the results. We are interested in batch sizes [1, 4]. So something like this:

llama-batched-bench -m [gpt-oss-120b] -c 65536 -b 2048 -ub 512 -npp 1024 -ntg 32 -npl 1,2,3,4,5,6,7,8

ruixiang63 · 2026-01-12T19:48:46Z

Thanks very much for your inputs! @ggerganov

> > After disabling CUDA graphs for llama-cli using GGML_CUDA_DISABLE_GRAPHS=1, the eagle3 achieved roughly a 1.5× speedup.
>
> Hm, this is a bit surprising observation. Can you run a llama-batched-bench test on your system with and without CUDA graphs using the commands from #18308 (comment) and share the results. We are interested in batch sizes [1, 4]. So something like this:

I double-checked the run today. The previous statement about CUDA graphs was incorrect due to instability and concurrent CPU activity in my test environment, sorry about that! Currently, enabling or disabling CUDA graphs doesn’t have much impact in llama-cli for the GPT-OSS-120B model. (I am testing on DGX Spark.)

with cuda graph enabled: [ Prompt: 120.1 t/s | Generation: 47.2 t/s ]
without cuda graph enabled: [ Prompt: 119.2 t/s | Generation: 45.7 t/s ]

Also, the results for llama-batched-bench:

with cuda graph enabled

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
1024	256	1	1280	1.227	834.89	5.255	48.71	6.482	197.47
1024	256	2	2560	1.579	1296.74	9.277	55.19	10.856	235.81
1024	256	3	3840	2.284	1344.72	10.447	73.51	12.731	301.61
1024	256	4	5120	3.031	1351.58	11.550	88.66	14.580	351.16
1024	256	5	6400	3.780	1354.59	12.433	102.96	16.212	394.76
1024	256	6	7680	4.528	1356.95	13.347	115.08	17.874	429.66
1024	256	7	8960	5.304	1351.48	13.982	128.16	19.286	464.59
1024	256	8	10240	6.018	1361.20	14.704	139.28	20.722	494.16

without cuda graph enabled

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
1024	256	1	1280	1.279	800.61	5.758	44.46	7.037	181.90
1024	256	2	2560	1.597	1282.22	9.297	55.07	10.895	234.98
1024	256	3	3840	2.286	1343.84	10.383	73.97	12.669	303.11
1024	256	4	5120	3.031	1351.51	11.547	88.68	14.577	351.23
1024	256	5	6400	3.771	1357.64	12.438	102.91	16.209	394.84
1024	256	6	7680	4.525	1357.71	13.342	115.12	17.868	429.83
1024	256	7	8960	5.289	1355.17	13.986	128.13	19.275	464.84
1024	256	8	10240	5.999	1365.48	14.653	139.77	20.652	495.83
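
The first table above can also be read as a per-step cost. A rough sketch, assuming S_TG is the total generation throughput over all B sequences (so one decode step takes B / S_TG seconds) and that a batch of B parallel sequences is an acceptable proxy for verifying B draft tokens of one sequence, which it only approximately is:

```python
# S_TG (tokens/s) per batch size B, from the "with cuda graph enabled" table above.
s_tg = {1: 48.71, 2: 55.19, 3: 73.51, 4: 88.66,
        5: 102.96, 6: 115.08, 7: 128.16, 8: 139.28}


def relative_step_cost(batch: int) -> float:
    """Time of one decode step at this batch size, in units of a B=1 step."""
    step_time = batch / s_tg[batch]      # seconds per step (B tokens per step)
    base_time = 1 / s_tg[1]
    return step_time / base_time


for b in sorted(s_tg):
    print(f"B={b}: {relative_step_cost(b):.2f}x the cost of a single-token step")
```

Under this reading, a step that processes 8 tokens costs roughly 2.8 single-token steps on this system rather than about 1.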

> Possibly, but to me this sounds like second-order optimization. Optimizing the mul_mat_id for small batches should bring more generic benefits and would likely have larger impact for speculative decoding compared to enabling CUDA graphs.

I agree. CUDA graphs could be a second-order optimization.
Here are the eagle3 GPT-OSS-120B test results on DGX Spark (I will also test this on other hardware):

Prompt	Baseline (llama-cli)	EAGLE3 (draft_size=8)	Accept Rate	Speedup
Write a quicksort algorithm in Python. Write code only.	48.3 t/s	52.2 t/s	85.0%	1.08x
Explain the Pythagorean theorem	47.8 t/s	46.5 t/s	74.0%	0.97x
Plan a 1 day trip to DC	48.4 t/s	40.0 t/s	55.7%	0.83x

For MoE models, prefilling becomes the main performance bottleneck because more active experts are involved. As a result, the assumption that “processing multiple draft tokens concurrently is as fast as processing a single token” no longer holds, which is an important condition for effective speculative decoding. I also saw that as the draft token length increases, the verification cost of the target model also rises.
This explains the results shown in the table above: in some cases, Eagle3 can even degrade performance. To observe improvements, the accept rate must exceed a certain lower bound.
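
The lower bound on the accept rate can be made concrete with a simplified cost model. This is an illustrative assumption, not something measured in this PR: every cycle drafts `n_draft` tokens, each draft token costs `draft_cost` target decode steps, the batched verification costs `verify_cost` target decode steps, and a cycle yields `accept_rate * n_draft + 1` tokens. All names and numbers below are hypothetical.

```python
def modeled_speedup(accept_rate: float, n_draft: int,
                    draft_cost: float, verify_cost: float) -> float:
    """Tokens per cycle divided by cycle cost (in target decode steps)."""
    tokens_per_cycle = accept_rate * n_draft + 1.0   # accepted drafts + 1 target token
    cycle_cost = n_draft * draft_cost + verify_cost
    return tokens_per_cycle / cycle_cost


def break_even_accept_rate(n_draft: int, draft_cost: float,
                           verify_cost: float) -> float:
    """Accept rate at which modeled_speedup() equals exactly 1.0."""
    return (n_draft * draft_cost + verify_cost - 1.0) / n_draft


# Hypothetical costs, chosen only for illustration.
dense = break_even_accept_rate(n_draft=8, draft_cost=0.05, verify_cost=1.4)
moe = break_even_accept_rate(n_draft=8, draft_cost=0.05, verify_cost=3.0)
print(f"dense-like break-even: {dense:.2f}, MoE-like break-even: {moe:.2f}")
```

In this model, a higher verification cost raises the accept rate needed just to break even, which matches the qualitative observation above for MoE targets.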

Do you have a rough idea of how much performance gain we can get by improving mul_mat_id?

ggerganov · 2026-01-12T20:13:40Z

The llama-batched-bench results are actually better than I expected. In the previously reported numbers there was a sharp dip at BS = 2. Here the TG performance steadily increases with the batch size, which is good, though it is not as linear as we want it to be.

I suppose the explanation is that for MoE models, at low batch sizes the amount of data we need to read from the weights for each batch increases linearly with the batch size (i.e. each extra token in the batch activates more experts and at small batch size the experts for each token are very likely different from each other). So it's probably normal that TG for MoE does not scale as well as TG for dense models as a function of the batch size.
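
The effect described above can be estimated with a back-of-the-envelope formula, assuming each token independently picks `top_k` distinct experts uniformly at random (real routers are not uniform, so treat this as a sketch). The expert counts used below are an assumed configuration, not taken from this thread.

```python
def expected_active_experts(n_experts: int, top_k: int, batch: int) -> float:
    """Expected number of distinct experts hit by `batch` tokens in one layer."""
    p_unused = (1 - top_k / n_experts) ** batch   # one given expert stays idle
    return n_experts * (1 - p_unused)


# Assumed routing config (128 experts, 4 active per token); adjust to the real model.
for b in (1, 2, 4, 8):
    print(b, round(expected_active_experts(128, 4, b), 1))
```

With so many experts relative to the batch size, the expected count grows almost linearly in the batch size, so the expert weights read per step grow with it.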

> As a result, the assumption that “processing multiple draft tokens concurrently is as fast as processing a single token” no longer holds, which is an important condition for effective speculative decoding.

Yeah, that's my guess as well. Do we have some references to cross-check this? Do the Eagle3 authors discuss its performance for MoE models? Do we have sample numbers for gpt-oss-120b with Eagle3 using vLLM or TensorRT-LLM?

> Do you have a rough idea of how much performance gain we can get by improving mul_mat_id?

Hm, not sure. Thinking about it now, I feel like mul_mat_id is unlikely to scale well enough due to the increasing data for each new token.

arch-btw · 2026-01-14T00:18:29Z

> The following eagle3 models should also work out of the box, though they haven’t been tested yet:
> Qwen3-235B-A22B-EAGLE3

I tested the Baichuan-M3-235B model that was released yesterday (draft here). It's a finetune of the Qwen3 model above. It quantized successfully but failed due to having a different tensor shape (even in the original weights):

load_tensors: EAGLE3 using d2t mapping (draft_vocab_size = 32000)
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  8192,  8192, got  8192, 16384,     1,     1
llama_model_load_from_file_impl: failed to load model
failed to load EAGLE3 draft model

I haven't looked into how often this happens in finetunes of the same model, especially in the context of eagle3.

However, the shapes of the tensors changing might be something to account for in the implementation (in this case Qwen3). Unless those will be treated as completely new models, in which case please disregard this comment.

ruixiang63 · 2026-01-14T16:40:59Z

> > The following eagle3 models should also work out of the box, though they haven’t been tested yet:
> > Qwen3-235B-A22B-EAGLE3
>
> I tested the Baichuan-M3-235B model that was released yesterday (draft here). It's a finetune of the Qwen3 model above. It quantized successfully but failed due to having a different tensor shape (even in the original weights):
>
> load_tensors: EAGLE3 using d2t mapping (draft_vocab_size = 32000)
> llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  8192,  8192, got  8192, 16384,     1,     1
> llama_model_load_from_file_impl: failed to load model
> failed to load EAGLE3 draft model
>
> I haven't looked into how often this happens in finetunes of the same model, especially in the context of eagle3.
>
> However, the shapes of the tensors changing might be something to account for in the implementation (in this case Qwen3). Unless those will be treated as completely new models, in which case please disregard this comment.

I spent some time analyzing the Baichuan-EAGLE3 draft model. It has a slightly different architecture compared to the standard Qwen3-EAGLE3 model.
The main difference is in the self_attn.q_proj.weight tensor shape:

Standard Qwen3-EAGLE3 : [8192, 8192] — outputs Q only
Baichuan-EAGLE3: [16384, 8192] — outputs Q + Gate (2× the size)

This is because Baichuan-EAGLE3 uses an Attention Output Gate mechanism, which is not present in the standard EAGLE3 model. In this variant:

The Q projection outputs both query vectors and gate vectors
After attention computation, the output is element-wise multiplied by sigmoid(gate) before the output projection

This is essentially a variant architecture of EAGLE3, not just a tensor shape difference. Supporting this variant would require:

Detecting the gate mechanism during model loading
Modifying the graph construction to split Q/Gate and apply the gating after attention
Adding the ggml_sigmoid operation in the attention path
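
The gating described above can be sketched as follows. This is only an illustration of the idea: the split layout is an assumption (first half Q, second half gate; the actual checkpoint may interleave them per head), and both function names are hypothetical.

```python
import math
from typing import List, Tuple


def split_q_and_gate(q_proj_out: List[float]) -> Tuple[List[float], List[float]]:
    """Split a doubled Q projection into (query, gate).

    Assumption: the first half is Q and the second half is the gate. The real
    checkpoint may interleave them per head instead.
    """
    half = len(q_proj_out) // 2
    return q_proj_out[:half], q_proj_out[half:]


def apply_output_gate(attn_out: List[float], gate: List[float]) -> List[float]:
    """Multiply the attention output element-wise by sigmoid(gate)."""
    return [a / (1.0 + math.exp(-g)) for a, g in zip(attn_out, gate)]


q, gate = split_q_and_gate([0.1, 0.2, 0.0, 100.0])
attn_out = [2.0, 4.0]  # stand-in for softmax(QK^T)V computed from q
print(apply_output_gate(attn_out, gate))  # [1.0, 4.0]
```

A gate of 0 halves the corresponding output element, while a large positive gate passes it through unchanged; the gated result would then go into the output projection.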

I would suggest we focus this PR on the standard EAGLE3 model first. Once merged, we can consider adding support for this gated variant in a follow-up PR.

Have you tested the standard Qwen3-EAGLE3 model as well? Does it work well with the current implementation? If yes, could you please share the t/s and speedup you got with eagle3? @arch-btw

ngxson · 2026-01-14T16:50:01Z

Since EAGLE3 can vary quite a lot for each model, maybe a better way is to consider it as an adapter (the same logic as lora adapter), instead of a dedicated arch?

That way, it can hook into existing models more easily, making internal data like KV state, gate, etc, accessible to the draft model.

ruixiang63 · 2026-01-14T16:54:43Z

> Since EAGLE3 can vary quite a lot for each model, maybe a better way is to consider it as an adapter (the same logic as lora adapter), instead of a dedicated arch?
>
> That way, it can hook into existing models more easily, making internal data like KV state, gate, etc, accessible to the draft model.

Good point. However, Eagle3 doesn’t vary much across models. So far, except for Baichuan-Eagle3, all other models essentially use the same Eagle3 architecture. Please refer to the supported models listed in the PR description. I’d say the majority of models share the same Eagle3 architecture, with only a few exceptions. This standalone Eagle3 architecture strategy is also adopted in TensorRT-LLM, vLLM, and SGLang.

ngxson · 2026-01-14T17:24:41Z

I doubt that. In theory, nothing prevents them or another team from making a variant of eagle3 that gets the state of more than 3 layers, or even reuses the KV state from earlier layers. Possibilities are endless, and that's why it's important to think about the bigger picture instead of just trying to make it work with one single existing architecture.

I think a more model-agnostic approach via an adapter API (or another API based on that form) will ultimately be the way to go. It will allow computing both the next token and the draft tokens in one pass, allowing even higher performance than this approach.

ruixiang63 · 2026-01-14T19:33:14Z

> I doubt that. In theory, nothing prevents them or another team from making a variant of eagle3 that gets the state of more than 3 layers, or even reuses the KV state from earlier layers. Possibilities are endless, and that's why it's important to think about the bigger picture instead of just trying to make it work with one single existing architecture.

Could you please share some examples or real-world use cases of this? I’d like to better understand how such an approach might be applied in practice.

ngxson · 2026-01-14T21:42:53Z

The main problem with this PR and #15225 is that both assume MTP (multi-token prediction) works this way:

main LLM generates first tokens + hidden_state from a list of selected layers
hidden_state is then forwarded to the speculative model to generate N next tokens

(Note: the dashed line indicates that this may not be the case for all models; some only use the last hidden state)

While it does work for the moment, this approach doesn't address the true nature of MTP models. In other words, it is not truly model-agnostic. The main drawback is that you must manually pass the embeddings between the two models, so you must know where to get the embeddings, their shapes, etc.

Instead, we should look at MTP models as a normal LLM with multiple output heads:

From this point of view, it doesn't matter how the mtp_head is implemented. From the outside world, the model will just output N next tokens given one input token.

In practice, the mtp_head(s) can be:

A single encoder model in the eagle3 case
N extra layers in the GLM/DeepSeek case, where layer N+1 uses the output of layer N (serial setup)
N extra layers in the Xiaomi Mimo case, where all layers use the same input embeddings (parallel setup) - see llama : add Xiaomi Mimo (with proper MTP - multi token predict) #13236 for more info

Now, returning to your question:

> Could you please share some examples or real-world use cases of this? I’d like to better understand how such an approach might be applied in practice.

If you already get the idea above, then consider gemma3n: the model has 30 layers, but only 20 layers have KV projections. The last 10 layers reuse the KV from the 20th layer. Some other models also implement this idea, notably GLM and bailing.

The same idea can be applied to MTP layers. Future models may have MTP layers that reuse not just the layer's output hidden state, but also the projected KV inside the layer. While no models in the wild currently do that, Baichuan-EAGLE3 (as you have shown) is already heading somewhat in this direction by exposing both Q and gate to the MTP model.

ngxson · 2026-01-14T21:52:15Z

(I have to split up my comment otherwise it's too long)

My proposal is that we must design this function + the API in a way that it is flexible enough for future models.

For EAGLE3, the MTP model is technically a mtp_head shipped as an extension to the main model (note that the eagle3 repo only contains the extra tensors, but does not contain the main LLM), it can be viewed as an adapter, much like how LoRA works.

For the API, we must avoid leaking the information about the implementation under the hood. The downstream code must only know about how many tokens can be generated, they don't need to know how to generate these extra tokens.

So, a set of APIs like the following should be enough:

llama_model_load_mtp: load the MTP head as a llama_adapter_lora, or maybe we can add a new struct for it
llama_mtp_set_n_draft: set the max number of draft tokens to be generated in the next llama_decode; set to 0 for verification pass
llama_mtp_get_n_draft_max: get max number of draft tokens that the MTP head can generate
llama_mtp_get_logits_ith: get logits for the i-th token in the batch; returns an array of float with size n_vocab*n_draft

All the info about embeddings and the draft model must be kept private.
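To illustrate what a caller would do with the buffer returned by the proposed llama_mtp_get_logits_ith, here is a small sketch that splits a flat array of size n_vocab*n_draft into one row per draft position and picks a greedy token from each. The draft-major memory layout and the helper names `split_draft_logits` and `greedy_draft_tokens` are assumptions made for illustration; the proposal above does not specify them.

```python
def split_draft_logits(flat_logits: list[float], n_vocab: int, n_draft: int) -> list[list[float]]:
    """Split a flat buffer into n_draft rows of n_vocab logits (assumed draft-major layout)."""
    if len(flat_logits) != n_vocab * n_draft:
        raise ValueError("buffer size does not match n_vocab * n_draft")
    return [flat_logits[d * n_vocab:(d + 1) * n_vocab] for d in range(n_draft)]


def greedy_draft_tokens(flat_logits: list[float], n_vocab: int, n_draft: int) -> list[int]:
    """Pick the highest-logit token id at each draft position."""
    rows = split_draft_logits(flat_logits, n_vocab, n_draft)
    return [max(range(n_vocab), key=row.__getitem__) for row in rows]


# Toy buffer: 2 draft positions over a 3-token vocabulary.
example_tokens = greedy_draft_tokens([0.1, 0.9, 0.0, 0.7, 0.2, 0.1], n_vocab=3, n_draft=2)
```

The caller never touches embeddings or the draft model here, which is the point of keeping those details private.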

CC @ggerganov maybe this is helpful for you

ggerganov · 2026-06-11T16:22:05Z

Should be good to merge after review of the Python code.

bartowski1182 · 2026-06-11T18:24:50Z

+
+            # target_layers: derived from target model layer count (low/mid/high)
+            target_num_layers = target_config["num_hidden_layers"]
+            target_layers = [2, target_num_layers // 2, target_num_layers - 3]


Should we also prefer the eagle3 config when eagle_aux_hidden_state_layer_ids is present?

Same question for vocab size when draft_vocab_size exists

Should we also prefer the eagle3 config when eagle_aux_hidden_state_layer_ids is present?

Good question. First, many Eagle3 checkpoints do not include eagle_aux_hidden_state_layer_ids. Also, different Eagle3 checkpoints interpret layer_ids differently: some expect the IDs to be set before extracting the layers, while others expect them to be set afterward, which can sometimes require adding +1.
To avoid this ambiguity, I decided to compute the values manually based on the original paper and its implementation, rather than relying on the Eagle3 config. This ensures that the target layers are correct without postprocessing and keeps the code logic aligned.
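The low/mid/high rule from the diff above can be written as a small standalone helper. The function name `eagle3_target_layers` and the sanity check on very shallow models are additions for illustration and are not part of the PR.

```python
def eagle3_target_layers(num_hidden_layers: int) -> list[int]:
    """Derive the low/mid/high feature-extraction layers from the target layer count."""
    layers = [2, num_hidden_layers // 2, num_hidden_layers - 3]
    # Extra guard (an assumption, not in the PR): the three ids must be distinct and increasing.
    if not layers[0] < layers[1] < layers[2]:
        raise ValueError(f"target model too shallow: {num_hidden_layers} layers")
    return layers


# Example: a 32-layer target model.
layers_32 = eagle3_target_layers(32)
```

For a 32-layer target this yields layers 2, 16 and 29, independent of whatever layer ids the draft checkpoint's config may or may not contain.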

Same question for vocab size when draft_vocab_size exists

Both draft_vocab_size and the target model’s vocab_size are needed when performing the d2t vocab mapping for Eagle3. The target model’s vocab_size serves as an assertion to ensure that the d2t mapping does not go out of vocabulary.

yikes, guess the EAGLE3 rollout has not been smooth 😅

thanks for the clarity! unfortunate but logical :)

yeah it is. Thanks for the review!

bartowski1182 · 2026-06-11T18:26:09Z

I notice some of the more recent uploads on huggingface (like https://huggingface.co/nvidia/Kimi-K2.6-Eagle3) don't use any of the listed archs, instead using Eagle3DeepseekV2ForCausalLM

Is there any way we can understand Eagle3 at the start of the arch as an eagle3 draft checkpoint and route it the same way?

ruixiang63 · 2026-06-11T19:05:26Z

I notice some of the more recent uploads on huggingface (like https://huggingface.co/nvidia/Kimi-K2.6-Eagle3) don't use any of the listed archs, instead using Eagle3DeepseekV2ForCausalLM

Is there any way we can understand Eagle3 at the start of the arch as an eagle3 draft checkpoint and route it the same way?

Oh, thanks for sharing this. I wasn’t aware that there is a new Kimi 2.6 Eagle3 model checkpoint. (The Eagle3 ecosystem is still growing, so it is great that we have it in llama.cpp now.)
The config looks different, and I haven’t had a chance to look into it yet. I’m not sure how they implemented Eagle3 here; maybe they used a decoder layer from DeepSeekV2?
All the Eagle3 checkpoints I’ve seen so far are based on Llama’s decoder layer, which matches the original paper. Maybe this one is different.

I also found another Kimi 2.6 Eagle3 config that seems aligned with our current approach: https://huggingface.co/lightseekorg/kimi-k2.6-eagle3/blob/main/config.json

CISC · 2026-06-11T19:16:13Z

+                        raise ValueError(f"EAGLE-3 d2t target ids out of range for target vocab size {self.target_vocab_size}")
+                    if np.unique(data).size != data.size:
+                        raise ValueError("EAGLE-3 d2t contains duplicate target ids")
+                data_qtype = gguf.GGMLQuantizationType.I64


Future-proofing is nice and all, but n_tokens and token ids are limited to int32_t, what is the original dtype?

The original d2t dtype is torch.int64 in the eagle3 checkpoint. That's why I used I64 to preserve that and avoid any accidental truncation after converting it to absolute target ids.

And we have an assert check here: https://github.com/ruixiang63/llama.cpp/blob/0bd54498f273bf290a8fd55152deedf8e7c878dc/src/models/eagle3.cpp#L309
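A pure-Python sketch of the checks discussed in this thread: converting d2t entries to absolute target ids, rejecting out-of-range and duplicate ids, and confirming that the result would also fit in int32. It assumes that the checkpoint stores d2t as offsets relative to the draft token index, which is how the "converting it to absolute target ids" step is read here. The function name `d2t_to_absolute` is hypothetical and the PR's converter uses NumPy instead.

```python
INT32_MAX = 2**31 - 1


def d2t_to_absolute(d2t_offsets: list[int], target_vocab_size: int) -> list[int]:
    """Map each draft token id to an absolute target token id and validate the result."""
    # Assumption: target_id = draft_id + offset stored at position draft_id.
    absolute = [draft_id + offset for draft_id, offset in enumerate(d2t_offsets)]
    if any(t < 0 or t >= target_vocab_size for t in absolute):
        raise ValueError(f"d2t target ids out of range for target vocab size {target_vocab_size}")
    if len(set(absolute)) != len(absolute):
        raise ValueError("d2t contains duplicate target ids")
    # Token ids are limited to int32 elsewhere, so the ids must fit even if stored as int64.
    if max(absolute, default=0) > INT32_MAX:
        raise ValueError("d2t target ids do not fit in int32")
    return absolute


# Toy example: 3 draft tokens mapped into a 10-token target vocabulary.
absolute_ids = d2t_to_absolute([0, 2, 5], target_vocab_size=10)
```

Because every id is checked against the target vocabulary size, storing the tensor as I64 preserves the original dtype without allowing values that a 32-bit token id could not represent.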

CISC · 2026-06-11T19:20:25Z

            self.origin_hf_arch = hparams.get('architectures', [None])[0]

+        # Detect eagle3 draft checkpoint by hparams (some models don't use a distinct HF arch name)
+        if "draft_vocab_size" in self.hparams and self.hparams["num_hidden_layers"] == 1:


Not important right now, but I'm guessing all this will basically be duplicated for every arch supported with very little if any differences? Would be nice if it can be refactored in a reusable way.

Good point! I kept this local for now because all Eagle3 checkpoints I have encountered so far are based on the Llama decoder (no matter where they come from: RedHat, LMSYS, NVIDIA, etc.), and this PR only targets that path unless we find an Eagle3 checkpoint based on a different architecture (potentially this one, #18039 (comment), but I'm not sure).

If another architecture needs Eagle3 conversion later, this should be the first piece to factor out.
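As a rough sketch of what a reusable detection helper could look like, the function below combines the hparams check from the diff above with the arch-name prefix idea raised earlier in the thread. The name `is_eagle3_draft` is hypothetical, and the prefix rule is only a suggestion from the discussion: whether checkpoints matched this way can actually be converted by the current code path is not established here.

```python
def is_eagle3_draft(hparams: dict) -> bool:
    """Heuristically decide whether a HF config describes an Eagle3 draft checkpoint."""
    archs = hparams.get("architectures") or [None]
    arch = archs[0]
    # Suggested rule (not in the PR): arch names such as "Eagle3DeepseekV2ForCausalLM".
    if isinstance(arch, str) and arch.startswith("Eagle3"):
        return True
    # Rule from the diff: a draft vocab plus a single decoder layer.
    return "draft_vocab_size" in hparams and hparams.get("num_hidden_layers") == 1


# Example configs (values are illustrative only).
by_prefix = is_eagle3_draft({"architectures": ["Eagle3DeepseekV2ForCausalLM"]})
by_hparams = is_eagle3_draft(
    {"architectures": ["LlamaForCausalLM"], "draft_vocab_size": 32000, "num_hidden_layers": 1}
)
plain_llama = is_eagle3_draft({"architectures": ["LlamaForCausalLM"], "num_hidden_layers": 32})
```

Keeping both rules in one function would give other architectures a single place to hook into if they need Eagle3 conversion later.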

pwilkin

Okay, would probably prefer the property to live in llama.py since Eagle is the only arch that uses it, but I'll let @CISC decide that one :)

CISC · 2026-06-11T20:42:32Z

Okay, would probably prefer the property to live in llama.py since Eagle is the only arch that uses it, but I'll let @CISC decide that one :)

That will change though, so makes sense that it's in base.

ruixiang63 · 2026-06-11T20:42:48Z

Okay, would probably prefer the property to live in llama.py since Eagle is the only arch that uses it, but I'll let @CISC decide that one :)

Thanks for the approval!
Makes sense. I kept it in ModelBase because target_model_dir is passed through the common converter constructor, even though Eagle3 is the only current consumer.
DFlash would be another consumer in the future :) #22105

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ruixiang63 · 2026-06-11T21:12:50Z

Thanks @bartowski1182 @pwilkin @CISC for the review! I guess this PR is good to merge :) CC @ggerganov

CISC · 2026-06-11T21:15:30Z

Let's see if we win the \r\n lottery today...

Edit: Yay, green: https://github.com/ggml-org/llama.cpp/actions/runs/27377588058/job/80906815068?pr=18039

AbdulrahmanHashem · 2026-06-12T09:04:16Z

I cannot find a Qwen3.6 Eagle3 model. If it exists, can someone please link it?

laurentpayot · 2026-06-12T09:21:29Z

I cannot find a Qwen3.6 Eagle3 model. If it exists, can someone please link it?

Btw, direct GGUF files would be greatly appreciated 😅

sswtodo · 2026-06-12T10:56:01Z

Thank you for introducing Eagle3!

Here are benchmarks from ruixiang63 -> #18039 (comment)

I ran additional benchmarks. Here are the results on FreeBSD + 7900 XTX:

* [RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3](https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3)
* [gemma-4-26B-A4B-it-GGUF Q4_K_M](https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF)
* [Assistant IT Q8_0](https://huggingface.co/google/gemma-4-26B-A4B-it-assistant)

reasoning off

| Prompt | Baseline | EAGLE3 (draft_size=1) | Speedup |
| --- | --- | --- | --- |
| Write a quicksort algorithm in Python. Write code only. | 120.81 t/s | 128.68 t/s (89.4%) | 1.06x |
| Explain the Pythagorean theorem | 119.30 t/s | 127.06 t/s (84%) | 1.06x |
| Plan a 1 day trip to DC | 115.70 t/s | 103.24 t/s (75.6%) | 0.89x |
| Paste file content llama-context.cpp | 94.87 t/s | 75.77 t/s (41.9%) | 0.79x |

| Prompt | Baseline | Assistant (draft_size=1) | Speedup |
| --- | --- | --- | --- |
| Write a quicksort algorithm in Python. Write code only. | 120.81 t/s | 169.12 t/s (94.4%) | 1.39x |
| Explain the Pythagorean theorem | 119.30 t/s | 164.68 t/s (89.9%) | 1.38x |
| Plan a 1 day trip to DC | 115.70 t/s | 146.29 t/s (71.9%) | 1.26x |
| Paste file content llama-context.cpp | 94.87 t/s | 116.51 t/s (73.6%) | 1.23x |

According to the above, the Assistant MTP issue was resolved in PR #24277. However, I’m still seeing a similar performance regression with Eagle3: while working with an agentic tool like claude-code, I’m observing a significant drop in draft acceptance, down to between 9% and 33%, together with a drop in throughput.
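One way to reason about why low acceptance turns into a slowdown is the standard idealized model of speculative decoding: each verification step yields one target token plus however many consecutive draft tokens are accepted, while costing one target pass plus the draft passes. The sketch below assumes independent per-token acceptance and a fixed relative draft cost; the function name `expected_speedup` and the example numbers are hypothetical and are not measurements from this thread.

```python
def expected_speedup(acceptance_rate: float, draft_cost_ratio: float, n_draft: int = 1) -> float:
    """Idealized speedup over plain decoding.

    acceptance_rate:  probability that a single draft token is accepted (assumed i.i.d.)
    draft_cost_ratio: cost of one draft step relative to one target step
    n_draft:          number of draft tokens proposed per verification step
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    # Expected tokens per step: 1 + a + a^2 + ... + a^n_draft.
    tokens_per_step = sum(acceptance_rate**k for k in range(n_draft + 1))
    # Cost per step in units of one target forward pass.
    cost_per_step = 1.0 + n_draft * draft_cost_ratio
    return tokens_per_step / cost_per_step


# Hypothetical numbers: the same draft overhead at high and low acceptance.
high_acceptance = expected_speedup(0.9, 0.3)
low_acceptance = expected_speedup(0.2, 0.3)
```

Under this model with draft_size=1, the break-even point is where the acceptance rate equals the relative draft cost, so any acceptance rate below the draft overhead produces a net slowdown.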

Advice is appreciated

fafinet · 2026-06-12T13:37:34Z

It crashes on load with -sm tensor for me using gemma 4 31b and redhat's eagle3 model converted to bf16.

0.57.900.481 I srv    load_model: initializing slots, n_slots = 4
0.58.099.454 I common_speculative_impl_draft_eagle3: adding speculative implementation 'draft-eagle3'
0.58.099.462 I common_speculative_impl_draft_eagle3: - n_max=8, n_min=0, p_min=0.500000
/home/local/downloads/llama.cpp/ggml/src/ggml-backend-meta.cpp:728: GGML_ASSERT(src_ss[0].axis != GGML_BACKEND_SPLIT_AXIS_1) failed
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(+0x18665) [0x7f8670d5f665]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x1df) [0x7f8670d5fa3f]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x11e) [0x7f8670d5fbce]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(+0x41f2f) [0x7f8670d88f2f]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(+0x39a75) [0x7f8670d80a75]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(+0x42cac) [0x7f8670d89cac]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(+0x45420) [0x7f8670d8c420]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(ggml_gallocr_alloc_graph+0x483) [0x7f8670d74c23]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_alloc_graph+0x111) [0x7f8670d7af11]
/home/local/downloads/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xbd) [0x7f86704d291d]
/home/local/downloads/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x368) [0x7f86704d83d8]
/home/local/downloads/llama.cpp/build/bin/libllama.so.0(llama_decode+0xb) [0x7f86704d9fbb]
/home/local/downloads/llama.cpp/build/bin/libllama-common.so.0(_Z25common_context_can_seq_rmP13llama_context+0xc8) [0x7f86709e3b08]
/home/local/downloads/llama.cpp/build/bin/libllama-server-impl.so(_ZN19server_context_impl10load_modelER13common_params+0xcd4) [0x7f8671572314]
/home/local/downloads/llama.cpp/build/bin/libllama-server-impl.so(_Z12llama_serveriPPc+0x2ece) [0x7f86714c57de]
/lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f8670e35ca8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f8670e35d65]
./.llama-cpp/llama-server(+0x11b1) [0x55c5ab6f21b1]
Aborted

loci-dev mentioned this pull request Dec 14, 2025

UPSTREAM PR #18039: [Speculative decoding] feat: add EAGLE3 speculative decoding support auroralabs-loci/llama.cpp#568

Open

github-actions Bot added the model, examples, python, and ggml labels Dec 14, 2025

github-actions Bot mentioned this pull request Dec 15, 2025

Reddit News Daily 2025-12-15 gitlawr/reddit-daily-news#94

Open

ggerganov reviewed Dec 15, 2025

Comment thread src/models/eagle3.cpp Outdated

ggerganov reviewed Dec 17, 2025

CISC linked an issue Dec 22, 2025 that may be closed by this pull request

Feature Request: Support EAGLE3 models for draft model / speculative decoding use cases #15305

Closed

4 tasks

jagusztinl mentioned this pull request Dec 29, 2025

Feature Request: support for EAGLE3 ikawrakow/ik_llama.cpp#1081

Open

4 tasks

srogmann mentioned this pull request Dec 29, 2025

Add self‑speculative decoding (no draft model required) #18471

Merged

pwilkin added the hot label Jan 6, 2026

ggerganov marked this pull request as ready for review June 11, 2026 12:26

ggerganov requested review from a team, CISC and JohannesGaessler as code owners June 11, 2026 12:26

ggerganov marked this pull request as draft June 11, 2026 12:26

ggerganov added 2 commits June 11, 2026 15:50

llama : clean-up names

9baa68b

cont : add assert + comment

0bd5449

ggerganov marked this pull request as ready for review June 11, 2026 14:08

bartowski1182 reviewed Jun 11, 2026

pwilkin requested changes Jun 11, 2026

Comment thread conversion/base.py

Comment thread conversion/base.py

Comment thread conversion/base.py

CISC reviewed Jun 11, 2026

pwilkin approved these changes Jun 11, 2026

CISC approved these changes Jun 11, 2026

Comment thread conversion/llama.py Outdated

Update conversion/llama.py

7c42aff

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

CISC approved these changes Jun 11, 2026

ggerganov approved these changes Jun 12, 2026

ggerganov merged commit 88a3927 into ggml-org:master Jun 12, 2026
28 of 29 checks passed

LiaXLiang mentioned this pull request Jun 12, 2026

docs: add eagle3 to speculative doc #24540

Open

Zertruermmerdog mentioned this pull request Jun 12, 2026

Eval bug: EAGLE3 with Qwen3.6 (qwen3_5 hybrid) target — missing t_layer_inp hooks, and llama_decode(ctx_dft) rc=-1 once context exceeds ~700 tokens #24541

Open


		// EAGLE3 draft model - target model hidden size
		uint32_t eagle3_target_hidden_size = 0;

How to run EAGLE3 in llama.cpp

Requirements

Step 1: Convert Models to GGUF Format

Step 2: Compile llama.cpp

[Optional] Step 3: Quantize the GGUF model

Step 4: Run EAGLE3 Speculative Decoding

Performance Evaluation (RTX A6000 48GB)

Details of GGML backend modifications (Fixed, no longer needed)

Examples results

Future Steps
