[Speculative decoding] feat: add EAGLE3 speculative decoding support #18039

Merged
ggerganov merged 28 commits into ggml-org:master from ruixiang63:eagle3-adapt-new-arch on Jun 12, 2026

Conversation

@ruixiang63

@ruixiang63 ruixiang63 commented Dec 14, 2025

Contributor

Important

The old PR has been backed up in this branch: https://github.com/ruixiang63/llama.cpp/tree/eagle3-v1-backup
The new commits in this PR have been rebased onto the latest master branch, refactored to use the new speculative API, cherry-picked from #22728, and made compatible with MTP.

Tip

New Eagle3 models for Gemma4 are now supported. With reasoning enabled, speedup can exceed 2x. With reasoning disabled, it can reach over 3x. See #18039 (comment)
With Q4_K_M quantization, the speedup still looks good. #18039 (comment)

As discussed in #15902, Eagle3 represents the current SOTA in speculative decoding and is widely adopted across the industry. Integrating Eagle3 into llama.cpp enhances its performance and strengthens its competitiveness among leading inference frameworks. With Eagle3 speculative decoding now integrated into llama.cpp, inference performance has been significantly improved, achieving a 2–3× speedup.
This enhancement is the result of close collaboration between the NVIDIA and GGML teams, showcasing a strong technical partnership.

The following provides a brief overview of this PR:

EAGLE3 is an encoder-decoder based speculative decoding method:

  • Extracts features from target model at specific layers
  • Uses feature fusion layer to compress target features
  • Generates draft tokens with single-layer decoder
  • Maps draft vocabulary to target vocabulary via d2t tensor
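
As a rough illustration of the feature-fusion and vocabulary-mapping bullets above, the toy sketch below concatenates three target-layer feature vectors, projects them with a single FC layer, and maps a draft-vocabulary id to a target-vocabulary id. The function names, the tiny sizes, and the assumption that d2t stores the target id directly (a real checkpoint may store an offset instead) are illustrative and are not taken from this PR's code.

```python
def fuse_features(low, mid, high, fc_weight):
    """Concatenate three target-layer features and project them to n_embd."""
    concat = low + mid + high  # length 3 * n_embd, e.g. 12288 for n_embd = 4096
    return [sum(w * x for w, x in zip(row, concat)) for row in fc_weight]


def map_draft_to_target(draft_id, d2t):
    """Assumption: d2t[draft_id] is the matching target-vocabulary id."""
    return d2t[draft_id]


# Toy sizes: n_embd = 2, so the concatenated feature has 6 entries.
low, mid, high = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
fc_weight = [
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],  # sums the first component of each layer
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],  # sums the second component of each layer
]
fused = fuse_features(low, mid, high, fc_weight)
target_id = map_draft_to_target(1, d2t=[10, 42, 77])
print(fused, target_id)
```

The real encoder works on [3*n_embd, n_tokens] tensors on the backend; the list arithmetic here only shows the shape bookkeeping.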

Key changes:

  • Add LLM_ARCH_EAGLE3 architecture
  • Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
  • Add feature extraction from target model layers
  • Add g_embeddings handling for decoder input
  • Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
  • Add --eagle3 flag for speculative-simple example
  • Add EAGLE3 model conversion in convert_hf_to_gguf.py

EAGLE3 Architecture Overview:

┌─────────────────────────────────────────────────────────────────┐
│                    EAGLE3 Overview                              │
└─────────────────────────────────────────────────────────────────┘

  Target Model          EAGLE3 Encoder         EAGLE3 Decoder
  (LLaMA 8B)              (FC Layer)           (1-layer Transformer)
       │                      │                       │
       │                      │                       │
       ▼                      ▼                       ▼
┌─────────────┐        ┌─────────────┐        ┌─────────────────┐
│  Generate   │        │  Compress   │        │  Generate Draft │
│  Features   │───────►│  Features   │───────►│  Tokens Fast    │
│  [12288]    │        │  [4096]     │        │  [k tokens]     │
└─────────────┘        └─────────────┘        └────────┬────────┘
                                                       │
                                                       ▼
                                              ┌─────────────────┐
                                              │  Verify Drafts  │
                                              │  with Target    │
                                              └─────────────────┘

How to run EAGLE3 in llama.cpp

Requirements

This PR currently supports the following EAGLE3 models:

The following eagle3 models should also work out of the box, though they haven’t been tested yet:

Step 1: Convert Models to GGUF Format

  • Convert Target Model
TARGET_MODEL_HF="${MODELS_DIR}/Meta-Llama-3.1-8B-Instruct"
TARGET_MODEL_GGUF="${MODELS_DIR}/Meta-Llama-3.1-8B-Instruct_bf16.gguf"

python convert_hf_to_gguf.py \
    "${TARGET_MODEL_HF}" \
    --outtype bf16 \
    --outfile "${TARGET_MODEL_GGUF}"
  • Convert EAGLE3 Draft Model
TARGET_MODEL_HF="${MODELS_DIR}/Meta-Llama-3.1-8B-Instruct"
EAGLE3_MODEL_HF="${MODELS_DIR}/EAGLE3-LLaMA3.1-Instruct-8B"
EAGLE3_MODEL_GGUF="${MODELS_DIR}/EAGLE3-LLaMA3.1-Instruct-8B_fp16.gguf"

python convert_hf_to_gguf.py \
    "${EAGLE3_MODEL_HF}" \
    --outtype f16 \
    --target-model-dir "${TARGET_MODEL_HF}" \
    --outfile "${EAGLE3_MODEL_GGUF}"

Step 2: Compile llama.cpp

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

[Optional] Step 3: Quantize the GGUF model

# Quantize the target model
./build/bin/llama-quantize \
  "${TARGET_MODEL_GGUF}" \
  "${TARGET_MODEL_GGUF}_Q4_K_M.gguf" \
  Q4_K_M

# Quantize the EAGLE3 draft model
./build/bin/llama-quantize \
  "${EAGLE3_MODEL_GGUF}" \
  "${EAGLE3_MODEL_GGUF}_Q4_K_M.gguf" \
  Q4_K_M

Step 4: Run EAGLE3 Speculative Decoding

./build/bin/llama-server \
    -m  Qwen3-8B.gguf \
    -md qwen3_8b_eagle3.gguf \
    --spec-type draft-eagle3 \
    --spec-draft-n-max 8 \
    --spec-draft-p-min 0.5 \
    -np 1 \
    -c 4096 --port 8080 -ngl 99 -fa on \
    --jinja --fit off

curl -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Write a quicksort algorithm in Python. Write code only."}
        ],
        "max_tokens": 256,
        "temperature": 0
    }'

curl -X POST http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Explain the Pythagorean theorem"}
        ],
        "max_tokens": 256,
        "temperature": 0
    }'
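
The same request can be issued from a script. The sketch below mirrors the curl payload above using only the Python standard library; the helper names are hypothetical, and it assumes the server from Step 4 is listening on localhost:8080. Building the request is kept separate from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080", max_tokens=256):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request("Explain the Pythagorean theorem")
# reply = send_chat_request(request)  # uncomment once llama-server is listening
```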

Performance Evaluation (RTX A6000 48GB)

Tip

Using the chat_template for each model version can improve acceptance rates. Always apply the model’s corresponding chat_template when constructing prompts.

Note

After refactoring, the performance data below may differ from current results, especially since llama-server now supports Eagle3 as well. However, the data is still useful for getting a general sense of the speedup Eagle3 provides.

  • LLaMA3.1-Instruct-8B with BF16, its Eagle3 with FP16
| Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
|---|---|---|---|---|
| Write a quicksort algorithm in Python. Write code only. | 44.5 t/s | 146.2 t/s | 80.6% | 3.28x |
| Explain the Pythagorean theorem | 44.5 t/s | 127.1 t/s | 77.4% | 2.85x |
| Plan a 1 day trip to DC | 44.5 t/s | 113.8 t/s | 80.9% | 2.55x |
  • LLaMA3.1-Instruct-8B with Q4_K_M, its Eagle3 with Q4_K_M
| Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
|---|---|---|---|---|
| Write a quicksort algorithm in Python. Write code only. | 121.5 t/s | 274.4 t/s | 92.5% | 2.26x |
| Explain the Pythagorean theorem | 121.4 t/s | 238.9 t/s | 79.4% | 1.97x |
| Plan a 1 day trip to DC | 121.4 t/s | 196.5 t/s | 77.2% | 1.62x |
  • LLaMA3.3-Instruct-70B with Q4_K_M, its Eagle3 with Q4_K_M
| Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
|---|---|---|---|---|
| Write a quicksort algorithm in Python. Write code only. | 15.6 t/s | 33.4 t/s | 73.6% | 2.14x |
| Explain the Pythagorean theorem | 15.6 t/s | 37.6 t/s | 82.0% | 2.41x |
| Plan a 1 day trip to DC | 15.6 t/s | 28.8 t/s | 69.3% | 1.85x |
  • Qwen3-8B with BF16, its Eagle3 with BF16
| Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
|---|---|---|---|---|
| Write a quicksort algorithm in Python. Write code only. | 43.6 t/s | 94.8 t/s | 69.8% | 2.17x |
| Explain the Pythagorean theorem | 43.6 t/s | 86.8 t/s | 68.3% | 1.99x |
| Plan a 1 day trip to DC | 43.6 t/s | 70.7 t/s | 57.3% | 1.62x |
  • Qwen3-14B with BF16, its Eagle3 with BF16
| Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
|---|---|---|---|---|
| Write a quicksort algorithm in Python. Write code only. | 24.4 t/s | 35.7 t/s | 40.4% | 1.46x |
| Explain the Pythagorean theorem | 24.4 t/s | 34.5 t/s | 41.3% | 1.41x |
| Plan a 1 day trip to DC | 24.3 t/s | 30.5 t/s | 28.0% | 1.26x |
  • Qwen3-32B with Q4_K_M, its Eagle3 with Q4_K_M
| Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
|---|---|---|---|---|
| Write a quicksort algorithm in Python. Write code only. | 32.0 t/s | 39.7 t/s | 39.7% | 1.24x |
| Explain the Pythagorean theorem | 32.0 t/s | 41.5 t/s | 43.3% | 1.30x |
| Plan a 1 day trip to DC | 32.0 t/s | 37.1 t/s | 32.6% | 1.16x |
  • Qwen3-30B-A3B with BF16, its Eagle3 with BF16 (tested on NVIDIA DGX Spark 128GB; speedup might be better on other hardware)
| Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
|---|---|---|---|---|
| Write a quicksort algorithm in Python. Write code only. | 31.1 t/s | 43.3 t/s | 64.4% | 1.39x |
| Explain the Pythagorean theorem | 31.2 t/s | 41.2 t/s | 60.6% | 1.32x |
| Plan a 1 day trip to DC | 30.9 t/s | 38.6 t/s | 58.8% | 1.25x |

| Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
|---|---|---|---|---|
| Write a quicksort algorithm in Python. Write code only. | 61.3 t/s | 65.05 t/s | 74.25% | 1.06x |
| Explain the Pythagorean theorem | 61.2 t/s | 58.13 t/s | 69.23% | 0.95x |
| Plan a 1 day trip to DC | 61.4 t/s | 54.50 t/s | 62.96% | 0.89x |

Details of GGML backend modifications (Fixed, no longer needed)

In the Eagle3 decoder, two parallel inputs are processed:

input_embeds ──→ RMS_NORM ──┐
                            ├──→ CONCAT ──→ Transformer Decoder
g_embeddings ──→ RMS_NORM ──┘

When both RMS_NORM operations run in the same GPU split, a lack of synchronization causes buffer contention and race conditions (CPU execution is fine as it auto‑syncs between subgraphs).

Solution:
Use ggml_set_sync() to add a synchronization point after the first RMS_NORM, forcing the scheduler to create a split boundary and synchronize before continuing.

input_embeds ──→ RMS_NORM ──→ [SYNC] ──┐
                                       ├──→ CONCAT ──→ Transformer Decoder
g_embeddings ─────────────→ RMS_NORM ──┘
         (split 1)            |         (split 2)
                           barrier

This ensures correct execution and can be applied to any parallel path that needs synchronization, not just Eagle3.

Example results

  • Prompt: "Write a quicksort algorithm in Python. Write code only."
[image]
  • Prompt: "Explain the Pythagorean theorem"
[image]
  • Prompt: "Plan a 1 day trip to DC"
[image]

Future Steps

  • Support more Eagle3 models (currently supported: Qwen, GPT-OSS, Llama)
  • Eagle3 was initially integrated only in llama-speculative-simple; it now also supports llama-server
  • Support context-dependent tree sampling (tree attention) as described in the Eagle3 paper to improve accept rate
  • Support batch processing (batch size > 1) with Eagle3 speculative decoding

Comment thread src/models/eagle3.cpp Outdated
@ngxson

ngxson commented Dec 15, 2025

Collaborator

Judging by the description of this PR, I believe many models with multiple-token prediction also have the same strategy of reusing hidden features from the main model.

It can be quite interesting to generalize this feature to support other models. I would expect some kind of sub-llama_context that allows both the main and draft models to share the same cgraph, avoiding the need to explicitly pass the intermediate embeddings through host memory.

@ggerganov

Member

It can be quite interesting to generalize this feature to support other models.

I will definitely be looking at refactoring the implementation to become more generic before merging it. The initial results in terms of performance are really great, but we'll need to work on cleaning up the code and reducing the special-casing in several places. I'll try to provide insights on how to do that in the next few days.

@ruixiang63

Contributor Author

It can be quite interesting to generalize this feature to support other models.

I will definitely be looking at refactoring the implementation to become more generic before merging it. The initial results in terms of performance are really great, but we'll need to work on cleaning up the code and reducing the special-casing in several places. I'll try to provide insights on how to do that in the next few days.

Thanks @ggerganov @ngxson for your inputs. Definitely, looking forward to hearing your feedback and improving this PR.

Comment thread src/models/eagle3.cpp
Comment thread src/models/llama.cpp Outdated
Comment on lines +26 to +35
// EAGLE3: Extract intermediate layer features from target model at layer INPUT
if (eagle3 && cparams.eagle3_extract_enabled && !eagle3->extract_layer_indices.empty()) {
    static const char * eagle3_extract_names[] = {"eagle3_extract_0", "eagle3_extract_1", "eagle3_extract_2"};
    for (size_t i = 0; i < eagle3->extract_layer_indices.size() && i < 3; ++i) {
        if (eagle3->extract_layer_indices[i] == il) {
            cb(inpL, eagle3_extract_names[i], il);
            break;
        }
    }
}

Member

Next, I will look to remove this ad hoc logic and generalize it in some way, likely by passing the extraction points in a more generic way during llama_context creation. TBD

Comment thread src/llama-hparams.h
Comment on lines +195 to +198

// EAGLE3 draft model - target model hidden size
uint32_t eagle3_target_hidden_size = 0;

Member

This can become more generic by renaming it to n_embd_enc and utilizing the n_embd_inp() call.

Comment thread include/llama.h Outdated
Comment on lines +875 to +878
// Get pointer to target model features extracted for EAGLE3 encoder
// Returns NULL if no features are available
// Format: [3*n_embd, n_tokens] - use model.hparams.n_embd and batch.n_tokens for dimensions
LLAMA_API const float * llama_get_eagle3_target_features(struct llama_context * ctx);

Member

This call should become more generic and not Eagle3-specific. I will look into the best way to achieve this.

Comment thread include/llama.h Outdated
Comment on lines +880 to +887
// Set g_embeddings from EAGLE3 encoder output for decoder input
// g_embd: pointer to encoder output embeddings
LLAMA_API void llama_set_eagle3_g_embeddings(
    struct llama_context * ctx,
    const float * g_embd,
    int32_t n_embd,
    int32_t n_tokens);

Member

Might be possible to avoid this API if we combine the Eagle encoder and decoder in a single context. TBD

@ruixiang63 ruixiang63 Dec 17, 2025

Contributor Author

When combining the Eagle3 encoder and decoder into a single context, note that the Eagle3 encoder is used only to fuse the extracted features from the target model, i.e. it is invoked as many times as the target model itself. The Eagle3 decoder, on the other hand, is solely responsible for generating draft tokens in an autoregressive way.
llama_set_eagle3_g_embeddings() sets the g_embedding both from the Eagle3 encoder (used in the first generation step of the Eagle3 decoder) and from the Eagle3 decoder itself (used in subsequent generation steps).
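
The interaction described above can be sketched as a small loop: the encoder runs once per target step, and the decoder then feeds its own output back in as the next g_embeddings. This is a toy model with stand-in callables, not the PR's implementation; draft_tokens, toy_encoder and toy_decoder are hypothetical names.

```python
def draft_tokens(target_features, last_token, n_draft, encoder, decoder):
    """Run one drafting round: one encoder call, then n_draft decoder steps."""
    g_embd = encoder(target_features)  # first step: g_embeddings come from the encoder
    token = last_token
    drafts = []
    for _ in range(n_draft):
        # The decoder returns the next draft token and the hidden state that
        # becomes g_embd for the following step.
        token, g_embd = decoder(token, g_embd)
        drafts.append(token)
    return drafts


# Toy stand-ins so the loop can be executed.
encoder_calls = []


def toy_encoder(features):
    encoder_calls.append(features)
    return sum(features)


def toy_decoder(token, g_embd):
    return token + 1, g_embd + 1


drafts = draft_tokens([0.5, 1.5], last_token=10, n_draft=3,
                      encoder=toy_encoder, decoder=toy_decoder)
```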

Member

Yup, I noticed this interaction. We don't have a previous use case similar to this, but I think the enc-dec context could be adapted accordingly.

@pwilkin

pwilkin commented Jan 6, 2026

Member

Bumping, is there any progress on this? It's probably one of the more coveted features to have right now.

@ggerganov

Member

Bumping, is there any progress on this?

I'm currently side-tracked by some graph reallocation optimizations. Will probably come back to this after that.

@pwilkin pwilkin added the hot label Jan 6, 2026
@ruixiang63

Contributor Author

Eagle3 checkpoints for the Qwen3 series (including both dense and MoE models) are now supported, see the updated PR description for details.
Although these Eagle3 checkpoints come from third parties, they can still deliver a 1–2× speedup.
Speculative decoding performance for MoE models is not as good as dense models, which is expected, since more experts are invoked during the parallel verification phase than during the target model’s decoding phase.

@ruixiang63

ruixiang63 commented Jan 9, 2026

Contributor Author

One question: it seems that CUDA Graph is disabled when the input n_tokens > 1. During the target model verification stage of speculative decoding, CUDA Graph is always disabled for the target model, since it’s only used for verification with multiple draft tokens > 1. However, we can fix the number of draft tokens (e.g., by using padding) to make it constant and thus enable CUDA Graph (may need to remove n_tokens > 1 constraint)? @ggerganov

Context: I’m testing GPT-OSS-120B Eagle3 with llama.cpp, and I found that even with Eagle3 (accept rate 86%), the performance is worse than the naive llama-cli. After profiling, I discovered that CUDA Graph is consistently disabled for the target model during speculative decoding, whereas it remains enabled in llama-cli. This results in the target model’s verification (prefilling) phase being more than 5× slower than a normal autoregressive decoding step.
After disabling CUDA graphs for llama-cli using GGML_CUDA_DISABLE_GRAPHS=1, the eagle3 achieved roughly a 1.5× speedup.

I’ve only observed this performance issue with GPT-OSS-120B Eagle3. For other models, even without CUDA Graph enabled for target model in Eagle3 speculative decoding, the performance remains great.
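
The padding idea raised in the question above can be illustrated with a small sketch. It only shows the bookkeeping of padding a variable-length draft list to a constant size; whether this is enough to keep CUDA Graph enabled is not confirmed anywhere in this thread. pad_draft_batch and pad_id are hypothetical names.

```python
def pad_draft_batch(draft_tokens, n_fixed, pad_id):
    """Pad a draft list to n_fixed entries; return the padded list and real count."""
    n_real = len(draft_tokens)
    if n_real > n_fixed:
        raise ValueError("more drafts than the fixed batch size")
    return draft_tokens + [pad_id] * (n_fixed - n_real), n_real


padded, n_real = pad_draft_batch([101, 102, 103], n_fixed=8, pad_id=0)
# Only the first n_real verification results would be considered afterwards.
```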

@ggerganov

Member

Speculative decoding performance for MoE models is not as good as dense models, which is expected, since more experts are invoked during the parallel verification phase than during the target model’s decoding phase.

I think the small-batch mul_mat_id could be improved in the CUDA backend. AFAIR there the performance for batch sizes (1, 8] is not optimal atm. Need double check.

However, we can fix the number of draft tokens (e.g., by using padding) to make it constant and thus enable CUDA Graph (may need to remove n_tokens > 1 constraint)? @ggerganov

Possibly, but to me this sounds like second-order optimization. Optimizing the mul_mat_id for small batches should bring more generic benefits and would likely have larger impact for speculative decoding compared to enabling CUDA graphs.

After disabling CUDA graphs for llama-cli using GGML_CUDA_DISABLE_GRAPHS=1, the eagle3 achieved roughly a 1.5× speedup.

Hm, this is a bit surprising observation. Can you run a llama-batched-bench test on your system with and without CUDA graphs using the commands from #18308 (comment) and share the results. We are interested in batch sizes [1, 4]. So something like this:

llama-batched-bench -m [gpt-oss-120b] -c 65536 -b 2048 -ub 512 -npp 1024 -ntg 32 -npl 1,2,3,4,5,6,7,8

@ruixiang63

Contributor Author

Thanks very much for your inputs! @ggerganov

After disabling CUDA graphs for llama-cli using GGML_CUDA_DISABLE_GRAPHS=1, the eagle3 achieved roughly a 1.5× speedup.

Hm, this is a bit surprising observation. Can you run a llama-batched-bench test on your system with and without CUDA graphs using the commands from #18308 (comment) and share the results. We are interested in batch sizes [1, 4]. So something like this:

I double-checked the run today. The previous statement about cuda graph was incorrect due to instability and concurrent CPU activity in my test environment, sorry about that! Currently, enabling or disabling CUDA Graphs doesn’t have much impact in llama-cli for GPT-OSS-120B model. (I am testing on DGX Spark)

  • with cuda graph enabled: [ Prompt: 120.1 t/s | Generation: 47.2 t/s ]
  • without cuda graph enabled: [ Prompt: 119.2 t/s | Generation: 45.7 t/s ]

Also, the results for llama-batched-bench:

  • with cuda graph enabled
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 1024 | 256 | 1 | 1280 | 1.227 | 834.89 | 5.255 | 48.71 | 6.482 | 197.47 |
| 1024 | 256 | 2 | 2560 | 1.579 | 1296.74 | 9.277 | 55.19 | 10.856 | 235.81 |
| 1024 | 256 | 3 | 3840 | 2.284 | 1344.72 | 10.447 | 73.51 | 12.731 | 301.61 |
| 1024 | 256 | 4 | 5120 | 3.031 | 1351.58 | 11.550 | 88.66 | 14.580 | 351.16 |
| 1024 | 256 | 5 | 6400 | 3.780 | 1354.59 | 12.433 | 102.96 | 16.212 | 394.76 |
| 1024 | 256 | 6 | 7680 | 4.528 | 1356.95 | 13.347 | 115.08 | 17.874 | 429.66 |
| 1024 | 256 | 7 | 8960 | 5.304 | 1351.48 | 13.982 | 128.16 | 19.286 | 464.59 |
| 1024 | 256 | 8 | 10240 | 6.018 | 1361.20 | 14.704 | 139.28 | 20.722 | 494.16 |
  • without cuda graph enabled
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 1024 | 256 | 1 | 1280 | 1.279 | 800.61 | 5.758 | 44.46 | 7.037 | 181.90 |
| 1024 | 256 | 2 | 2560 | 1.597 | 1282.22 | 9.297 | 55.07 | 10.895 | 234.98 |
| 1024 | 256 | 3 | 3840 | 2.286 | 1343.84 | 10.383 | 73.97 | 12.669 | 303.11 |
| 1024 | 256 | 4 | 5120 | 3.031 | 1351.51 | 11.547 | 88.68 | 14.577 | 351.23 |
| 1024 | 256 | 5 | 6400 | 3.771 | 1357.64 | 12.438 | 102.91 | 16.209 | 394.84 |
| 1024 | 256 | 6 | 7680 | 4.525 | 1357.71 | 13.342 | 115.12 | 17.868 | 429.83 |
| 1024 | 256 | 7 | 8960 | 5.289 | 1355.17 | 13.986 | 128.13 | 19.275 | 464.84 |
| 1024 | 256 | 8 | 10240 | 5.999 | 1365.48 | 14.653 | 139.77 | 20.652 | 495.83 |
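
To read the tables in terms of batch scaling, a short script can normalize S_TG by its value at B = 1. The numbers are copied from the "with cuda graph enabled" table; the helper name tg_scaling is hypothetical. S_TG is the total generation throughput across the batch (for example, 2 × 256 / 9.277 ≈ 55.19 at B = 2), so perfect scaling would give a factor of B.

```python
# S_TG (t/s) per batch size B, copied from the "with cuda graph enabled" table.
tg_speed = {1: 48.71, 2: 55.19, 3: 73.51, 4: 88.66,
            5: 102.96, 6: 115.08, 7: 128.16, 8: 139.28}


def tg_scaling(speeds):
    """Return throughput at each batch size relative to batch size 1."""
    base = speeds[1]
    return {b: s / base for b, s in speeds.items()}


scaling = tg_scaling(tg_speed)
efficiency = {b: scaling[b] / b for b in scaling}  # 1.0 would be perfectly linear
for b in sorted(scaling):
    print(f"B={b}: {scaling[b]:.2f}x of B=1, {efficiency[b]:.0%} of linear")
```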

Possibly, but to me this sounds like second-order optimization. Optimizing the mul_mat_id for small batches should bring more generic benefits and would likely have larger impact for speculative decoding compared to enabling CUDA graphs.

I agree. CUDA graphs could be a second-order optimization.
Here are the Eagle3 GPT-OSS-120B test results on DGX Spark (I will also test this on other hardware):

| Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
|---|---|---|---|---|
| Write a quicksort algorithm in Python. Write code only. | 48.3 t/s | 52.2 t/s | 85.0% | 1.08x |
| Explain the Pythagorean theorem | 47.8 t/s | 46.5 t/s | 74.0% | 0.97x |
| Plan a 1 day trip to DC | 48.4 t/s | 40.0 t/s | 55.7% | 0.83x |

For MoE models, prefilling becomes the main performance bottleneck because more active experts are involved. As a result, the assumption that “processing multiple draft tokens concurrently is as fast as processing a single token” no longer holds, which is an important condition for effective speculative decoding. I also saw that as the draft token length increases, the verification cost of the target model also rises.
This explains the results shown in the table above: in some cases, Eagle3 can even degrade performance. To observe improvements, the accept rate must exceed a certain lower bound.
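
The lower bound on the accept rate can be made concrete with the usual back-of-the-envelope model for speculative decoding. The sketch assumes each draft token is accepted independently with probability accept_prob (a simplification, and not necessarily how the Accept Rate column above is computed), and that draft_cost and verify_cost are expressed relative to one plain target decode step. All names and the example costs are illustrative, not measurements.

```python
def expected_tokens_per_step(accept_prob, n_draft):
    """Expected tokens emitted per verify step, counting the target's own token."""
    if accept_prob >= 1.0:
        return float(n_draft + 1)
    return (1.0 - accept_prob ** (n_draft + 1)) / (1.0 - accept_prob)


def estimated_speedup(accept_prob, n_draft, draft_cost, verify_cost):
    """Costs are relative to one plain target decode step (cost 1.0)."""
    step_cost = n_draft * draft_cost + verify_cost
    return expected_tokens_per_step(accept_prob, n_draft) / step_cost


# Dense-like case: verifying 8 drafts is assumed to cost about one decode step.
dense = estimated_speedup(0.8, n_draft=8, draft_cost=0.05, verify_cost=1.0)
# MoE-like case: verification is assumed to cost 3x a decode step.
moe = estimated_speedup(0.8, n_draft=8, draft_cost=0.05, verify_cost=3.0)
print(f"dense-like: {dense:.2f}x, MoE-like: {moe:.2f}x")
```

Under this model, a higher verification cost raises the accept rate needed to break even, which is consistent with the sub-1x rows in the table.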

Do you have a rough idea of how much performance gain we can get by improving mul_mat_id?

@ggerganov

Member

The llama-batched-bench results are actually better than I expected. In the previously reported numbers there was a sharp dip at BS = 2. Here the TG performance steadily increases with the batch size, which is good, though it is not as linear as we want it to be.

I suppose the explanation is that for MoE models, at low batch sizes the amount of data we need to read from the weights for each batch increases linearly with the batch size (i.e. each extra token in the batch activates more experts and at small batch size the experts for each token are very likely different from each other). So it's probably normal that TG for MoE does not scale as well as TG for dense models as a function of the batch size.
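
The growth in weight traffic described above can be estimated with a simple probability model. The sketch assumes each token independently picks top_k distinct experts uniformly at random, which real routers do not do, and the 128-expert / top-4 configuration is only an illustrative choice. expected_active_experts is a hypothetical helper.

```python
def expected_active_experts(n_experts, top_k, batch_size):
    """Expected number of distinct experts hit by a batch under uniform routing."""
    # Probability that one given expert is not selected by any token in the batch.
    p_unused = (1.0 - top_k / n_experts) ** batch_size
    return n_experts * (1.0 - p_unused)


for b in (1, 2, 4, 8):
    print(b, round(expected_active_experts(128, 4, b), 1))
```

At small batch sizes the count grows almost linearly with the batch size, which matches the intuition that nearly every extra token brings in new expert weights.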

As a result, the assumption that “processing multiple draft tokens concurrently is as fast as processing a single token” no longer holds, which is an important condition for effective speculative decoding.

Yeah, that's my guess as well. Do we have some references to cross-check this? Do the Eagle3 authors discuss its performance for MoE models? Do we have sample numbers for gpt-oss-120b with Eagle3 using vLLM or TensorRT-LLM?

Do you have a rough idea of how much performance gain we can get by improving mul_mat_id?

Hm, not sure. Thinking about it now, I feel like mul_mat_id is unlikely to scale well enough due to the increasing amount of data for each new token.

@arch-btw

Contributor

The following eagle3 models should also work out of the box, though they haven’t been tested yet:
Qwen3-235B-A22B-EAGLE3

I tested the Baichuan-M3-235B model that was released yesterday (draft here). It's a finetune of the Qwen3 model above. It quantized successfully but failed due to having a different tensor shape (even in the original weights):

load_tensors: EAGLE3 using d2t mapping (draft_vocab_size = 32000)
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  8192,  8192, got  8192, 16384,     1,     1
llama_model_load_from_file_impl: failed to load model
failed to load EAGLE3 draft model

I haven't looked into how often this happens in finetunes of the same model, especially in the context of eagle3.

However, the shapes of the tensors changing might be something to account for in the implementation (in this case Qwen3). Unless those will be treated as completely new models, in which case please disregard this comment.

@ruixiang63

Contributor Author

The following eagle3 models should also work out of the box, though they haven’t been tested yet:
Qwen3-235B-A22B-EAGLE3

I tested the Baichuan-M3-235B model that was released yesterday (draft here). It's a finetune of the Qwen3 model above. It quantized successfully but failed due to having a different tensor shape (even in the original weights):

load_tensors: EAGLE3 using d2t mapping (draft_vocab_size = 32000)
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  8192,  8192, got  8192, 16384,     1,     1
llama_model_load_from_file_impl: failed to load model
failed to load EAGLE3 draft model

I haven't looked into how often this happens in finetunes of the same model, especially in the context of eagle3.

However, the shapes of the tensors changing might be something to account for in the implementation (in this case Qwen3). Unless those will be treated as completely new models, in which case please disregard this comment.

I spent some time analyzing the Baichuan-EAGLE3 draft model. It has a slightly different architecture compared to the standard Qwen3-EAGLE3 model.
The main difference is in the self_attn.q_proj.weight tensor shape:

  • Standard Qwen3-EAGLE3 : [8192, 8192] — outputs Q only
  • Baichuan-EAGLE3: [16384, 8192] — outputs Q + Gate (2× the size)

This is because Baichuan-EAGLE3 uses an Attention Output Gate mechanism, which is not present in the standard EAGLE3 model. In this variant:

  • The Q projection outputs both query vectors and gate vectors
  • After attention computation, the output is element-wise multiplied by sigmoid(gate) before the output projection

This is essentially a variant architecture of EAGLE3, not just a tensor shape difference. Supporting this variant would require:

  • Detecting the gate mechanism during model loading
  • Modifying the graph construction to split Q/Gate and apply the gating after attention
  • Adding the ggml_sigmoid operation in the attention path
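
The gating step described above can be sketched in a few lines. This toy version assumes the projection output is laid out as two contiguous halves, [Q | gate]; the actual Baichuan-EAGLE3 layout (for example, interleaved per head) has not been verified here. split_q_gate and apply_output_gate are hypothetical names.

```python
import math


def split_q_gate(proj_out):
    """Split a fused projection output into its Q half and its gate half."""
    half = len(proj_out) // 2
    return proj_out[:half], proj_out[half:]


def apply_output_gate(attn_out, gate):
    """Multiply the attention output element-wise by sigmoid(gate)."""
    return [a / (1.0 + math.exp(-g)) for a, g in zip(attn_out, gate)]


q, gate = split_q_gate([1.0, 2.0, 0.0, 0.0])
attn_out = [4.0, 6.0]  # stand-in for the attention result computed from q
gated = apply_output_gate(attn_out, gate)  # sigmoid(0) = 0.5 halves each entry
```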

I would suggest we focus this PR on the standard EAGLE3 model first. Once merged, we can consider adding support for this gated variant in a follow-up PR.

Have you tested the standard Qwen3-EAGLE3 model as well? Does it work well with the current implementation? If yes, could you please share the t/s and speedup you got with eagle3? @arch-btw

@ngxson

ngxson commented Jan 14, 2026

Collaborator

Since EAGLE3 can vary quite a lot for each model, maybe a better way is to consider it as an adapter (the same logic as lora adapter), instead of a dedicated arch?

That way, it can hook into existing models more easily, making internal data like KV state, gate, etc, accessible to the draft model.

@ruixiang63

Contributor Author

Since EAGLE3 can vary quite a lot for each model, maybe a better way is to consider it as an adapter (the same logic as lora adapter), instead of a dedicated arch?

That way, it can hook into existing models more easily, making internal data like KV state, gate, etc, accessible to the draft model.

Good point. However, Eagle3 doesn’t vary much across models. So far, except for Baichuan-Eagle3, all other models essentially use the same Eagle3 architecture. Please refer to the supported models listed in the PR description. I’d say the majority of models share the same Eagle3 architecture, with only a few exceptions. This standalone Eagle3 architecture strategy is also adopted in TensorRT-LLM, vLLM, and SGLang.

@ngxson

ngxson commented Jan 14, 2026

Collaborator

I doubt that. In theory, nothing prevents them or another team from making a variant of eagle3 that gets the state of more than 3 layers, or even reuses the KV state from earlier layers. Possibilities are endless, and that's why it's important to think about the bigger picture instead of just trying to make it work with one single existing architecture.

I think a more model-agnostic approach via an adapter API (or another API based on that form) will likely be the way to go ultimately. It will allow computing both the next token and the draft tokens in one pass, allowing even higher performance than this approach.

@ruixiang63

Contributor Author

I doubt that. In theory, nothing prevents them or another team from making a variant of eagle3 that gets the state of more than 3 layers, or even reuses the KV state from earlier layers. Possibilities are endless, and that's why it's important to think about the bigger picture instead of just trying to make it work with one single existing architecture.

Could you please share some examples or real-world use cases of this? I’d like to better understand how such an approach might be applied in practice.

@ngxson

ngxson commented Jan 14, 2026

Collaborator

The main problem with this PR and #15225 is that both assume that MTP (multi-token prediction) works this way:

  • main LLM generates first tokens + hidden_state from a list of selected layers
  • hidden_state is then forwarded to the speculative model to generate N next tokens

(Note: the dashed line indicates that this may not be the case for all models; some only use the last hidden state)

image

While it does work for the moment, this approach doesn't address the true nature of MTP models. In other words, it is not truly model-agnostic. The main drawback is that you must manually pass the embeddings between the 2 models, so you must know where to get the embeddings, their shapes, etc.

Instead, we should look at MTP models as a normal LLM with multiple output heads:

image

From this point of view, the implementation of the mtp_head does not matter. From the outside world, the model will just output N next tokens given one input token.

In practice, the mtp_head(s) can be:

Now, returning to your question:

Could you please share some examples or real-world use cases of this? I’d like to better understand how such an approach might be applied in practice.

If you already get the idea above, then consider gemma3n: the model has 30 layers, but only 20 layers have KV projections. The last 10 layers reuse the KV from the 20th layer. Some other models also implement this idea, notably GLM and bailing.

The same idea can be applied to MTP layers. Future models may have MTP layers that reuse not just the layer output hidden state, but also the projected KV inside the layer. While there are no models in the wild currently doing that, Baichuan-EAGLE3 (as you showed) is already heading somewhat in this direction by exposing both Q and the gate to the MTP model.

@ngxson

ngxson commented Jan 14, 2026

Collaborator

(I have to split up my comment otherwise it's too long)

My proposal is that we must design this function + the API in a way that it is flexible enough for future models.

For EAGLE3, the MTP model is technically an mtp_head shipped as an extension to the main model (note that the eagle3 repo only contains the extra tensors and does not contain the main LLM), so it can be viewed as an adapter, much like how LoRA works.

For the API, we must avoid leaking the information about the implementation under the hood. The downstream code must only know about how many tokens can be generated, they don't need to know how to generate these extra tokens.

So, an array of API as follow should be enough:

  • llama_model_load_mtp: load the mtp as a llama_adapter_lora or maybe we can add a new struct for it
  • llama_mtp_set_n_draft: set the max number of draft tokens to be generated in the next llama_decode; set to 0 for verification pass
  • llama_mtp_get_n_draft_max: get max number of draft tokens that the MTP head can generate
  • llama_mtp_get_logits_ith: get logits at for i-th token in batch, returns array of float with size n_vocab*n_draft

All the info about embeddings and the draft model must be kept private.
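
To make the last item of the list concrete, here is a small Python sketch of how downstream code might consume a flat buffer of `n_vocab*n_draft` logits. It does not call any llama.cpp function. The helper names `split_draft_logits` and `greedy_draft_tokens` are hypothetical, and the row-major layout (all logits of draft position 0 first, then position 1, and so on) is an assumption that the proposal does not spell out.

```python
from typing import List, Sequence


def split_draft_logits(flat: Sequence[float], n_vocab: int, n_draft: int) -> List[List[float]]:
    """Split a flat buffer of n_vocab * n_draft floats into one row per draft position.

    Assumption: row-major layout, i.e. the n_vocab logits of draft position 0
    come first, then those of draft position 1, and so on.
    """
    if len(flat) != n_vocab * n_draft:
        raise ValueError("buffer size does not match n_vocab * n_draft")
    return [list(flat[d * n_vocab:(d + 1) * n_vocab]) for d in range(n_draft)]


def greedy_draft_tokens(flat: Sequence[float], n_vocab: int, n_draft: int) -> List[int]:
    """Pick the highest-scoring token id at each draft position."""
    rows = split_draft_logits(flat, n_vocab, n_draft)
    return [max(range(n_vocab), key=row.__getitem__) for row in rows]


# Toy buffer: vocabulary of 4 tokens, 2 draft positions.
toy = [0.1, 2.0, -1.0, 0.3,   # draft position 0: token 1 has the largest logit
       0.0, 0.2, 0.1, 5.0]    # draft position 1: token 3 has the largest logit
drafts = greedy_draft_tokens(toy, n_vocab=4, n_draft=2)
```

The point of the sketch is that the caller only needs `n_vocab` and `n_draft`; nothing about embeddings or the draft model leaks through.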

CC @ggerganov maybe this is helpful for you

@ggerganov ggerganov marked this pull request as ready for review June 11, 2026 12:26
@ggerganov ggerganov requested review from a team, CISC and JohannesGaessler as code owners June 11, 2026 12:26
@ggerganov ggerganov marked this pull request as draft June 11, 2026 12:26
@ggerganov ggerganov marked this pull request as ready for review June 11, 2026 14:08
@ggerganov

Member

Should be good to merge after review of the Python code.

Comment thread conversion/llama.py

# target_layers: derived from target model layer count (low/mid/high)
target_num_layers = target_config["num_hidden_layers"]
target_layers = [2, target_num_layers // 2, target_num_layers - 3]

Contributor

Should we also prefer the eagle3 config when eagle_aux_hidden_state_layer_ids is present?

Same question for vocab size when draft_vocab_size exists

Contributor Author

Should we also prefer the eagle3 config when eagle_aux_hidden_state_layer_ids is present?

Good question. First, many Eagle3 checkpoints do not include eagle_aux_hidden_state_layer_ids. Also, different Eagle3 checkpoints interpret layer_ids differently: some expect the IDs to be set before extracting the layers, while others expect them to be set afterward, which can sometimes require adding +1.
To avoid this ambiguity, I decided to compute the values manually based on the original paper and its implementation, rather than relying on the Eagle3 config. This ensures that the target layers are correct without postprocessing and keeps the code logic aligned.

Same question for vocab size when draft_vocab_size exists

Both draft_vocab_size and the target model’s vocab_size are needed when performing the d2t vocab mapping for Eagle3. The target model’s vocab_size serves as an assertion to ensure that the d2t mapping does not go out of vocabulary.
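
The layer selection discussed in this thread can be sketched as a standalone function. The formula is the one from the snippet under review; the function name `eagle3_target_layers` and the strictly-increasing sanity check are additions for illustration and are not claimed to be in the PR.

```python
from typing import List


def eagle3_target_layers(num_hidden_layers: int) -> List[int]:
    """Low/mid/high target layers, derived only from the target model's layer count."""
    layers = [2, num_hidden_layers // 2, num_hidden_layers - 3]
    # Illustrative sanity check (an assumption, not from the PR): the three
    # indices should be distinct and ordered, which rules out very shallow models.
    if not layers[0] < layers[1] < layers[2]:
        raise ValueError(f"too few layers for low/mid/high selection: {num_hidden_layers}")
    return layers


# Example with a 32-layer target model.
layers_32 = eagle3_target_layers(32)
```

Because the result depends only on `num_hidden_layers`, it sidesteps the question of how a given checkpoint interprets `eagle_aux_hidden_state_layer_ids`.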

Contributor

yikes, guess the EAGLE3 rollout has not been smooth 😅

thanks for the clarity! unfortunate but logical :)

Contributor Author

yeah it is. Thanks for the review!

@bartowski1182

Contributor

I notice some of the more recent uploads on huggingface (like https://huggingface.co/nvidia/Kimi-K2.6-Eagle3) don't use any of the listed archs, instead using Eagle3DeepseekV2ForCausalLM

Is there any way we can understand Eagle3 at the start of the arch as an eagle3 draft checkpoint and route it the same way?

@ruixiang63

ruixiang63 commented Jun 11, 2026

Contributor Author

I notice some of the more recent uploads on huggingface (like https://huggingface.co/nvidia/Kimi-K2.6-Eagle3) don't use any of the listed archs, instead using Eagle3DeepseekV2ForCausalLM

Is there any way we can understand Eagle3 at the start of the arch as an eagle3 draft checkpoint and route it the same way?

Oh, thanks for sharing this. I wasn’t aware that we have a new Kimi 2.6 Eagle3 model checkpoint. (The Eagle3 ecosystem is still growing, so it is great that we have it in llama.cpp now.)
The config looks different, and I haven’t had a chance to look into it yet. I’m not sure how they implemented Eagle3 here, maybe they used a decoder layer from DeepSeekV2?
All the Eagle3 checkpoints I’ve seen so far are based on Llama’s decoder layer, which matches the original paper. Maybe this one is different.

I also found another Kimi 2.6 Eagle3 config that seems aligned with our current approach: https://huggingface.co/lightseekorg/kimi-k2.6-eagle3/blob/main/config.json

Comment thread conversion/base.py
Comment thread conversion/base.py
Comment thread conversion/base.py
Comment thread conversion/llama.py
    raise ValueError(f"EAGLE-3 d2t target ids out of range for target vocab size {self.target_vocab_size}")
if np.unique(data).size != data.size:
    raise ValueError("EAGLE-3 d2t contains duplicate target ids")
data_qtype = gguf.GGMLQuantizationType.I64

Member

Future-proofing is nice and all, but n_tokens and token ids are limited to int32_t, what is the original dtype?

@ruixiang63 ruixiang63 Jun 11, 2026

Contributor Author

The original d2t dtype is torch.int64 in the eagle3 checkpoint. That's why I used I64 to preserve that and avoid any accidental truncation after converting it to absolute target ids.
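
The checks in the snippet under review can be illustrated without NumPy. This sketch assumes that the checkpoint stores `d2t` as per-draft-id offsets (so the absolute target id is `draft_id + d2t[draft_id]`), which is what "converting it to absolute target ids" suggests. The function names and the `INT32_MAX` check are hypothetical and only serve to relate the two dtypes discussed here.

```python
from typing import List, Sequence

INT32_MAX = 2**31 - 1


def d2t_to_absolute(d2t_offsets: Sequence[int], target_vocab_size: int) -> List[int]:
    """Convert draft->target offsets to absolute target ids and validate them.

    Assumption: d2t_offsets[i] is the offset to add to draft id i.
    """
    abs_ids = [i + off for i, off in enumerate(d2t_offsets)]
    if any(t < 0 or t >= target_vocab_size for t in abs_ids):
        raise ValueError(f"d2t target ids out of range for target vocab size {target_vocab_size}")
    if len(set(abs_ids)) != len(abs_ids):
        raise ValueError("d2t contains duplicate target ids")
    return abs_ids


def fits_int32(ids: Sequence[int]) -> bool:
    """True if every id could be stored as a non-negative int32 token id."""
    return all(0 <= t <= INT32_MAX for t in ids)


# Toy mapping: draft ids 0, 1, 2 map to target ids 0, 3, 7.
absolute = d2t_to_absolute([0, 2, 5], target_vocab_size=10)
```

Once the range check against the target vocabulary passes, every id is below `target_vocab_size`, so whether the stored tensor is 64-bit or 32-bit only matters for fidelity to the checkpoint dtype, not for the values themselves.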

Comment thread conversion/llama.py
self.origin_hf_arch = hparams.get('architectures', [None])[0]

# Detect eagle3 draft checkpoint by hparams (some models don't use a distinct HF arch name)
if "draft_vocab_size" in self.hparams and self.hparams["num_hidden_layers"] == 1:

Member

Not important right now, but I'm guessing all this will basically be duplicated for every arch supported with very little if any differences? Would be nice if it can be refactored in a reusable way.

Contributor Author

Good point! I kept this local for now because all Eagle3 checkpoints I have encountered so far are based on the Llama decoder (no matter where they come from: RedHat, LMSYS, NVIDIA, etc.), and this PR only targets that path unless we find an Eagle3 checkpoint based on a different architecture (potentially this one, #18039 (comment), but I'm not sure).

If another architecture needs Eagle3 conversion later, this should be the first piece to factor out.
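
As a rough idea of what a reusable detection helper could look like, the sketch below combines the hparams check from the snippet under review with the architecture-name prefix idea raised earlier in this thread. The function name `is_eagle3_draft` is hypothetical, and the prefix rule is only a suggestion from the discussion, not something this PR is known to implement.

```python
from typing import Any, Dict


def is_eagle3_draft(hparams: Dict[str, Any]) -> bool:
    """Heuristically decide whether a Hugging Face config describes an Eagle3 draft checkpoint."""
    archs = hparams.get("architectures") or [None]
    arch = archs[0]
    # Suggested in the thread: treat any arch name starting with "Eagle3" as a draft checkpoint.
    by_name = isinstance(arch, str) and arch.startswith("Eagle3")
    # From the snippet under review: a single-layer model that declares a draft vocabulary.
    by_hparams = "draft_vocab_size" in hparams and hparams.get("num_hidden_layers") == 1
    return by_name or by_hparams


# Example configs (values are made up for illustration).
named = {"architectures": ["Eagle3DeepseekV2ForCausalLM"]}
unnamed = {"architectures": ["LlamaForCausalLM"], "draft_vocab_size": 32000, "num_hidden_layers": 1}
plain = {"architectures": ["LlamaForCausalLM"], "num_hidden_layers": 32}
results = [is_eagle3_draft(cfg) for cfg in (named, unnamed, plain)]
```

Detecting a checkpoint this way says nothing about whether its decoder layer can actually be converted; that remains an open question for non-Llama variants.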

@pwilkin pwilkin left a comment

Member

Okay, would probably prefer the property to live in llama.py since Eagle is the only arch that uses it, but I'll let @CISC decide that one :)

@CISC

CISC commented Jun 11, 2026

Member

Okay, would probably prefer the property to live in llama.py since Eagle is the only arch that uses it, but I'll let @CISC decide that one :)

That will change though, so makes sense that it's in base.

@ruixiang63

Contributor Author

Okay, would probably prefer the property to live in llama.py since Eagle is the only arch that uses it, but I'll let @CISC decide that one :)

Thanks for the approval!
Makes sense. I kept it in ModelBase because target_model_dir is passed through the common converter constructor, even though Eagle3 is the only current consumer.
DFlash would be another consumer in the future :) #22105

Comment thread conversion/llama.py Outdated
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@ruixiang63

Contributor Author

Thanks @bartowski1182 @pwilkin @CISC for the review! I guess this PR is good to merge :) CC @ggerganov

@CISC

CISC commented Jun 11, 2026

Member

Let's see if we win the \r\n lottery today...

Edit: Yay, green: https://github.com/ggml-org/llama.cpp/actions/runs/27377588058/job/80906815068?pr=18039

@ggerganov ggerganov merged commit 88a3927 into ggml-org:master Jun 12, 2026
28 of 29 checks passed
@AbdulrahmanHashem

I cannot find a Qwen3.6 Eagle3 model. If it exists, can someone please link it?

@laurentpayot

I cannot find a Qwen3.6 Eagle3 model. If it exists, can someone please link it?

Btw, direct GGUF files would be greatly appreciated 😅

@sswtodo

sswtodo commented Jun 12, 2026

Thank you for introducing Eagle3!

Here are benchmarks from ruixiang63 -> #18039 (comment)

I ran additional benchmarks. Here are the results on FreeBSD + 7900 XTX:

* [RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3](https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3)
* [gemma-4-26B-A4B-it-GGUF Q4_K_M](https://huggingface.co/ggml-org/gemma-4-26B-A4B-it-GGUF)
* [Assistant IT Q8_0](https://huggingface.co/google/gemma-4-26B-A4B-it-assistant)

reasoning off

| Prompt | Baseline | EAGLE3 (draft_size=1) | Speedup |
|---|---|---|---|
| Write a quicksort algorithm in Python. Write code only. | 120.81 t/s | 128.68 t/s (89.4%) | 1.06x |
| Explain the Pythagorean theorem | 119.30 t/s | 127.06 t/s (84%) | 1.06x |
| Plan a 1 day trip to DC | 115.70 t/s | 103.24 t/s (75.6%) | 0.89x |
| Paste file content llama-context.cpp | 94.87 t/s | 75.77 t/s (41.9%) | 0.79x |

| Prompt | Baseline | Assistant (draft_size=1) | Speedup |
|---|---|---|---|
| Write a quicksort algorithm in Python. Write code only. | 120.81 t/s | 169.12 t/s (94.4%) | 1.39x |
| Explain the Pythagorean theorem | 119.30 t/s | 164.68 t/s (89.9%) | 1.38x |
| Plan a 1 day trip to DC | 115.70 t/s | 146.29 t/s (71.9%) | 1.26x |
| Paste file content llama-context.cpp | 94.87 t/s | 116.51 t/s (73.6%) | 1.23x |

According to the above, the Assistant MTP issue was resolved in PR #24277. However, I’m still seeing a similar performance regression on Eagle3: while working with an agentic tool like claude-code, I’m observing a significant drop in draft acceptance, ranging from 9% to 33%, together with a performance drop.

Advice is appreciated

@fafinet

fafinet commented Jun 12, 2026

It crashes on load with -sm tensor for me using gemma 4 31b and redhat's eagle3 model converted to bf16.

0.57.900.481 I srv    load_model: initializing slots, n_slots = 4
0.58.099.454 I common_speculative_impl_draft_eagle3: adding speculative implementation 'draft-eagle3'
0.58.099.462 I common_speculative_impl_draft_eagle3: - n_max=8, n_min=0, p_min=0.500000
/home/local/downloads/llama.cpp/ggml/src/ggml-backend-meta.cpp:728: GGML_ASSERT(src_ss[0].axis != GGML_BACKEND_SPLIT_AXIS_1) failed
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(+0x18665) [0x7f8670d5f665]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x1df) [0x7f8670d5fa3f]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x11e) [0x7f8670d5fbce]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(+0x41f2f) [0x7f8670d88f2f]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(+0x39a75) [0x7f8670d80a75]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(+0x42cac) [0x7f8670d89cac]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(+0x45420) [0x7f8670d8c420]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(ggml_gallocr_alloc_graph+0x483) [0x7f8670d74c23]
/home/local/downloads/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_alloc_graph+0x111) [0x7f8670d7af11]
/home/local/downloads/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xbd) [0x7f86704d291d]
/home/local/downloads/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x368) [0x7f86704d83d8]
/home/local/downloads/llama.cpp/build/bin/libllama.so.0(llama_decode+0xb) [0x7f86704d9fbb]
/home/local/downloads/llama.cpp/build/bin/libllama-common.so.0(_Z25common_context_can_seq_rmP13llama_context+0xc8) [0x7f86709e3b08]
/home/local/downloads/llama.cpp/build/bin/libllama-server-impl.so(_ZN19server_context_impl10load_modelER13common_params+0xcd4) [0x7f8671572314]
/home/local/downloads/llama.cpp/build/bin/libllama-server-impl.so(_Z12llama_serveriPPc+0x2ece) [0x7f86714c57de]
/lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f8670e35ca8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f8670e35d65]
./.llama-cpp/llama-server(+0x11b1) [0x55c5ab6f21b1]
Aborted

Labels

examples, ggml, hot, model, python, server, testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Support EAGLE3 models for draft model / speculative decoding use cases