convert: add eagle2 draft arch #13908
@ModelBase.register("Eagle2DraftForCausalLM")
class Eagle2DraftModel(TextModel):
Since this is a Qwen2 model and it has no distinguishing conversion features, make it inherit from Qwen2Model and leave out everything but model_arch.
Fixed in the latest commit: the class now inherits from Qwen2Model and overrides only model_arch.
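For readers who have not seen the pattern, the change requested above can be sketched in isolation. This is a self-contained illustration, not the code from this PR: `ModelBase`, `Qwen2Model`, and `MODEL_ARCH` below are simplified stand-ins for the classes in convert_hf_to_gguf.py, and the enum member `EAGLE2_DRAFT` is a hypothetical name.

```python
from enum import IntEnum, auto


class MODEL_ARCH(IntEnum):
    # Stand-in enum; EAGLE2_DRAFT is a hypothetical member name.
    QWEN2 = auto()
    EAGLE2_DRAFT = auto()


class ModelBase:
    # Simplified stand-in for the converter's registry of HF architecture names.
    _registry = {}

    @classmethod
    def register(cls, *names):
        def wrap(model_cls):
            for name in names:
                cls._registry[name] = model_cls
            return model_cls
        return wrap


class Qwen2Model(ModelBase):
    # Stand-in for the existing Qwen2 converter class.
    model_arch = MODEL_ARCH.QWEN2

    def set_vocab(self):
        return "qwen2 vocab handling"


@ModelBase.register("Eagle2DraftForCausalLM")
class Eagle2DraftModel(Qwen2Model):
    # Only the architecture differs; everything else is inherited from Qwen2Model.
    model_arch = MODEL_ARCH.EAGLE2_DRAFT
```

The subclass carries no conversion logic of its own, so any later fix to the Qwen2 conversion path would apply to the draft model as well.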
# eagle2 draft model
MODEL_TENSOR.FC: (
    "model.fc",
I'm not sure about this name (also it appears to be just fc in your GGUF), what is the purpose of this tensor?
The EAGLE-2 paper doesn't express this mechanism particularly clearly, but you can refer to Figure 6 in the EAGLE-1 paper at https://arxiv.org/html/2401.15077v3.
The input to the FC layer is the concatenation of the current hidden_states and the hidden_states passed from the previous inference step, which gives a tensor of dimension 2 * config.hidden_size. The FC layer then maps this tensor back to config.hidden_size dimensions, as shown in the code at https://github.com/SafeAILab/EAGLE/blob/main/eagle/model/cnets1.py#L524.
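The shape bookkeeping described above can be sketched in plain Python. This is a stand-in for a linear layer, not the code from the PR or from the EAGLE repository: the function name `fc_fuse`, the toy weights, and the presence of a bias term are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def fc_fuse(current: Vector, previous: Vector,
            weight: List[Vector], bias: Vector) -> Vector:
    """Concatenate two hidden_size vectors and project back to hidden_size."""
    if len(current) != len(previous):
        raise ValueError("both inputs must have hidden_size entries")
    x = current + previous  # list concatenation: length 2 * hidden_size
    # weight has hidden_size rows, each with 2 * hidden_size columns
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


hidden_size = 3
# Toy weight [I | I]: each output entry is the sum of the matching input entries.
weight = [[1.0 if col % hidden_size == row else 0.0
           for col in range(2 * hidden_size)]
          for row in range(hidden_size)]
bias = [0.0] * hidden_size

fused = fc_fuse([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], weight, bias)
```

With this toy weight, `fused` has `hidden_size` entries again, which is the property the review comment asks about: the tensor exists only to bring the concatenated input back to the width the decoder layer expects.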
e0714aa to ec22e4b
you know, we are at a weird stage where eagle2 got deprecated before the code got even merged into llama.cpp.


EAGLE-2 Speculative Decoding Support for llama.cpp - Phase 1 Submission
Overview
This PR introduces EAGLE-2 (Extrapolation Algorithm for Greater Language-model Efficiency) speculative decoding support for llama.cpp. EAGLE-2 is an advanced speculative decoding technique that uses a smaller draft model to accelerate inference of larger target models.
Paper: EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Official Repository: https://github.com/SafeAILab/EAGLE
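As background for reviewers, the draft-then-verify idea behind speculative decoding can be sketched as follows. This is a deliberately simplified greedy, single-chain version under stated assumptions: EAGLE-2 itself uses dynamic draft trees, a real implementation verifies all drafted tokens in one batched target pass, and the function and toy models here are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, add one target token."""
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: take the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every drafted token matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    # Agrees with the target except after token 3.
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


step = speculative_step(toy_target, toy_draft, [0], k=4)
```

The accepted tokens always equal what the target would have produced greedily on its own, so the draft model affects speed but not the output.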
Implementation Status
✅ EAGLE-2 is fully implemented and functional - complete with model conversion, loading, inference, and speculative decoding algorithm.
For maintainability and review efficiency, we’re submitting the implementation in multiple focused phases rather than as one large PR.
Phase 1: Model Conversion Infrastructure
This submission focuses on the model conversion component - enabling EAGLE-2 draft models to be converted from safetensors to GGUF format.
Why Phased Submission?
We've chosen a phased submission approach for the following reasons:
Submission Roadmap
All phases are already implemented and tested - this roadmap represents submission phases only.
Current Submission Features
Enhanced Model Conversion
convert_hf_to_gguf.py
Model Size Optimization
EAGLE-2 draft models offer significant size advantages:
Draft models achieve extreme compression as they extract and optimize single decoder layers from the original model.
Demonstrated Performance
The complete EAGLE-2 implementation was tested on an NVIDIA RTX 4080 with a small number of samples from the ShareGPT dataset:
Qwen2-7B-Instruct + EAGLE-Qwen2-1.4B Draft Model
Performance Notes:
Technical Implementation
Architecture Support
convert_hf_to_gguf.py with EAGLE-2 model detection
Memory Efficiency
Testing Resources
Pre-converted Models
For immediate testing and validation:
EAGLE-Qwen2 Draft Model (F16):
Testing the Conversion
# Convert EAGLE-2 draft model to GGUF
python convert_hf_to_gguf.py /path/to/eagle-draft-model --outfile eagle-draft.gguf
Current Implementation Characteristics
Operational Parameters
Architecture Design
Future Submissions
The remaining implementation components will be submitted in subsequent PRs:
Acknowledgments
This Phase 1 submission provides the model conversion foundation for EAGLE-2 speculative decoding in llama.cpp. The complete implementation demonstrates significant inference acceleration while maintaining compatibility with the llama.cpp ecosystem.