Summary
Gemma 4 E2B/E4B models rely on Per-Layer Embeddings (PLE) as a core architectural feature.
While llama.cpp's GGUF loader correctly reads the PLE-related metadata keys (e.g. gemma4.embedding_length_per_layer_input), the forward computation graph appears to lack the full PLE pipeline. This means E2B/E4B models run without crashing, but the auxiliary per-layer residual signal is never injected into the decoder layers, leading to subtly degraded output quality.
Background & References
PR #45207, recently merged into Google DeepMind's official transformers implementation, documents the complete PLE pipeline. Key facts from that PR:
- PLE is mandatory for E2B/E4B: These models set hidden_size_per_layer_input > 0 (e.g. 256). The 31B Dense model sets it to 0 and does not use PLE.
- Packed embedding weight: embed_tokens_per_layer has shape [vocab_size_per_layer_input, num_hidden_layers * hidden_size_per_layer_input].
- Two-component pipeline:
  - Token-identity: input_ids → lookup in embed_tokens_per_layer → reshape to [B, S, num_layers, ple_dim] → scale by sqrt(ple_dim).
  - Context-aware: inputs_embeds → per_layer_model_projection (Linear, no bias) → scale by 1/sqrt(hidden_size) → reshape to [B, S, num_layers, ple_dim] → RMSNorm(eps=rms_norm_eps).
  - Combine: (token_identity + context_aware) * (1/sqrt(2)).
- Per-layer injection: Each decoder layer i receives per_layer_inputs[:, :, i, :] as an auxiliary residual signal.
- Multimodal fallback: When input_ids are unavailable (image/audio features replace placeholder tokens), the implementation reverses the main embedding to recover input_ids for the PLE lookup, or falls back to context-aware only.
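The pipeline above can be sketched for a single token in dependency-free Python. This is only an illustration of the steps listed in the PR summary, not the reference code: the function names (`rms_norm`, `chunk`, `per_layer_inputs`) are hypothetical, the projection weight is assumed to be stored as [num_layers * ple_dim, hidden_size], and RMSNorm is assumed to apply a plain multiplicative weight (the reference implementation may parameterize the weight differently).

```python
import math

def rms_norm(x, weight, eps):
    """RMSNorm over one vector: x / sqrt(mean(x^2) + eps), times a weight.
    Assumption: plain multiplicative weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def chunk(row, num_layers, ple_dim):
    """Split a packed [num_layers * ple_dim] row into num_layers vectors."""
    return [row[i * ple_dim:(i + 1) * ple_dim] for i in range(num_layers)]

def per_layer_inputs(token_id, input_embed, embed_tokens_per_layer,
                     proj_weight, norm_weight, num_layers, ple_dim, eps=1e-6):
    """Return one ple_dim-sized vector per decoder layer for a single token."""
    hidden_size = len(input_embed)

    # Token-identity: look up the packed row, split per layer, scale by sqrt(ple_dim).
    packed_row = embed_tokens_per_layer[token_id]
    token_identity = [[v * math.sqrt(ple_dim) for v in vec]
                      for vec in chunk(packed_row, num_layers, ple_dim)]

    # Context-aware: bias-free linear projection, scale by 1/sqrt(hidden_size),
    # split per layer, then RMSNorm each per-layer vector.
    projected = [sum(w * x for w, x in zip(row, input_embed)) / math.sqrt(hidden_size)
                 for row in proj_weight]
    context_aware = [rms_norm(vec, norm_weight, eps)
                     for vec in chunk(projected, num_layers, ple_dim)]

    # Combine: (token_identity + context_aware) * (1/sqrt(2)).
    scale = 1.0 / math.sqrt(2.0)
    return [[(t + c) * scale for t, c in zip(tv, cv)]
            for tv, cv in zip(token_identity, context_aware)]

# Toy dimensions: vocab=3, hidden_size=4, num_layers=2, ple_dim=2.
table = [[0.0] * 4, [1.0, 2.0, 3.0, 4.0], [0.0] * 4]
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
ple = per_layer_inputs(1, [1.0, 2.0, 3.0, 4.0], table, identity, [1.0, 1.0], 2, 2)
layer_0_input = ple[0]  # what decoder layer 0 would receive
```

Indexing the result by layer corresponds to the per_layer_inputs[:, :, i, :] slice; in a batched implementation the same arithmetic runs over every [B, S] position.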
Additional Concern: Quantization & MMap Compatibility
PLE weights (per_layer_embed, per_layer_model_projection, per_layer_projection_norm) are highly sensitive to quantization noise due to subsequent scalar multiplication and RMSNorm.
Questions:
- Does convert_hf_to_gguf.py currently exclude PLE tensors from default quantization, keeping them in bf16/f16?
- If PLE tensors are quantized in GGUF, does llama.cpp's mmap loader handle their dequantization correctly during embedding lookup?
- Are there backend limitations (CUDA/Metal/Vulkan) for bf16 embedding lookup that would require format conversion at load time?
Reference: The mlx-gemma4 project's "PLE-Safe Quantization Strategy" demonstrates that quantizing PLE paths causes catastrophic output degradation in Gemma 4 E2B/E4B.
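To help answer the first question for a given file, the following sketch lists which PLE tensors in a GGUF are stored in a quantized type. It assumes the `gguf` Python package (gguf-py) is installed for the file-reading helper, and that PLE tensor names contain the substring "per_layer"; the actual tensor names in a Gemma 4 GGUF may differ, and the sample names below are hypothetical.

```python
def find_quantized_ple_tensors(tensors, keep_types=("F32", "F16", "BF16")):
    """Return (name, type) pairs for PLE tensors stored in a quantized type.

    `tensors` is an iterable of (tensor_name, type_name) pairs.
    Assumption: PLE tensors can be identified by "per_layer" in their name.
    """
    return [(name, ttype) for name, ttype in tensors
            if "per_layer" in name and ttype not in keep_types]

def list_gguf_tensors(path):
    """Read (name, type) pairs from a GGUF file using gguf-py's GGUFReader."""
    from gguf import GGUFReader  # imported lazily so the filter works without it
    reader = GGUFReader(path)
    return [(t.name, t.tensor_type.name) for t in reader.tensors]

# Hypothetical tensor listing, for illustration only.
sample = [
    ("per_layer_embed.weight", "Q4_K"),
    ("per_layer_model_projection.weight", "BF16"),
    ("per_layer_projection_norm.weight", "F32"),
    ("blk.0.attn_q.weight", "Q4_K"),
]
quantized_ple = find_quantized_ple_tensors(sample)
```

On a real file, `find_quantized_ple_tensors(list_gguf_tensors("model.gguf"))` would return the PLE tensors that were not kept in a floating-point type, where "model.gguf" stands in for the path to the file under test.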
Current behavior in llama.cpp
- llama_model_loader reads gemma4.embedding_length_per_layer_input correctly.
- The PLE weights (per_layer_embed, per_layer_model_projection, per_layer_projection_norm) are present in the GGUF and loaded.
- However, there is no evidence in the codebase that equivalents of get_per_layer_inputs() and project_per_layer_inputs() are executed during the forward pass. The residual stream in each decoder layer does not appear to be conditioned by the per-layer signal.
Expected behavior
E2B/E4B models should produce logits consistent with the reference transformers implementation. This requires the full PLE pipeline to be wired into the GGML compute graph.
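One way to make "consistent logits" measurable is to compare the last-token logits from both implementations for the same prompt. The helper below is a minimal sketch; `compare_logits` is a hypothetical name, and how the two logit vectors are exported from transformers and llama.cpp is left to the reader.

```python
def compare_logits(reference, candidate):
    """Compare two logit vectors over the same vocabulary.

    Returns the largest absolute difference and whether the argmax
    (greedy next token) agrees between the two vectors.
    """
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    max_abs_diff = max(abs(r - c) for r, c in zip(reference, candidate))
    ref_top = max(range(len(reference)), key=reference.__getitem__)
    cand_top = max(range(len(candidate)), key=candidate.__getitem__)
    return {"max_abs_diff": max_abs_diff, "top1_match": ref_top == cand_top}

# Toy values only; real vectors would have one entry per vocabulary token.
report = compare_logits([0.1, 2.0, -1.0], [0.2, 1.5, -1.0])
```

A large `max_abs_diff` on unquantized weights, or frequent top-1 mismatches across prompts, would point to a graph-level difference such as a missing PLE path rather than to quantization noise.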
Environment
- llama.cpp version: master (post b8765)
- Models: gemma-4-E2B-it, gemma-4-E4B-it (any quantization)
- Impact: Quality degradation (not a crash), making E2B/E4B unreliable for production use via llama.cpp.
Other frameworks status
| Framework | PLE Status |
|---|---|
| transformers | ✅ Reference implementation |
| vLLM | ✅ Full support (including scale buffers, OOV guards, PP adaptations) |
| MLX (Python) | ✅ Day-1 support |
| TensorRT-LLM | ✅ Listed as supported architecture |
| llama.cpp | ⚠️ Loader only; forward graph incomplete |
Request
- Is PLE support on the llama.cpp roadmap for Gemma 4?
- If not yet planned, would the maintainers accept a community PR implementing the above pipeline?
- Are there any known blockers (e.g. GGML operator limitations, graph builder constraints) that would complicate the per-layer slicing and injection?
The official docstrings from PR #45207 provide sufficient specification to implement this without reverse-engineering.
Operating systems
Windows
GGML backends
CUDA
Hardware
Intel Lunar Lake / NVIDIA 5070
Models
https://huggingface.co/google/gemma-4-31B-it
First Bad Commit
No response