Bug / Missing Feature: Gemma 4 E2B/E4B Per-Layer Embeddings (PLE) not implemented in forward graph #22243

@zhang261007

Summary

Gemma 4 E2B/E4B models rely on Per-Layer Embeddings (PLE) as a core architectural feature.
While llama.cpp's GGUF loader correctly reads the PLE-related metadata keys (e.g. gemma4.embedding_length_per_layer_input), the forward computation graph appears to lack the full PLE pipeline. This means E2B/E4B models run without crashing, but the auxiliary per-layer residual signal is never injected into the decoder layers, leading to subtly degraded output quality.

Background & References

Google DeepMind's official transformers implementation recently merged PR #45207, which documents the complete PLE pipeline. Key facts from that PR:

  • PLE is mandatory for E2B/E4B: These models set hidden_size_per_layer_input > 0 (e.g. 256).
    The 31B Dense model sets it to 0 and does not use PLE.
  • Packed embedding weight: embed_tokens_per_layer has shape [vocab_size_per_layer_input, num_hidden_layers * hidden_size_per_layer_input].
  • Two-component pipeline:
    1. Token-identity: input_ids → lookup in embed_tokens_per_layer → reshape to [B, S, num_layers, ple_dim] → scale by sqrt(ple_dim).
    2. Context-aware: inputs_embeds → per_layer_model_projection (Linear, no bias) → scale by 1/sqrt(hidden_size) → reshape to [B, S, num_layers, ple_dim] → RMSNorm(eps=rms_norm_eps).
    3. Combine: (token_identity + context_aware) * (1/sqrt(2)).
  • Per-layer injection: Each decoder layer i receives per_layer_inputs[:, :, i, :] as an auxiliary residual signal.
  • Multimodal fallback: When input_ids are unavailable (image/audio features replace placeholder tokens), the implementation reverses the main embedding to recover input_ids for the PLE lookup, or falls back to context-aware only.
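
The three numbered steps above can be sketched in NumPy as follows. This is an illustrative sketch only, not the transformers code: the function names, the argument order, the [num_layers * ple_dim, hidden_size] layout of the projection weight, and the plain `x * weight` RMSNorm scale convention are all assumptions made for the example.

```python
import numpy as np

def rms_norm(x, weight, eps):
    # RMS-normalize over the last axis, then apply a per-channel scale.
    # Assumption: plain multiplicative weight; the real model may differ.
    mean_sq = np.mean(x * x, axis=-1, keepdims=True)
    return x / np.sqrt(mean_sq + eps) * weight

def per_layer_inputs(input_ids, inputs_embeds, embed_tokens_per_layer,
                     per_layer_model_projection, per_layer_projection_norm,
                     num_layers, ple_dim, eps=1e-6):
    B, S = input_ids.shape
    hidden_size = inputs_embeds.shape[-1]

    # 1. Token-identity: look up packed rows, split per layer, scale by sqrt(ple_dim).
    tok = embed_tokens_per_layer[input_ids]                # [B, S, L * D]
    tok = tok.reshape(B, S, num_layers, ple_dim) * np.sqrt(ple_dim)

    # 2. Context-aware: bias-free projection of the main embeddings,
    #    scaled by 1/sqrt(hidden_size), split per layer, then RMSNorm.
    ctx = inputs_embeds @ per_layer_model_projection.T     # [B, S, L * D]
    ctx = ctx * (1.0 / np.sqrt(hidden_size))
    ctx = ctx.reshape(B, S, num_layers, ple_dim)
    ctx = rms_norm(ctx, per_layer_projection_norm, eps)

    # 3. Combine both components with a 1/sqrt(2) scale.
    return (tok + ctx) * (1.0 / np.sqrt(2.0))              # [B, S, L, D]
```

Under these assumptions, decoder layer i would consume `result[:, :, i, :]`, matching the per-layer injection bullet above.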

Additional Concern: Quantization & MMap Compatibility

PLE weights (per_layer_embed, per_layer_model_projection,
per_layer_projection_norm) are highly sensitive to quantization noise
due to subsequent scalar multiplication and RMSNorm.

Questions:

  1. Does convert_hf_to_gguf.py currently exclude PLE tensors from
    default quantization, keeping them in bf16/f16?
  2. If PLE tensors are quantized in GGUF, does llama.cpp's mmap loader
    handle their dequantization correctly during embedding lookup?
  3. Are there backend limitations (CUDA/Metal/Vulkan) for bf16 embedding
    lookup that would require format conversion at load time?

Reference: mlx-gemma4 project's "PLE-Safe Quantization Strategy"
demonstrates that quantizing PLE paths causes catastrophic output
degradation in Gemma 4 E2B/E4B.
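
To make question 1 concrete, here is a minimal sketch of the kind of name-based exclusion being asked about. It is not the logic of convert_hf_to_gguf.py: the helper `pick_tensor_type`, its parameters, and the type strings are hypothetical, and only the three tensor-name fragments come from this issue.

```python
# Name fragments of the PLE tensors as listed in this issue.
PLE_NAME_FRAGMENTS = (
    "per_layer_embed",
    "per_layer_model_projection",
    "per_layer_projection_norm",
)

def pick_tensor_type(name: str, default_type: str, keep_type: str = "f16") -> str:
    """Hypothetical helper: keep PLE tensors at keep_type, quantize the rest."""
    if any(fragment in name for fragment in PLE_NAME_FRAGMENTS):
        return keep_type
    return default_type

# Example with illustrative tensor names (not read from a real GGUF file).
for tensor_name in ("per_layer_model_projection.weight", "blk.0.attn_q.weight"):
    print(tensor_name, "->", pick_tensor_type(tensor_name, "q4_0"))
```

A rule like this would trade a somewhat larger file for unquantized PLE paths; whether that trade is needed is exactly what the questions above ask.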

Current behavior in llama.cpp

  • llama_model_loader reads gemma4.embedding_length_per_layer_input correctly.
  • The PLE weights (per_layer_embed, per_layer_model_projection, per_layer_projection_norm) are present in the GGUF and loaded.
  • However, there is no evidence in the codebase that equivalents of get_per_layer_inputs() and project_per_layer_inputs() are executed during the forward pass. The residual stream in each decoder layer does not appear to be conditioned on the per-layer signal.

Expected behavior

E2B/E4B models should produce logits consistent with the reference transformers implementation. This requires the full PLE pipeline to be wired into the GGML compute graph.
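
One way to check "consistent with the reference" is to dump logits for the same prompt from both implementations and compare them. The sketch below assumes both dumps are already available as arrays of shape [num_tokens, vocab_size]; the function name `compare_logits` and the two reported metrics are illustrative choices, not an existing llama.cpp or transformers utility.

```python
import numpy as np

def compare_logits(reference, candidate):
    """Compare two logit arrays of identical shape [num_tokens, vocab_size]."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError("logit arrays must have the same shape")

    # Largest element-wise deviation between the two implementations.
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    # Fraction of positions where both pick the same most-likely token.
    top1_agreement = float(
        np.mean(reference.argmax(axis=-1) == candidate.argmax(axis=-1))
    )
    return {"max_abs_diff": max_abs_diff, "top1_agreement": top1_agreement}
```

A missing PLE path would be expected to show up as a large `max_abs_diff` on unquantized weights, where quantization noise alone cannot explain the gap.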

Environment

  • llama.cpp version: master (post b8765)
  • Models: gemma-4-E2B-it, gemma-4-E4B-it (any quantization)
  • Impact: Quality degradation (not a crash), making E2B/E4B unreliable for production use via llama.cpp.

Other frameworks status

| Framework | PLE Status |
|---|---|
| transformers | ✅ Reference implementation |
| vLLM | ✅ Full support (including scale buffers, OOV guards, PP adaptations) |
| MLX (Python) | ✅ Day-1 support |
| TensorRT-LLM | ✅ Listed as supported architecture |
| llama.cpp | ⚠️ Loader only; forward graph incomplete |

Request

  1. Is PLE support on the llama.cpp roadmap for Gemma 4?
  2. If not yet planned, would the maintainers accept a community PR implementing the above pipeline?
  3. Are there any known blockers (e.g. GGML operator limitations, graph builder constraints) that would complicate the per-layer slicing and injection?

The official docstrings from PR #45207 provide sufficient specification to implement this without reverse-engineering.

Related

Operating systems

Windows

GGML backends

CUDA

Hardware

Intel LunarLake / nv 5070

Models

https://huggingface.co/google/gemma-4-31B-it
