Bug / Missing Feature: Gemma 4 E2B/E4B Per-Layer Embeddings (PLE) not implemented in forward graph

### Summary

Gemma 4 E2B/E4B models rely on **Per-Layer Embeddings (PLE)** as a core architectural feature.  
While llama.cpp's GGUF loader correctly reads the PLE-related metadata keys (e.g. `gemma4.embedding_length_per_layer_input`), the **forward computation graph appears to lack the full PLE pipeline**. This means E2B/E4B models run without crashing, but the auxiliary per-layer residual signal is never injected into the decoder layers, leading to subtly degraded output quality.

### Background & References

The Hugging Face `transformers` repository recently merged PR [#45207](https://github.com/huggingface/transformers/pull/45207), which documents the complete PLE pipeline of the reference Gemma 4 implementation. Key facts from that PR:

- **PLE is mandatory for E2B/E4B**: These models set `hidden_size_per_layer_input > 0` (e.g. 256).  
  The 31B Dense model sets it to `0` and does **not** use PLE.
- **Packed embedding weight**: `embed_tokens_per_layer` has shape `[vocab_size_per_layer_input, num_hidden_layers * hidden_size_per_layer_input]`.
- **Two-component pipeline, followed by a combine step**:
  1. **Token-identity**: `input_ids` → lookup in `embed_tokens_per_layer` → reshape to `[B, S, num_layers, ple_dim]` → scale by `sqrt(ple_dim)`.
  2. **Context-aware**: `inputs_embeds` → `per_layer_model_projection` (Linear, no bias) → scale by `1/sqrt(hidden_size)` → reshape to `[B, S, num_layers, ple_dim]` → `RMSNorm(eps=rms_norm_eps)`.
  3. **Combine**: `(token_identity + context_aware) * (1/sqrt(2))`.
- **Per-layer injection**: Each decoder layer `i` receives `per_layer_inputs[:, :, i, :]` as an auxiliary residual signal.
- **Multimodal fallback**: When `input_ids` are unavailable (image/audio features replace placeholder tokens), the implementation reverses the main embedding to recover `input_ids` for the PLE lookup, or falls back to context-aware only.
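
For reference, the three steps listed above can be sketched in NumPy. This is only an illustration of the arithmetic as described in this issue, not the `transformers` or llama.cpp code. The function names, the argument order, the `[num_layers * ple_dim, hidden_size]` layout of the projection weight, and the plain multiplicative RMSNorm weight are all assumptions made for the sketch; the multimodal fallback is not covered.

```python
import numpy as np

def rms_norm(x, weight, eps):
    # RMSNorm over the last axis (ple_dim).
    # Assumption: a plain multiplicative weight; the real model may differ.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def per_layer_inputs(input_ids, inputs_embeds, embed_tokens_per_layer,
                     per_layer_model_projection, norm_weight,
                     num_layers, ple_dim, eps=1e-6):
    batch, seq = input_ids.shape
    hidden_size = inputs_embeds.shape[-1]

    # 1. Token-identity: lookup -> reshape -> scale by sqrt(ple_dim).
    token_identity = embed_tokens_per_layer[input_ids]      # [B, S, L * ple_dim]
    token_identity = token_identity.reshape(batch, seq, num_layers, ple_dim)
    token_identity = token_identity * np.sqrt(ple_dim)

    # 2. Context-aware: bias-free linear -> scale by 1/sqrt(hidden_size)
    #    -> reshape -> RMSNorm.
    context_aware = inputs_embeds @ per_layer_model_projection.T
    context_aware = context_aware / np.sqrt(hidden_size)
    context_aware = context_aware.reshape(batch, seq, num_layers, ple_dim)
    context_aware = rms_norm(context_aware, norm_weight, eps)

    # 3. Combine the two components.
    return (token_identity + context_aware) / np.sqrt(2.0)

# Tiny random example (sizes are arbitrary, not real model dimensions).
rng = np.random.default_rng(0)
B, S, L, D, H, V = 2, 3, 4, 8, 16, 32
ids = rng.integers(0, V, size=(B, S))
embeds = rng.standard_normal((B, S, H))
table = rng.standard_normal((V, L * D))
proj = rng.standard_normal((L * D, H))

ple = per_layer_inputs(ids, embeds, table, proj, np.ones(D), L, D)
layer_2_signal = ple[:, :, 2, :]   # the slice decoder layer 2 would receive
print(ple.shape, layer_2_signal.shape)
```

Because the packed row is reshaped to `[num_layers, ple_dim]`, the slice for layer `i` is the contiguous range `[i * ple_dim, (i + 1) * ple_dim)` of each row, which is what a graph builder would need to view per layer.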

### Additional Concern: Quantization & MMap Compatibility

PLE weights (`per_layer_embed`, `per_layer_model_projection`, 
`per_layer_projection_norm`) are highly sensitive to quantization noise 
due to subsequent scalar multiplication and RMSNorm. 

Questions:
1. Does `convert_hf_to_gguf.py` currently exclude PLE tensors from 
   default quantization, keeping them in bf16/f16?
2. If PLE tensors are quantized in GGUF, does llama.cpp's mmap loader 
   handle their dequantization correctly during embedding lookup?
3. Are there backend limitations (CUDA/Metal/Vulkan) for bf16 embedding 
   lookup that would require format conversion at load time?
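
To make question 1 concrete, here is a small sketch of the check I have in mind: given `(tensor name, type name)` pairs from a GGUF file, flag PLE tensors that are stored in a quantized type. The pairs would have to be collected separately (for example by inspecting the file with the `gguf` Python package; that step is not shown). The `per_layer` substring and the example tensor names are taken from the names used in this issue and are assumptions; the actual GGUF tensor names may differ.

```python
# Hypothetical name hint, based on the tensor names quoted in this issue.
PLE_NAME_HINTS = ("per_layer",)
# Type names treated as "not quantized" for this check.
UNQUANTIZED_TYPES = {"F32", "F16", "BF16"}

def quantized_ple_tensors(tensors):
    """Return the (name, type) pairs that look like PLE tensors
    and are stored in a quantized type."""
    return [
        (name, type_name)
        for name, type_name in tensors
        if any(hint in name for hint in PLE_NAME_HINTS)
        and type_name.upper() not in UNQUANTIZED_TYPES
    ]

# Illustrative input only; not read from a real file.
example = [
    ("token_embd.weight", "Q4_K"),
    ("per_layer_embed.weight", "Q4_K"),
    ("per_layer_model_projection.weight", "BF16"),
    ("per_layer_projection_norm.weight", "F32"),
]
flagged = quantized_ple_tensors(example)
print(flagged)
```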

Reference: mlx-gemma4 project's "PLE-Safe Quantization Strategy" 
demonstrates that quantizing PLE paths causes catastrophic output 
degradation in Gemma 4 E2B/E4B.

### Current behavior in llama.cpp

- `llama_model_loader` reads `gemma4.embedding_length_per_layer_input` correctly.
- The PLE weights (`per_layer_embed`, `per_layer_model_projection`, `per_layer_projection_norm`) are present in the GGUF and loaded.
- **However**, there is no evidence in the codebase that equivalents of `get_per_layer_inputs()` and `project_per_layer_inputs()` are executed during the forward pass. The residual stream in each decoder layer does not appear to be conditioned on the per-layer signal.

### Expected behavior

E2B/E4B models should produce logits consistent with the reference `transformers` implementation. This requires the full PLE pipeline to be wired into the GGML compute graph.
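
One way "consistent" could be checked is to dump logits for the same prompt from both implementations and compare them. The sketch below assumes the two logit arrays have already been obtained and have the same `[tokens, vocab]` shape; `compare_logits` and the choice of metrics are my own suggestion, not an existing llama.cpp or `transformers` utility.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def compare_logits(reference, candidate):
    """Compare two [tokens, vocab] logit arrays."""
    ref_lp = log_softmax(reference.astype(np.float64))
    cand_lp = log_softmax(candidate.astype(np.float64))
    # KL(reference || candidate) per token position.
    kl = (np.exp(ref_lp) * (ref_lp - cand_lp)).sum(axis=-1)
    return {
        "max_abs_diff": float(np.abs(reference - candidate).max()),
        "mean_kl": float(kl.mean()),
        "top1_agreement": float(
            (reference.argmax(axis=-1) == candidate.argmax(axis=-1)).mean()
        ),
    }

# Synthetic example: a reference and a slightly perturbed copy.
rng = np.random.default_rng(0)
ref = rng.standard_normal((5, 100))
cand = ref + 0.01 * rng.standard_normal((5, 100))
print(compare_logits(ref, cand))
```

A missing PLE path would be expected to show up as a KL divergence and top-1 disagreement well above what quantization alone produces, though the exact thresholds would need to be established empirically.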

### Environment

- llama.cpp version: master (post b8765)
- Models: `gemma-4-E2B-it`, `gemma-4-E4B-it` (any quantization)
- Impact: Quality degradation (not a crash), making E2B/E4B unreliable for production use via llama.cpp.

### Other frameworks status

| Framework | PLE Status |
|-----------|------------|
| **transformers** | ✅ Reference implementation |
| **vLLM** | ✅ Full support (including scale buffers, OOV guards, PP adaptations) |
| **MLX (Python)** | ✅ Day-1 support |
| **TensorRT-LLM** | ✅ Listed as supported architecture |
| **llama.cpp** | ⚠️ Loader only; forward graph incomplete |

### Request

1. Is PLE support on the llama.cpp roadmap for Gemma 4?
2. If not yet planned, would the maintainers accept a community PR implementing the above pipeline?
3. Are there any known blockers (e.g. GGML operator limitations, graph builder constraints) that would complicate the per-layer slicing and injection?

The official docstrings from PR #45207 provide sufficient specification to implement this without reverse-engineering.

### Related

- huggingface/transformers#45206 — Original issue noting PLE was underdocumented.
- huggingface/transformers#45207 — Merged PR adding full PLE docstrings and pipeline documentation.

### Operating systems

Windows

### GGML backends

CUDA

### Hardware

Intel Lunar Lake / NVIDIA 5070

### Models

https://huggingface.co/google/gemma-4-31B-it

### Problem description & steps to reproduce

Same as the Summary above.

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>


```console

```
</details>
