|
| 1 | +# Spike — Phi-3 / Phi-3.5 architecture support |
| 2 | + |
| 3 | +**Date**: 2026-04-12 |
| 4 | +**Driver**: External user feedback (`docs/feedback/2026-04-12_0900.md`, item 2.6) |
| 5 | +**Status**: Investigation complete; implementation gated on having a real GGUF to validate against |
| 6 | +**Recommendation**: do NOT merge a fix without an end-to-end validation run |
| 7 | + |
| 8 | +## Why Phi-3 matters |
| 9 | + |
| 10 | +Phi-3.5-mini is the highest-value model NOT supported by quant.cpp: |
| 11 | + |
| 12 | +- **vocab 32K** — smaller than SmolLM2 (49K), Llama-3.2-1B (128K), Gemma (256K) |
| 13 | +- **3.8B params** — bigger than SmolLM2-1.7B but the small vocab keeps lm_head fast |
| 14 | +- the tester estimated `~94 tok/s` (`60 tokens / 0.85 s`) before realizing the inference was producing garbage — that number reflects what the matmul kernels can do; only the attention path is broken |
| 15 | + |
| 16 | +If we get this working, Phi-3.5-mini becomes the new "best speed/quality" recommendation, ahead of SmolLM2-1.7B. |
| 17 | + |
| 18 | +## Current state |
| 19 | + |
| 20 | +`tq_load_gguf` (in `quant.h`, lines 11640-11680) looks for these tensor names per layer: |
| 21 | + |
| 22 | +``` |
| 23 | +blk.N.attn_q.weight ← required to mark layer as self_attn |
| 24 | +blk.N.attn_k.weight |
| 25 | +blk.N.attn_v.weight |
| 26 | +blk.N.attn_output.weight |
| 27 | +``` |
| 28 | + |
| 29 | +When loading a Phi-3 GGUF, none of these exist — Phi-3 ships fused QKV. Phi-3's tensors (in llama.cpp's GGUF naming convention) are: |
| 30 | + |
| 31 | +``` |
| 32 | +blk.N.attn_qkv.weight ← shape [3 * hidden_dim, hidden_dim], fused |
| 33 | +blk.N.attn_output.weight |
| 34 | +blk.N.ffn_up.weight ← may also be fused as ffn_up_gate, depending on converter |
| 35 | +blk.N.ffn_down.weight |
| 36 | +``` |
| 37 | + |
| 38 | +Result: `is_attn_layer = 0` for every layer, `n_attn_layers = 0`, the new hard-fail check in P0-B catches it and returns NULL with a clear error. No more garbage tokens — but no working inference either. |
| 39 | + |
| 40 | +## Two implementation strategies |
| 41 | + |
| 42 | +### Option A — Loader splits at load time |
| 43 | + |
| 44 | +After detecting `attn_qkv`, dequantize the fused tensor, slice along the output dimension into three `[hidden_dim, hidden_dim]` views, re-quantize each as a separate Q4_K (or whichever type the GGUF used), and store them in `gguf_wq`/`gguf_wk`/`gguf_wv`. |
| 45 | + |
| 46 | +**Pros**: zero forward-path changes, drops into existing `tq_matmul_gguf` calls. |
| 47 | +**Cons**: |
| 48 | +1. Doubles RAM during load (need both fused + split versions) |
| 49 | +2. Re-quantization is **lossy** — running the original model through Q4_K → FP32 → Q4_K introduces measurable error |
| 50 | +3. Won't work for tensor types we don't have a quantizer for (we'd need a quantizer for every supported GGUF type) |
| 51 | +4. Slow at load |
| 52 | + |
| 53 | +### Option B — Forward path dispatches fused matmul (RECOMMENDED) |
| 54 | + |
| 55 | +Add a new field `gguf_wqkv` (data + type) to `tq_layer_weights_t`. Loader sets it from `blk.N.attn_qkv.weight` directly. Forward path checks: if `gguf_wqkv` is set, do one big matmul into a temp buffer of size `3 * hidden_dim`, then split into the existing `s->q`, `s->k`, `s->v` outputs. |
| 56 | + |
| 57 | +**Pros**: |
| 58 | +1. No re-quantization, no precision loss |
| 59 | +2. No extra load-time work |
| 60 | +3. Works with any GGUF type we already support in `tq_matmul_gguf` |
| 61 | +4. Single big matmul is faster than 3 smaller ones (better cache reuse) |
| 62 | + |
| 63 | +**Cons**: |
| 64 | +1. Need a temp buffer for the fused output |
| 65 | +2. New branch in the forward path (small) |
| 66 | +3. Need to pass `q_dim`, `k_dim`, `v_dim` so the split knows where K starts and V starts (Phi-3 may not use GQA, but we can't assume) |
| 67 | + |
| 68 | +`tq_matmul_gguf` already accepts `(weight, type, out_dim, in_dim)` — it doesn't care whether the underlying tensor is fused or not. We can call it once with `out_dim = q_dim + k_dim + v_dim`. |
| 69 | + |
| 70 | +## Inspection results (2026-04-12) |
| 71 | + |
| 72 | +Used `tools/gguf_inspect.c` against `bartowski/Phi-3.5-mini-instruct-Q4_K_M.gguf` (2.39 GB). Findings: |
| 73 | + |
| 74 | +### Per-layer tensors (32 layers, 6 tensors each) |
| 75 | + |
| 76 | +``` |
| 77 | +blk.N.attn_norm.weight F32 [3072] |
| 78 | +blk.N.attn_qkv.weight Q5_K [3072, 9216] ← FUSED QKV (3 * 3072) |
| 79 | +blk.N.attn_output.weight Q4_K [3072, 3072] |
| 80 | +blk.N.ffn_norm.weight F32 [3072] |
| 81 | +blk.N.ffn_up.weight Q4_K [3072, 16384] ← FUSED gate+up (2 * 8192) |
| 82 | +blk.N.ffn_down.weight Q6_K [8192, 3072] |
| 83 | +``` |
| 84 | + |
| 85 | +### Global tensors |
| 86 | + |
| 87 | +``` |
| 88 | +token_embd.weight Q4_K [3072, 32064] |
| 89 | +output.weight Q6_K [3072, 32064] |
| 90 | +output_norm.weight F32 [3072] |
| 91 | +rope_factors_long.weight F32 [48] ← LongRoPE |
| 92 | +rope_factors_short.weight F32 [48] ← LongRoPE |
| 93 | +``` |
| 94 | + |
| 95 | +### Metadata |
| 96 | + |
| 97 | +- arch: `phi3` |
| 98 | +- embedding_length: 3072 (hidden_dim) |
| 99 | +- block_count: 32 |
| 100 | +- head_count: 32 |
| 101 | +- head_count_kv: 32 (NO GQA) |
| 102 | +- rope.dimension_count: 96 (head_dim per head) |
| 103 | +- rope.freq_base: 10000 |
| 104 | +- rope.scaling.original_context_length: 4096 (LongRoPE switch point) |
| 105 | +- rope.scaling.attn_factor: 1.19024 (Q/K magnitude scaling for long context) |
| 106 | +- context_length: 131072 |
| 107 | +- feed_forward_length: 8192 |
| 108 | +- vocab_size: 32064 |
| 109 | +- bos_token_id: 1, eos_token_id: 32000 |
| 110 | + |
| 111 | +### Conclusions |
| 112 | + |
| 113 | +1. **Fused QKV** confirmed. Layout `[Q | K | V]` along output axis. Each section is `hidden_dim = 3072` floats. Total `9216 = 3 * 3072`. |
| 114 | +2. **Fused FFN** ALSO confirmed. `ffn_up.weight` is `[hidden, 2*ff]` not `[hidden, ff]`. Layout `[?, ?]` — order TBD by validation, but llama.cpp's reference loads as `[gate, up]` chunked from this single tensor. |
| 115 | +3. **LongRoPE present**: separate `rope_factors_short` and `rope_factors_long` tables of size 48 = head_dim/2. Used to rescale per-frequency RoPE rotations for sequences past the 4096-token original context. |
| 116 | +4. **No special tokens for ChatML**. Phi-3 uses `<|user|>`, `<|assistant|>`, `<|end|>` (text strings, not BPE special tokens). Chat template differs from Llama-3 / ChatML. |
| 117 | +5. **Vocab 32K** confirms the speed advantage — `lm_head` matmul is `3072 × 32064` vs Llama-3.2-1B's `2048 × 128256`. About 2.7× smaller per-token cost. |
| 118 | + |
| 119 | +## What's still unknown (resolved by trial) |
| 120 | + |
| 121 | +I need a real Phi-3 GGUF to verify: |
| 122 | + |
| 123 | +1. **Exact tensor names**. llama.cpp's GGUF converter has changed conventions over the years. The fused tensor might be named: |
| 124 | + - `blk.N.attn_qkv.weight` |
| 125 | + - `blk.N.attn_qkv_proj.weight` |
| 126 | + - `blk.N.qkv.weight` |
| 127 | + - …and there may be a separate bias tensor |
| 128 | + |
| 129 | +2. **Shape ordering**. Is the fused tensor `[Q | K | V]` along axis 0, or some other layout? Phi-3 has `n_heads = 32` and `n_kv_heads = 32` (no GQA in the 3.8B variant), so all three sub-tensors are the same size — but I want to verify. |
| 130 | + |
| 131 | +3. **FFN fusion**. Does this Phi-3 GGUF use `ffn_up` + `ffn_gate` as separate tensors (llama-style) or `ffn_up_gate` (Phi-style fused)? If the latter, we have a second fused-tensor problem to solve in the same PR. |
| 132 | + |
| 133 | +4. **RoPE config**. Phi-3 long-context variants use LongRoPE with two scaling factors (`short_factor`, `long_factor`). Phi-3-mini's 4K context might use vanilla RoPE — but Phi-3.5-mini's 128K context definitely uses LongRoPE. We'd need to read these from GGUF metadata and add them to `tq_rope`. |
| 134 | + |
| 135 | +5. **Sliding window**. Phi-3 uses `n_block_sparse_window` (varies by layer in some variants). Whether the `mini` variant uses it is unclear. |
| 136 | + |
| 137 | +6. **Special tokens**. Phi-3 uses `<|user|>`, `<|assistant|>`, `<|end|>` instead of ChatML — the chat template needs to know. |
| 138 | + |
| 139 | +## Estimated effort once we have a GGUF |
| 140 | + |
| 141 | +| Step | Effort | |
| 142 | +|---|---| |
| 143 | +| Tensor name detection (`attn_qkv` + variants) | XS — 20 lines | |
| 144 | +| `gguf_wqkv` field + forward dispatch | S — 60 lines | |
| 145 | +| `ffn_up_gate` if needed | S — 40 lines | |
| 146 | +| LongRoPE if Phi-3.5-mini | M — 100-150 lines, needs careful validation | |
| 147 | +| Sliding window detection | S — 30 lines (we have the infrastructure for Gemma) | |
| 148 | +| Phi-3 chat template in `cli.py` | XS — 10 lines | |
| 149 | +| Validation: load + 100 tokens + manual quality check | M — needs the GGUF | |
| 150 | + |
| 151 | +**Total**: maybe 300-400 lines of focused code. Most of it is mechanical once we know the exact names. |
| 152 | + |
| 153 | +## Recommendation |
| 154 | + |
| 155 | +**Option B**, but only after one of: |
| 156 | + |
| 157 | +1. **Tester provides** the exact Phi-3.5-mini-instruct-Q8 GGUF they used. Best path — same file the user already has running. |
| 158 | +2. **Tester runs** a small inspector script we provide that dumps tensor names + shapes from their GGUF, so we can validate our assumptions without shipping the file. |
| 159 | +3. **We pick** a specific bartowski Phi-3.5-mini Q4_K_M variant ourselves, download it, dump tensor names, and proceed. This is the slowest path because the failure modes (LongRoPE, sliding window) are subtle and easy to miss without ground-truth output to compare. |
| 160 | + |
| 161 | +Until then: do NOT implement. The hard-fail in P0-B is the right transition state — users see a clear error and know to wait, instead of debugging garbage. |
| 162 | + |
| 163 | +## Open questions for the human |
| 164 | + |
| 165 | +1. Do we have access to the same Phi-3.5-mini GGUF the tester used? (`Phi-3.5-mini-instruct-Q8_0.gguf`, 3.9 GB) |
| 166 | +2. If not, are we OK downloading one and using it as the reference? Storage / bandwidth? |
| 167 | +3. Should I write the GGUF inspector script (path 2) so the tester can run it for us? |
0 commit comments