Skip to content

Commit 1e1ea2c

Browse files
unamedkrclaude
andauthored
feat(phi3): end-to-end Phi-3 / Phi-3.5 architecture support (#65)
Adds full inference support for Phi-3 family GGUFs (validated against bartowski's `Phi-3.5-mini-instruct-Q4_K_M.gguf`). Output is coherent multi-paragraph English. Phi-3.5-mini becomes the new "best speed + quality" recommendation in the model registry. ## Why Phi-3 is hard Phi-3 ships fused weight tensors instead of llama-style separate ones, plus a long-context RoPE variant: blk.N.attn_qkv.weight shape [hidden, 3*hidden] Q ‖ K ‖ V blk.N.ffn_up.weight shape [hidden, 2*ff] gate ‖ up rope_factors_short [head_dim/2] LongRoPE rope_factors_long [head_dim/2] LongRoPE + rope.scaling.attn_factor long-context Q scaling Before this PR: load reported `0 self_attn` and the forward pass ran against zero-initialized attention weights → garbage tokens. ## What this PR adds ### Loader (`tq_load_gguf`) - Detects `blk.N.attn_qkv.weight` and stores its raw quantized pointer in a new field `gguf_w_qkv` (+ type). Marks the layer as attention. - Detects `blk.N.ffn_up.weight` with `shape[1] == 2 * intermediate_dim` AND no separate `ffn_gate.weight` → fused gate||up. Stores in `gguf_w_up_gate`. - Reads `phi3.rope.scaling.original_context_length` and `phi3.rope.scaling.attn_factor` via the existing arch-prefix macro. - Locates `rope_factors_short.weight` / `rope_factors_long.weight` as global tensors and stores raw F32 pointers in the model config. - The hard-fail path from the previous PR now correctly identifies Phi-3 as a *supported* architecture (n_attn_layers == 32, not 0). ### Forward (`self_attn_forward` + tq_forward dispatcher) - New `if (layer->gguf_w_qkv)` branch in self_attn_forward: one `tq_matmul_gguf` call into a temp buffer of size `q_dim + 2*kv_dim`, then memcpy splits into `s->q`, `s->k`, `s->v`. The K and V projection blocks below are skipped when fused. - New `if (layer->gguf_w_up_gate)` branch in the FFN section: one matmul of size `2*inter` into `s->hb`, memcpy second half to `s->hb2`. Layout is `[gate | up]` (HuggingFace convention). - The dispatcher (`tq_forward` layer loop) now also calls `self_attn_forward` when `layer->gguf_w_qkv != NULL` — without this the new branch was unreachable because the existing condition checks for `gguf_wq` separately. - Same for the FFN dispatcher: accept `gguf_w_up_gate` as a valid gate/up source. ### LongRoPE - New branch in the full RoPE path: when `rope_factors_short` or `rope_factors_long` is set, apply per-frequency-pair rescaling using `factor[i] = (pos < orig_ctx_len) ? short[i] : long[i]` and `freq[i] = 1 / (rope_base^(2i/head_dim) * factor[i])`. - Uses **NeoX-style** pair layout `(q[i], q[i + half])` rather than the interleaved `(q[2i], q[2i+1])` that quant.cpp's `tq_rope` uses for Llama. The reason: llama.cpp's GGUF converter pre-permutes separate Q/K weights so interleaved RoPE produces equivalent results, but the *fused* `attn_qkv` tensor is NOT permuted. - `rope_attn_factor` is multiplied into Q only when `pos >= orig_ctx_len` — no scaling at short context. ### State allocation (`tq_create_state_ex`) - Bumps `max_dim` to cover `q_dim + 2*kv_dim` (the fused QKV temp buffer reuses `s->xb2`) when `has_fused_qkv` is set. - Bumps the `s->hb` allocation to `2 * inter` (fused gate||up output) when `has_fused_up_gate` is set. `s->hb2` stays at `inter`. ### Tokenizer BOS - `tq_encode` adds Phi-3 / Llama's `<s>` to its BOS lookup chain alongside `<bos>` (Gemma) and `<|im_start|>` (Qwen). - `quant_generate` also enables `add_bos=1` when the vocab has `<s>` — Phi-3 specifically degrades into garbage without it. Existing Llama-3 behavior is unchanged because Llama-3 uses `<|begin_of_text|>` which the lookup chain also handles. ## Validation End-to-end inference test (`tools/phi3_infer_test.c`): ``` $ ./phi3_infer_test ~/.cache/quantcpp/Phi-3.5-mini-instruct-Q4_K_M.gguf The capital of France is Paris. The Eiffel Tower, located in the city center, stands as a symbolic landmark for both locals and tourists alike. The tower's iconic silhouette is visible from various points around the city, offering a panoramic view of Parisian life and its vibrant culture. The Seine River meanders through this historic metropolis... ``` Chat template: ``` $ ./phi3_infer_test ... "What is 2+2?" <|user|> What is 2+2?<|end|> <|assistant|> The sum of 2 + 2 equals to four. ``` Regression checks: - ctest --test-dir build → 35/35 passed - Llama-3.2-1B end-to-end → still coherent - SmolLM2-135M end-to-end → still coherent - Full build clean (no new warnings) ## Registry / docs - `Phi-3.5-mini` added to `_MODEL_REGISTRY` with the bartowski Q4_K_M variant (~2.4 GB). Listed as "best speed + quality" in `docs/supported_models.md`. - New aliases `phi3.5`, `phi3.5:mini`, `phi-3.5`, `phi-3.5-mini`. - Architecture matrix updated: phi3 now ✅ Fully supported. - Docs section "Why phi3 is hard" replaced with "How Phi-3 support works" — explains fused tensors + LongRoPE + NeoX rotation choice. - Spike doc `docs/spikes/2026-04-12_phi3_support.md` updated with inspection findings and the conclusions that drove implementation. ## New tools - `tools/gguf_inspect.c` — dump tensor names, shapes, types, and metadata from a GGUF file. Used to verify Phi-3.5's layout before writing loader code. General-purpose, kept for future architecture work. - `tools/phi3_infer_test.c` — minimal end-to-end inference test. Doubles as a smoke test for any future Phi-3 changes. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent a7795a5 commit 1e1ea2c

7 files changed

Lines changed: 795 additions & 67 deletions

File tree

bindings/python/quantcpp/__init__.py

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -96,6 +96,17 @@ class ChatContextOverflow(RuntimeError):
9696
"llama-3.2-1b-instruct-q4_k_m.gguf",
9797
750,
9898
),
99+
# Phi-3.5-mini-instruct (3.8B params, vocab 32K).
100+
# Added 2026-04-12 after end-to-end Phi-3 architecture support
101+
# landed (fused QKV / fused gate+up FFN / LongRoPE). The 32K vocab
102+
# is the smallest of the registry, which makes the lm_head matmul
103+
# the fastest per-token. Combined with 3.8B params it's the best
104+
# quality-per-token model we ship.
105+
"Phi-3.5-mini": (
106+
"bartowski/Phi-3.5-mini-instruct-GGUF",
107+
"Phi-3.5-mini-instruct-Q4_K_M.gguf",
108+
2400,
109+
),
99110
}
100111

101112
def available_models():

bindings/python/quantcpp/cli.py

Lines changed: 11 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -23,13 +23,17 @@
2323
# the recommended default. Users who explicitly want the 135M demo model
2424
# need to ask for it by full name.
2525
MODEL_ALIASES = {
26-
"smollm2": "SmolLM2-1.7B",
27-
"smollm2:1.7b": "SmolLM2-1.7B",
28-
"smollm2:135m": "SmolLM2-135M",
29-
"qwen3.5": "Qwen3.5-0.8B",
30-
"qwen3.5:0.8b": "Qwen3.5-0.8B",
31-
"llama3.2": "Llama-3.2-1B",
32-
"llama3.2:1b": "Llama-3.2-1B",
26+
"smollm2": "SmolLM2-1.7B",
27+
"smollm2:1.7b": "SmolLM2-1.7B",
28+
"smollm2:135m": "SmolLM2-135M",
29+
"qwen3.5": "Qwen3.5-0.8B",
30+
"qwen3.5:0.8b": "Qwen3.5-0.8B",
31+
"llama3.2": "Llama-3.2-1B",
32+
"llama3.2:1b": "Llama-3.2-1B",
33+
"phi3.5": "Phi-3.5-mini",
34+
"phi3.5:mini": "Phi-3.5-mini",
35+
"phi-3.5": "Phi-3.5-mini",
36+
"phi-3.5-mini": "Phi-3.5-mini",
3337
}
3438

3539

Lines changed: 167 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,167 @@
1+
# Spike — Phi-3 / Phi-3.5 architecture support
2+
3+
**Date**: 2026-04-12
4+
**Driver**: External user feedback (`docs/feedback/2026-04-12_0900.md`, item 2.6)
5+
**Status**: Investigation complete; implementation gated on having a real GGUF to validate against
6+
**Recommendation**: do NOT merge a fix without an end-to-end validation run
7+
8+
## Why Phi-3 matters
9+
10+
Phi-3.5-mini is the highest-value model NOT supported by quant.cpp:
11+
12+
- **vocab 32K** — smaller than SmolLM2 (49K), Llama-3.2-1B (128K), Gemma (256K)
13+
- **3.8B params** — bigger than SmolLM2-1.7B but the small vocab keeps lm_head fast
14+
- the tester estimated `~94 tok/s` (`60 tokens / 0.85 s`) before realizing the inference was producing garbage — that number reflects what the matmul kernels can do; only the attention path is broken
15+
16+
If we get this working, Phi-3.5-mini becomes the new "best speed/quality" recommendation, ahead of SmolLM2-1.7B.
17+
18+
## Current state
19+
20+
`tq_load_gguf` (in `quant.h`, lines 11640-11680) looks for these tensor names per layer:
21+
22+
```
23+
blk.N.attn_q.weight ← required to mark layer as self_attn
24+
blk.N.attn_k.weight
25+
blk.N.attn_v.weight
26+
blk.N.attn_output.weight
27+
```
28+
29+
When loading a Phi-3 GGUF, none of these exist — Phi-3 ships fused QKV. Phi-3's tensors (in llama.cpp's GGUF naming convention) are:
30+
31+
```
32+
blk.N.attn_qkv.weight ← shape [3 * hidden_dim, hidden_dim], fused
33+
blk.N.attn_output.weight
34+
blk.N.ffn_up.weight ← may also be fused as ffn_up_gate, depending on converter
35+
blk.N.ffn_down.weight
36+
```
37+
38+
Result: `is_attn_layer = 0` for every layer, `n_attn_layers = 0`, the new hard-fail check in P0-B catches it and returns NULL with a clear error. No more garbage tokens — but no working inference either.
39+
40+
## Two implementation strategies
41+
42+
### Option A — Loader splits at load time
43+
44+
After detecting `attn_qkv`, dequantize the fused tensor, slice along the output dimension into three `[hidden_dim, hidden_dim]` views, re-quantize each as a separate Q4_K (or whichever type the GGUF used), and store them in `gguf_wq`/`gguf_wk`/`gguf_wv`.
45+
46+
**Pros**: zero forward-path changes, drops into existing `tq_matmul_gguf` calls.
47+
**Cons**:
48+
1. Doubles RAM during load (need both fused + split versions)
49+
2. Re-quantization is **lossy** — running the original model through Q4_K → FP32 → Q4_K introduces measurable error
50+
3. Won't work for tensor types we don't have a quantizer for (we'd need a quantizer for every supported GGUF type)
51+
4. Slow at load
52+
53+
### Option B — Forward path dispatches fused matmul (RECOMMENDED)
54+
55+
Add a new field `gguf_wqkv` (data + type) to `tq_layer_weights_t`. Loader sets it from `blk.N.attn_qkv.weight` directly. Forward path checks: if `gguf_wqkv` is set, do one big matmul into a temp buffer of size `3 * hidden_dim`, then split into the existing `s->q`, `s->k`, `s->v` outputs.
56+
57+
**Pros**:
58+
1. No re-quantization, no precision loss
59+
2. No extra load-time work
60+
3. Works with any GGUF type we already support in `tq_matmul_gguf`
61+
4. Single big matmul is faster than 3 smaller ones (better cache reuse)
62+
63+
**Cons**:
64+
1. Need a temp buffer for the fused output
65+
2. New branch in the forward path (small)
66+
3. Need to pass `q_dim`, `k_dim`, `v_dim` so the split knows where K starts and V starts (Phi-3 may not use GQA, but we can't assume)
67+
68+
`tq_matmul_gguf` already accepts `(weight, type, out_dim, in_dim)` — it doesn't care whether the underlying tensor is fused or not. We can call it once with `out_dim = q_dim + k_dim + v_dim`.
69+
70+
## Inspection results (2026-04-12)
71+
72+
Used `tools/gguf_inspect.c` against `bartowski/Phi-3.5-mini-instruct-Q4_K_M.gguf` (2.39 GB). Findings:
73+
74+
### Per-layer tensors (32 layers, 6 tensors each)
75+
76+
```
77+
blk.N.attn_norm.weight F32 [3072]
78+
blk.N.attn_qkv.weight Q5_K [3072, 9216] ← FUSED QKV (3 * 3072)
79+
blk.N.attn_output.weight Q4_K [3072, 3072]
80+
blk.N.ffn_norm.weight F32 [3072]
81+
blk.N.ffn_up.weight Q4_K [3072, 16384] ← FUSED gate+up (2 * 8192)
82+
blk.N.ffn_down.weight Q6_K [8192, 3072]
83+
```
84+
85+
### Global tensors
86+
87+
```
88+
token_embd.weight Q4_K [3072, 32064]
89+
output.weight Q6_K [3072, 32064]
90+
output_norm.weight F32 [3072]
91+
rope_factors_long.weight F32 [48] ← LongRoPE
92+
rope_factors_short.weight F32 [48] ← LongRoPE
93+
```
94+
95+
### Metadata
96+
97+
- arch: `phi3`
98+
- embedding_length: 3072 (hidden_dim)
99+
- block_count: 32
100+
- head_count: 32
101+
- head_count_kv: 32 (NO GQA)
102+
- rope.dimension_count: 96 (head_dim per head)
103+
- rope.freq_base: 10000
104+
- rope.scaling.original_context_length: 4096 (LongRoPE switch point)
105+
- rope.scaling.attn_factor: 1.19024 (Q/K magnitude scaling for long context)
106+
- context_length: 131072
107+
- feed_forward_length: 8192
108+
- vocab_size: 32064
109+
- bos_token_id: 1, eos_token_id: 32000
110+
111+
### Conclusions
112+
113+
1. **Fused QKV** confirmed. Layout `[Q | K | V]` along output axis. Each section is `hidden_dim = 3072` floats. Total `9216 = 3 * 3072`.
114+
2. **Fused FFN** ALSO confirmed. `ffn_up.weight` is `[hidden, 2*ff]` not `[hidden, ff]`. Layout `[?, ?]` — order TBD by validation, but llama.cpp's reference loads as `[gate, up]` chunked from this single tensor.
115+
3. **LongRoPE present**: separate `rope_factors_short` and `rope_factors_long` tables of size 48 = head_dim/2. Used to rescale per-frequency RoPE rotations for sequences past the 4096-token original context.
116+
4. **No special tokens for ChatML**. Phi-3 uses `<|user|>`, `<|assistant|>`, `<|end|>` (text strings, not BPE special tokens). Chat template differs from Llama-3 / ChatML.
117+
5. **Vocab 32K** confirms the speed advantage — `lm_head` matmul is `3072 × 32064` vs Llama-3.2-1B's `2048 × 128256`. About 2.7× smaller per-token cost.
118+
119+
## What's still unknown (resolved by trial)
120+
121+
I need a real Phi-3 GGUF to verify:
122+
123+
1. **Exact tensor names**. llama.cpp's GGUF converter has changed conventions over the years. The fused tensor might be named:
124+
- `blk.N.attn_qkv.weight`
125+
- `blk.N.attn_qkv_proj.weight`
126+
- `blk.N.qkv.weight`
127+
- …and there may be a separate bias tensor
128+
129+
2. **Shape ordering**. Is the fused tensor `[Q | K | V]` along axis 0, or some other layout? Phi-3 has `n_heads = 32` and `n_kv_heads = 32` (no GQA in the 3.8B variant), so all three sub-tensors are the same size — but I want to verify.
130+
131+
3. **FFN fusion**. Does this Phi-3 GGUF use `ffn_up` + `ffn_gate` as separate tensors (llama-style) or `ffn_up_gate` (Phi-style fused)? If the latter, we have a second fused-tensor problem to solve in the same PR.
132+
133+
4. **RoPE config**. Phi-3 long-context variants use LongRoPE with two scaling factors (`short_factor`, `long_factor`). Phi-3-mini's 4K context might use vanilla RoPE — but Phi-3.5-mini's 128K context definitely uses LongRoPE. We'd need to read these from GGUF metadata and add them to `tq_rope`.
134+
135+
5. **Sliding window**. Phi-3 uses `n_block_sparse_window` (varies by layer in some variants). Whether the `mini` variant uses it is unclear.
136+
137+
6. **Special tokens**. Phi-3 uses `<|user|>`, `<|assistant|>`, `<|end|>` instead of ChatML — the chat template needs to know.
138+
139+
## Estimated effort once we have a GGUF
140+
141+
| Step | Effort |
142+
|---|---|
143+
| Tensor name detection (`attn_qkv` + variants) | XS — 20 lines |
144+
| `gguf_wqkv` field + forward dispatch | S — 60 lines |
145+
| `ffn_up_gate` if needed | S — 40 lines |
146+
| LongRoPE if Phi-3.5-mini | M — 100-150 lines, needs careful validation |
147+
| Sliding window detection | S — 30 lines (we have the infrastructure for Gemma) |
148+
| Phi-3 chat template in `cli.py` | XS — 10 lines |
149+
| Validation: load + 100 tokens + manual quality check | M — needs the GGUF |
150+
151+
**Total**: maybe 300-400 lines of focused code. Most of it is mechanical once we know the exact names.
152+
153+
## Recommendation
154+
155+
**Option B**, but only after one of:
156+
157+
1. **Tester provides** the exact Phi-3.5-mini-instruct-Q8 GGUF they used. Best path — same file the user already has running.
158+
2. **Tester runs** a small inspector script we provide that dumps tensor names + shapes from their GGUF, so we can validate our assumptions without shipping the file.
159+
3. **We pick** a specific bartowski Phi-3.5-mini Q4_K_M variant ourselves, download it, dump tensor names, and proceed. This is the slowest path because the failure modes (LongRoPE, sliding window) are subtle and easy to miss without ground-truth output to compare.
160+
161+
Until then: do NOT implement. The hard-fail in P0-B is the right transition state — users see a clear error and know to wait, instead of debugging garbage.
162+
163+
## Open questions for the human
164+
165+
1. Do we have access to the same Phi-3.5-mini GGUF the tester used? (`Phi-3.5-mini-instruct-Q8_0.gguf`, 3.9 GB)
166+
2. If not, are we OK downloading one and using it as the reference? Storage / bandwidth?
167+
3. Should I write the GGUF inspector script (path 2) so the tester can run it for us?

docs/supported_models.md

Lines changed: 33 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,8 @@ tracks what works, what loads-but-fails, and how to pick a model.
88

99
| Use case | Model | Why |
1010
|---|---|---|
11-
| **First-time install** | `SmolLM2-1.7B` (Q8) | Fastest end-to-end on a laptop. Vocab 49K keeps the lm_head matmul small (~12 tok/s on Apple M3). |
11+
| **Best speed + quality** | `Phi-3.5-mini` (Q4_K_M) | 3.8B params with vocab 32K — the smallest lm_head in the registry. Coherent multi-paragraph output. |
12+
| **Lightweight all-rounder** | `SmolLM2-1.7B` (Q8) | Fastest small model on a laptop. Vocab 49K keeps the lm_head matmul small (~12 tok/s on Apple M3). |
1213
| Smaller download | `Llama-3.2-1B` (Q4_K_M) | 750 MB vs 1.7 GB, but ~5x slower at inference time due to 128K vocab. |
1314
| Quick smoke test | `SmolLM2-135M` (Q8) | 138 MB download to verify the install path. Output quality is poor — not for real use. |
1415

@@ -32,12 +33,12 @@ print(m.ask("What is gravity?"))
3233
|---|:---:|:---:|:---:|:---:|---|
3334
| **llama** (SmolLM2, Llama-3.x, Mistral) ||||| **Fully supported** |
3435
| llama with 128K vocab (Llama-3.2-1B) |||| slow | Supported, vocab is the bottleneck |
36+
| **phi3** / **phi3.5** (fused QKV + LongRoPE) ||||| **Fully supported** (since 2026-04-12) |
3537
| **gemma** (Gemma 2) ||||| Supported |
3638
| **gemma3** ||||| Supported with hybrid sliding-window attention |
3739
| **gemma4** (Gemma-4-E2B / E4B) ||| ⚠️ | ⚠️ | Partial — some Q4_K_M variants produce garbage; report with file SHA256 |
3840
| **qwen** / **qwen2** ||||| Supported |
3941
| **qwen3.5** (DeltaNet hybrid) ||| partial | ⚠️ | Partial — pure-attention layers work, DeltaNet hybrid still being validated |
40-
| **phi3** / **phi3.5** (fused QKV) ||||| **Not supported** — uses `attn_qkv`, see "Why phi3 is hard" below |
4142

4243
✅ = works · ⚠️ = loads but inference is unreliable · ❌ = load fails fast with a clear error (since 2026-04-12)
4344

@@ -78,31 +79,38 @@ benchmarks on Apple M3 (8-core CPU, 16 GB RAM):
7879
vocab size is a better predictor of interactive latency than parameter
7980
count. Pick the smallest vocab that produces output you're happy with.
8081

81-
## Why phi3 is hard
82+
## How Phi-3 support works
8283

83-
Phi-3 / Phi-3.5 uses a *fused* QKV projection: instead of three separate
84-
tensors `attn_q.weight`, `attn_k.weight`, `attn_v.weight`, it ships one
85-
`attn_qkv.weight` with all three projections concatenated along the
86-
output dimension.
84+
Phi-3 / Phi-3.5 uses fused weight tensors instead of llama-style separate ones:
8785

88-
quant.cpp's GGUF loader currently looks for the three-tensor layout
89-
(`blk.N.attn_q.weight` etc.). When it loads a Phi-3 GGUF, none of those
90-
names match → 0 self_attn layers detected → forward pass runs against
91-
zero-initialized attention weights → garbage tokens.
92-
93-
Adding Phi-3 support requires either:
94-
95-
1. **Loader splits** `attn_qkv.weight` into the three views at load time
96-
and writes them into the existing `wq`/`wk`/`wv` slots, OR
97-
2. **Forward path** learns to dispatch a fused QKV matmul when the
98-
loader detects the fused tensor.
99-
100-
Option (1) is simpler but doubles the working set during load. Option
101-
(2) is the right long-term answer. There's a tracking issue / spike in
102-
progress; until then Phi-3 is the highest-value missing architecture for
103-
quant.cpp's "speed + quality" target (Phi-3.5-mini has vocab 32K plus
104-
3.8B params — it would beat both SmolLM2-1.7B and Llama-3.2-1B at
105-
interactive use).
86+
| Tensor | Shape | What's inside |
87+
|---|---|---|
88+
| `blk.N.attn_qkv.weight` | `[hidden, 3*hidden]` | Q ‖ K ‖ V along the output axis |
89+
| `blk.N.ffn_up.weight` | `[hidden, 2*ff]` | gate ‖ up along the output axis |
90+
91+
The loader detects these by name, stores the raw quantized pointers in
92+
new fields (`gguf_w_qkv`, `gguf_w_up_gate`), and the forward path
93+
dispatches a single matmul into a temp buffer for each, then `memcpy`
94+
splits the result into the existing per-section state buffers.
95+
96+
Phi-3 also uses **LongRoPE** with two per-frequency-pair rescaling
97+
tables (`rope_factors_short`, `rope_factors_long`) and a separate
98+
attention magnitude factor (`rope.scaling.attn_factor`). These extend
99+
RoPE rotation from the original 4096-token training context out to
100+
131K. The forward path picks the short or long table based on
101+
position, applies the rescaled rotation in **NeoX-style** layout (pairs
102+
are `(q[i], q[i+half])`, not `(q[2i], q[2i+1])`), and multiplies Q by
103+
`attn_factor` only when `pos >= original_context_length`.
104+
105+
Why NeoX-style for Phi-3 specifically: llama.cpp's GGUF converter
106+
pre-permutes separate `attn_q/k/v` tensors so the standard interleaved
107+
RoPE works for Llama-family models. The fused `attn_qkv` tensor is NOT
108+
permuted, so we have to apply rotation in its native NeoX form.
109+
110+
Phi-3.5-mini at the recommended Q4_K_M quantization clocks in at
111+
**~32K vocab + 3.8B params**, which makes the lm_head matmul the
112+
fastest of any model in the registry — the best speed/quality combo
113+
quant.cpp ships.
106114

107115
## Reporting an unsupported model
108116

0 commit comments

Comments
 (0)