Description
After commit 91814d4 ("Phi-3.5 server support + Metal workaround"), SmolLM2-1.7B server inference produces garbage output. This model previously worked correctly.
Steps to Reproduce
./build-metal/quant-server SmolLM2-1.7B-Instruct-Q8_0.gguf -p 8080 -j 8
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"What is gravity?"}],"max_tokens":30,"temperature":0.0}'
Actual Output
{"content":"<|im_endturernocturno<|im_ennd>\nWhat is the answer to this question: What is"}
Also tested with TQ_NO_METAL=1 — same garbage output.
Expected Output
Gravity is the force that attracts two objects with mass towards each other...
(Worked correctly in earlier builds before 91814d4)
Root Cause Hypothesis
The tq_matmul_force_cpu thread-local variable and _phi3_force_cpu flag in tq_forward() may not be correctly scoped — if the flag leaks across requests or isn't reset, non-Phi-3 models could get incorrect matmul routing.
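To make the suspected failure mode concrete, here is a minimal C sketch of a thread-local routing flag that leaks when it is set but never reset, next to a save-and-restore variant that keeps it scoped to one call. Every name prefixed with sketch_ is hypothetical and stands in for tq_matmul_force_cpu and tq_forward(); this is not the project's actual code, and whether the real flag behaves this way is still unverified.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the thread-local flag named in this report. */
static _Thread_local bool sketch_force_cpu = false;

/* Hypothetical router: returns 1 for the CPU matmul path, 0 for the default path. */
static int sketch_matmul_route(void) {
    return sketch_force_cpu ? 1 : 0;
}

/* Leaky variant: the flag is set for Phi-3 but never cleared, so a later
 * non-Phi-3 call on the same thread inherits the CPU route. */
int sketch_forward_leaky(bool is_phi3) {
    if (is_phi3) {
        sketch_force_cpu = true;
    }
    return sketch_matmul_route();
}

/* Scoped variant: save the previous value, set the flag for this call only,
 * and restore it before returning. */
int sketch_forward_scoped(bool is_phi3) {
    bool saved = sketch_force_cpu;
    sketch_force_cpu = is_phi3;
    int route = sketch_matmul_route();
    sketch_force_cpu = saved;
    return route;
}
```

With the leaky variant, a non-Phi-3 call that follows a Phi-3 call on the same thread is routed to the CPU path. With the scoped variant, the route depends only on the model passed to the current call.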
Also, the tq_matmul_gguf_cpu extern function may have buffer sizing assumptions that don't hold for Q8_0 matrices.
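As a reference point for checking the buffer sizing hypothesis, the sketch below computes the byte size of one Q8_0 row using the GGUF Q8_0 block layout (32 weights per block, stored as one 16-bit scale plus 32 int8 values, 34 bytes per block). The helper name and constants are hypothetical and are not taken from tq_matmul_gguf_cpu; the 2048-column example in the note is only an illustrative width.

```c
#include <stddef.h>

/* GGUF Q8_0 layout: each block holds 32 weights as a 2-byte scale
 * followed by 32 int8 quants, for 34 bytes per block. */
enum {
    SKETCH_Q8_0_BLOCK_ELEMS = 32,
    SKETCH_Q8_0_BLOCK_BYTES = 2 + 32
};

/* Hypothetical helper: bytes needed to hold one Q8_0 row of n_cols weights.
 * Returns 0 when n_cols is not a multiple of the block size. */
size_t sketch_q8_0_row_bytes(size_t n_cols) {
    if (n_cols % SKETCH_Q8_0_BLOCK_ELEMS != 0) {
        return 0;
    }
    return (n_cols / SKETCH_Q8_0_BLOCK_ELEMS) * SKETCH_Q8_0_BLOCK_BYTES;
}
```

For a 2048-column row this gives 64 blocks of 34 bytes, or 2176 bytes. A scratch buffer sized for a smaller per-block footprint (for example the 18-byte blocks of Q4_0) would be too small for the same row, which is the kind of mismatch worth ruling out here.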
Unit tests
All 35 unit tests pass (35/35); the regression is visible only in end-to-end server inference.
Environment
- Commit: 91814d4
- Model: SmolLM2-1.7B-Instruct-Q8_0.gguf (MHA 32/32)
- Build: cmake -DTQ_BUILD_METAL=ON
- OS: macOS 15 (Apple M3, 16GB)
Reported by ClawTeam