SmolLM2-1.7B server inference regression after 91814d4 (Phi-3 CPU fallback) #77

@unamedkr

Description

After commit 91814d4 ("Phi-3.5 server support + Metal workaround"), SmolLM2-1.7B server inference produces garbage output. This model previously worked correctly.

Steps to Reproduce

./build-metal/quant-server SmolLM2-1.7B-Instruct-Q8_0.gguf -p 8080 -j 8

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is gravity?"}],"max_tokens":30,"temperature":0.0}'

Actual Output

{"content":"<|im_endturernocturno<|im_ennd>\nWhat is the answer to this question: What is"}

Also tested with TQ_NO_METAL=1: same garbage output, so the regression does not appear to be specific to the Metal path.

Expected Output

Gravity is the force that attracts two objects with mass towards each other...

(Worked correctly in builds before 91814d4.)

Root Cause Hypothesis

The tq_matmul_force_cpu thread-local variable and the _phi3_force_cpu flag in tq_forward() may not be correctly scoped. If either flag leaks across requests or is not reset, non-Phi-3 models could get incorrect matmul routing.
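To make the scoping concern concrete, here is a minimal sketch of how a thread-local "force CPU" flag can leak and how a save/restore pattern avoids it. This is not the project's code: `force_cpu_flag`, `model_t`, `route_matmul`, `forward_leaky`, and `forward_scoped` are hypothetical names standing in for the flag and forward pass described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a thread-local "force CPU matmul" flag. */
static _Thread_local bool force_cpu_flag = false;

/* Hypothetical model descriptor; only the field needed for the sketch. */
typedef struct { bool is_phi3; } model_t;

/* Reports which path a matmul would take: 1 = CPU, 0 = default backend. */
static int route_matmul(void) { return force_cpu_flag ? 1 : 0; }

/* Leaky variant: sets the flag for Phi-3 and never clears it, so a later
 * call on the same thread inherits the CPU routing. */
static int forward_leaky(const model_t *m) {
    assert(m != NULL);
    if (m->is_phi3) force_cpu_flag = true;
    return route_matmul();
}

/* Scoped variant: save the old value, set it from the current model,
 * and restore it before returning. */
static int forward_scoped(const model_t *m) {
    assert(m != NULL);
    bool saved = force_cpu_flag;
    force_cpu_flag = m->is_phi3;
    int path = route_matmul();
    force_cpu_flag = saved;
    return path;
}
```

With the leaky variant, a non-Phi-3 call that follows a Phi-3 call on the same thread is still routed to the CPU path. The scoped variant derives the flag from the current model on every call, so an earlier call cannot affect it.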

Also, the tq_matmul_gguf_cpu extern function may have buffer sizing assumptions that don't hold for Q8_0 matrices.
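One way to check the buffer-sizing hypothesis is to compare the sizes the CPU path assumes against the Q8_0 layout. The sketch below assumes the standard GGUF Q8_0 block (32 int8 weights plus one 2-byte fp16 scale, 34 bytes per block). The helper names are hypothetical and are not taken from the project.

```c
#include <assert.h>
#include <stddef.h>

#define Q8_0_BLOCK_ELEMS 32  /* weights per Q8_0 block */
#define Q8_0_BLOCK_BYTES 34  /* 2-byte fp16 scale + 32 int8 weights */

/* Bytes needed to hold one Q8_0-quantized row of n_cols weights.
 * Assumes n_cols is a multiple of the block size. */
static size_t q8_0_row_bytes(size_t n_cols) {
    assert(n_cols % Q8_0_BLOCK_ELEMS == 0);
    return (n_cols / Q8_0_BLOCK_ELEMS) * Q8_0_BLOCK_BYTES;
}

/* Bytes needed for a full n_rows x n_cols Q8_0 matrix. */
static size_t q8_0_matrix_bytes(size_t n_rows, size_t n_cols) {
    return n_rows * q8_0_row_bytes(n_cols);
}
```

For example, a row of 2048 weights needs 64 blocks, or 2176 bytes. A buffer sized with a denser format in mind, such as Q4_0 at 18 bytes per 32 weights, would be roughly half of what Q8_0 requires.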

Unit tests

35/35 pass. The regression is only visible in end-to-end server inference.

Environment

  • Commit: 91814d4
  • Model: SmolLM2-1.7B-Instruct-Q8_0.gguf (MHA 32/32)
  • Build: cmake -DTQ_BUILD_METAL=ON
  • OS: macOS 15 (Apple M3, 16GB)

Reported by ClawTeam
