LFM tool-calling — frameworks

LFM tool-calling — models × frameworks × tests

16 (framework,version,model) runs · generated 2026-06-05T11:40:15 · store=results.jsonl

click any cell for request / response / raw output / reason
legend: pass (P) · fail (parser) (F) · model failed (M) · server failed · not run

scoring is driven by a reference LFM2 parser: the model's raw output is parsed by the reference, and that result is compared with the framework's own parse (✓/✗ in the detail pane). fail = the two parses disagree, i.e. the framework's parser is wrong. when they agree, the parse is trusted and the case check decides the outcome: pass = the call satisfies the case check; model failed = the parse was faithful but the call itself is wrong (the model's own mistake). model failed counts as neither pass nor fail and is excluded from %.
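The scoring rule above can be sketched as a small decision function. This is a minimal illustration, not the harness's actual code: `Outcome`, `score_case`, and the shape of a parsed call (a list of dicts with `name` and `arguments`) are hypothetical names and structures chosen for the example.

```python
from enum import Enum
from typing import Any, Callable


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail (parser)"
    MODEL_FAILED = "model failed"


def score_case(
    reference_parse: Any,
    framework_parse: Any,
    case_check: Callable[[Any], bool],
) -> Outcome:
    """Classify one test case (hypothetical helper, assumed data shapes).

    reference_parse: tool calls produced by the reference parser from the raw output.
    framework_parse: tool calls reported by the framework for the same output.
    case_check: predicate saying whether the parsed calls are what the case expects.
    """
    # Step 1: the two parses must agree; otherwise the framework's parser is at fault.
    if framework_parse != reference_parse:
        return Outcome.FAIL
    # Step 2: the parse is trusted, so the case check decides.
    if case_check(reference_parse):
        return Outcome.PASS
    # Faithful parse, wrong call: the model's own mistake, excluded from %.
    return Outcome.MODEL_FAILED


# Usage: a single weather call, parsed identically by both sides.
calls = [{"name": "get_weather", "arguments": {"city": "Paris"}}]
print(score_case(calls, calls, lambda c: c[0]["name"] == "get_weather").value)
```

Checking parser agreement before the case check is what keeps a model mistake from being blamed on the framework: a wrong call that both parsers report identically never reaches the fail branch.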

the raw output is the model's actual generation in every case (llama.cpp via __verbose, SGLang via the chat response's own logprobs token stream, vLLM via its server /tokenize prompt), so the framework-vs-reference comparison is exact.

columns: llama.cpp@9519 (7fe2ae45a) · llama.cpp-local@9525 (71b74a408)

model	llama.cpp@9519 (7fe2ae45a)									llama.cpp-local@9525 (71b74a408)
model	single	weather	hello	parallel	dotted	mixed	par-3	markdown	%	single	weather	hello	parallel	dotted	mixed	par-3	markdown	%
LFM2.5-350M	P	P	M	P	F	F	F	P	57%	P	P	M	P	P	P	P	P	100%
LFM2.5-1.2B-Instruct	P	P	P	P	F	P	F	M	71%	P	P	P	P	P	P	P	M	100%
LFM2.5-1.2B-JP	P	P	P	P	M	P	P	M	100%	P	P	P	P	P	P	P	P	100%
LFM2.5-1.2B-Thinking	P	P	P	P	M	F	M	P	83%	P	P	P	P	M	P	M	P	100%
LFM2.5-1.2B-Base	P	P	M	P	F	F	F	P	57%	P	P	M	P	P	M	P	P	100%
LFM2.5-VL-1.6B	P	P	M	P	F	F	F	M	50%	P	P	P	P	P	P	P	P	100%
LFM2.5-8B-A1B	P	M	P	P	F	M	F	M	60%	P	M	P	P	P	M	P	P	100%
LFM2-1.2B-Tool	P	P	P	P	F	F	F	P	62%	P	P	P	P	P	P	P	P	100%
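The % column can be reproduced from a row's cells with a short calculation. The sketch below assumes P = pass, F = fail (parser), and M = model failed, with M excluded from the denominator as the scoring note states. `pass_percent` is a hypothetical helper, and rounding down is an assumption: it matches every row in the table (for example 5 of 8 gives 62), but the report's actual rounding mode is not stated.

```python
from typing import Optional, Sequence


def pass_percent(cells: Sequence[str]) -> Optional[int]:
    """Pass rate over scored cells only (hypothetical helper).

    Cells marked "M" (model failed) are neither pass nor fail,
    so they are left out of both numerator and denominator.
    """
    passed = sum(1 for c in cells if c == "P")
    failed = sum(1 for c in cells if c == "F")
    scored = passed + failed
    if scored == 0:
        return None  # nothing to score, so no percentage
    # Integer division rounds down (assumed rounding mode).
    return (100 * passed) // scored


# Usage: the LFM2.5-350M row under llama.cpp@9519 (4 pass, 3 fail, 1 model failed).
row = "P\tP\tM\tP\tF\tF\tF\tP".split("\t")
print(pass_percent(row))  # 57
```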
