LFM tool-calling — models × frameworks × tests

16 (framework,version,model) runs · generated 2026-06-05T11:40:15 · store=results.jsonl
click any cell for request / response / raw output / reason
legend: pass · fail (parser) · model failed · server failed · not run
scoring is driven by a reference LFM2 parser: the model's raw output is parsed by the reference and compared with the framework's own parse (✓/✗ in the detail pane). fail = the two parses disagree, i.e. the framework's parser is wrong. When they agree, the parse is trusted and the case check decides the outcome: pass, or model failed (the parse was faithful but the call itself is wrong, which is the model's own mistake). model failed counts as neither pass nor fail and is excluded from %.
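The scoring rule above can be sketched as a small decision function. This is a minimal illustration, not the harness's actual code: the names `Outcome`, `score_case` and `case_check`, and the representation of a parse as a list of tool-call dicts, are all assumptions made for the example.

```python
from enum import Enum
from typing import Any, Callable


class Outcome(Enum):
    PASS = "pass"
    FAIL_PARSER = "fail (parser)"
    MODEL_FAILED = "model failed"


def score_case(
    reference_parse: Any,
    framework_parse: Any,
    case_check: Callable[[Any], bool],
) -> Outcome:
    """Classify one test case (hypothetical helper).

    reference_parse: tool calls extracted from the raw output by the reference parser.
    framework_parse: tool calls the framework itself reported.
    case_check: returns True when the parsed calls satisfy the test case.
    """
    # Disagreement means the framework's parser is at fault.
    if framework_parse != reference_parse:
        return Outcome.FAIL_PARSER
    # The parses agree, so the parse is trusted and the case check decides.
    if case_check(reference_parse):
        return Outcome.PASS
    # Faithful parse but wrong call: the model's own mistake.
    return Outcome.MODEL_FAILED
```

With this shape, a framework parser that drops a call the reference parser finds is a fail, while a correctly parsed call with the wrong argument is a model failed.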
the raw output is the model's actual generation in every case: llama.cpp via __verbose, SGLang via the chat response's own logprobs token stream, vLLM via its server /tokenize prompt, so the framework-vs-reference comparison is exact.
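columns: model, then for each framework the eight tests and %
cells: P = pass, F = fail (parser), M = model failed

llama.cpp@9519 (7fe2ae45a)
model                 single  weather  hello  parallel  dotted  mixed  par-3  markdown  %
LFM2.5-350M           P       P        M      P         F       F      F      P         57%
LFM2.5-1.2B-Instruct  P       P        P      P         F       P      F      M         71%
LFM2.5-1.2B-JP        P       P        P      P         M       P      P      M         100%
LFM2.5-1.2B-Thinking  P       P        P      P         M       F      M      P         83%
LFM2.5-1.2B-Base      P       P        M      P         F       F      F      P         57%
LFM2.5-VL-1.6B        P       P        M      P         F       F      F      M         50%
LFM2.5-8B-A1B         P       M        P      P         F       M      F      M         60%
LFM2-1.2B-Tool        P       P        P      P         F       F      F      P         62%

llama.cpp-local@9525 (71b74a408)
model                 single  weather  hello  parallel  dotted  mixed  par-3  markdown  %
LFM2.5-350M           P       P        M      P         P       P      P      P         100%
LFM2.5-1.2B-Instruct  P       P        P      P         P       P      P      M         100%
LFM2.5-1.2B-JP        P       P        P      P         P       P      P      P         100%
LFM2.5-1.2B-Thinking  P       P        P      P         M       P      M      P         100%
LFM2.5-1.2B-Base      P       P        M      P         P       M      P      P         100%
LFM2.5-VL-1.6B        P       P        P      P         P       P      P      P         100%
LFM2.5-8B-A1B         P       M        P      P         P       M      P      P         100%
LFM2-1.2B-Tool        P       P        P      P         P       P      P      P         100%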
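
The % column can be reproduced from a row's cells by counting only pass and fail and leaving model failed out of the denominator. The sketch below assumes the cell letters mean P = pass, F = fail (parser) and M = model failed; the function name `pass_percent` is hypothetical. The dashboard's rounding rule is not stated, so Python's built-in `round` (round half to even) is also an assumption, although it does agree with the 62% shown for LFM2-1.2B-Tool (5 of 8, i.e. 62.5%).

```python
from typing import Optional


def pass_percent(cells: str) -> Optional[int]:
    """Return the pass percentage for a row of cell letters (hypothetical helper).

    Only P (pass) and F (fail) are scored; M (model failed) and any other
    state are excluded from both numerator and denominator.
    """
    passed = cells.count("P")
    failed = cells.count("F")
    scored = passed + failed
    if scored == 0:
        return None  # nothing scorable in this row
    # round() uses round-half-to-even, so 62.5 becomes 62 (assumed rounding).
    return round(100 * passed / scored)


# Rows taken from the llama.cpp@9519 column group.
print(pass_percent("PPMPFFFP"))  # LFM2.5-350M: 4 pass, 3 fail
print(pass_percent("PPPPFFFP"))  # LFM2-1.2B-Tool: 5 pass, 3 fail
```

For LFM2.5-350M this gives 4 / (4 + 3), which rounds to the 57% shown in the table.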