fix(dflash): report prefix-cache hits as cached_tokens (#1441) by popfido · Pull Request #1768 · jundot/omlx

popfido · 2026-06-09T11:43:22Z

Summary

Fixes #1441 (DFlash breaks KV prefix cache — cached_tokens: 0 when DFlash is enabled, restored when disabled).

After re-verifying the issue, I found that it has split in two since it was filed (v0.3.12):

Perf half — already fixed. DFlashEngine now plumbs the prefix snapshot into stream_dflash_generate (PrefixCacheFlow + a persistent L1/L2 runtime_context), so prefill is skipped on a hit. I asserted this on the dflash-mlx side: a cold lookup misses, and a repeat of the same 4273-token prompt returns l1_exact with matched_tokens == 4273 (prefill skipped) — the existing test_prefix_cache_hit_kind.py suite covers it (87 passed).
Reporting half — this PR. The hit count (PrefixCacheFlow.hit_tokens) was computed but never mapped onto the output's cached_tokens, so the API always reported 0 with DFlash on. BatchedEngine/VLM already set cached_tokens=output.cached_tokens; DFlashEngine had zero cached_tokens references.

Fix

Surface PrefixCacheFlow.hit_tokens (matched prompt tokens) as cached_tokens, mirroring BatchedEngine:

_cached_tokens_from_flow(prefix_flow) — pure mapping (hit → count, miss/None/missing/negative → 0).
Non-streaming generate() — thread prefix_flow out of the executor _run and set cached_tokens on the output.
Streaming stream_generate() — carry the count on the final (usage) chunk's metrics only, so the server's per-chunk total_cached_tokens += output.cached_tokens sum isn't inflated; token deltas report 0.
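
The mapping and the streaming rule above can be sketched as follows. This is a minimal illustration, not the code from this PR: FakePrefixFlow and FakeOutput are hypothetical stand-ins for PrefixCacheFlow and GenerationOutput, and stream_chunks is an assumed simplification of stream_generate(). Only the hit_tokens and cached_tokens names come from the description.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Sequence


@dataclass
class FakePrefixFlow:
    """Hypothetical stand-in for PrefixCacheFlow; only hit_tokens is modeled."""
    hit_tokens: Optional[int] = None


@dataclass
class FakeOutput:
    """Hypothetical stand-in for one streamed output chunk."""
    text: str
    cached_tokens: int = 0
    finished: bool = False


def cached_tokens_from_flow(prefix_flow: object) -> int:
    """Map a prefix flow to a non-negative cached-token count."""
    hit = getattr(prefix_flow, "hit_tokens", None)
    # None flow, missing attribute, miss (0), non-int, or negative value -> 0.
    if not isinstance(hit, int) or isinstance(hit, bool) or hit <= 0:
        return 0
    return hit


def stream_chunks(pieces: Sequence[str], prefix_flow: object) -> Iterator[FakeOutput]:
    """Yield token deltas with cached_tokens=0, then one final chunk with the count."""
    for piece in pieces:
        yield FakeOutput(text=piece)
    yield FakeOutput(
        text="",
        cached_tokens=cached_tokens_from_flow(prefix_flow),
        finished=True,
    )


# Server-side aggregation as described above: a running sum over every chunk.
total_cached_tokens = 0
for chunk in stream_chunks(["Hel", "lo"], FakePrefixFlow(hit_tokens=4273)):
    total_cached_tokens += chunk.cached_tokens
```

Because only the final chunk carries a non-zero count, the running sum equals the hit count once, regardless of how many token deltas were streamed. Attaching the count to every chunk would multiply it by the number of chunks.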

Test plan

pytest tests/test_dflash_engine.py -k "TestDFlashCachedTokens and not Wiring" — 5 pure-mapping unit tests (run locally)
CI-gated test_generate_sets_cached_tokens_from_hit — asserts generate() sets cached_tokens from a prefix hit (skips where dflash-mlx is unavailable, runs in CI)
tests/test_dflash_engine.py — 55 passed (4 pre-existing _build_runtime_context env failures are unrelated, pass in CI); test_output_collector.py/test_server_metrics.py — 63 passed
No new ruff findings

Rebased onto current main.

DFlashEngine plumbs the prefix snapshot into stream_dflash_generate (so prefill IS skipped on a hit), but never set cached_tokens on its GenerationOutput, so the API reported cached_tokens: 0 on every turn with DFlash enabled, and the count returned the moment DFlash was disabled (BatchedEngine sets it). The underlying cache works; only the reporting was missing.

Surface PrefixCacheFlow.hit_tokens (the matched prompt-token count) as cached_tokens, mirroring BatchedEngine:

- _cached_tokens_from_flow() maps a prefix flow to its hit-token count.
- non-streaming generate(): thread prefix_flow out of the executor and set it on the output.
- streaming: carry it on the final (usage) chunk's metrics so the server's per-chunk sum isn't inflated.

Tests: pure-mapping unit tests (hit/miss/None/missing/negative) run locally; a CI-gated end-to-end test asserts generate() sets cached_tokens from a hit.

jundot · 2026-06-10T02:10:23Z

Thanks for the focused fix. I verified that this keeps the change scoped to DFlash usage reporting: PrefixCacheFlow.hit_tokens maps to the matched prompt tokens, non-streaming now surfaces it on GenerationOutput.cached_tokens, and streaming only carries it on the final usage chunk so per-chunk aggregation does not overcount.

I also ran the focused DFlash tests and a local DFlash smoke check; repeated prompts now report cached tokens on the second request. This looks good to me, and I'm going to merge it.

jundot merged commit 223216f into jundot:main Jun 10, 2026
4 checks passed

popfido deleted the fix/dflash-cached-tokens-1441 branch June 10, 2026 02:40
