fix(dflash): report prefix-cache hits as cached_tokens (#1441) #1768

Merged: jundot merged 1 commit into jundot:main from popfido:fix/dflash-cached-tokens-1441 on Jun 10, 2026

@popfido popfido commented Jun 9, 2026

Contributor

Summary

Fixes #1441 (DFlash breaks KV prefix cache — cached_tokens: 0 when DFlash is enabled, restored when disabled).

I verified the issue and found that it has split in two since it was filed (v0.3.12):

  1. Perf half — already fixed. DFlashEngine now plumbs the prefix snapshot into stream_dflash_generate (PrefixCacheFlow + a persistent L1/L2 runtime_context), so prefill is skipped on a hit. I asserted this on the dflash-mlx side: a cold lookup misses, and a repeat of the same 4273-token prompt returns l1_exact with matched_tokens == 4273 (prefill skipped) — the existing test_prefix_cache_hit_kind.py suite covers it (87 passed).
  2. Reporting half — this PR. The hit count (PrefixCacheFlow.hit_tokens) was computed but never mapped onto the output's cached_tokens, so the API always reported 0 with DFlash on. BatchedEngine/VLM already set cached_tokens=output.cached_tokens; DFlashEngine had zero cached_tokens references.

Fix

Surface PrefixCacheFlow.hit_tokens (matched prompt tokens) as cached_tokens, mirroring BatchedEngine:

  • _cached_tokens_from_flow(prefix_flow) — pure mapping (hit → count, miss/None/missing/negative → 0).
  • Non-streaming generate() — thread prefix_flow out of the executor _run and set cached_tokens on the output.
  • Streaming stream_generate() — carry the count on the final (usage) chunk's metrics only, so the server's per-chunk total_cached_tokens += output.cached_tokens sum isn't inflated; token deltas report 0.

Test plan

  • pytest tests/test_dflash_engine.py -k "TestDFlashCachedTokens and not Wiring" — 5 pure-mapping unit tests (run locally)
  • CI-gated test_generate_sets_cached_tokens_from_hit — asserts generate() sets cached_tokens from a prefix hit (skips where dflash-mlx is unavailable, runs in CI)
  • tests/test_dflash_engine.py — 55 passed (4 pre-existing _build_runtime_context env failures are unrelated, pass in CI); test_output_collector.py/test_server_metrics.py — 63 passed
  • No new ruff findings

Rebased onto current main.

DFlashEngine plumbs the prefix snapshot into stream_dflash_generate (so prefill
IS skipped on a hit), but never set cached_tokens on its GenerationOutput — so
the API reported cached_tokens: 0 on every turn with DFlash enabled, and the
count returned the moment DFlash was disabled (BatchedEngine sets it). The
underlying cache works; only the reporting was missing.

Surface PrefixCacheFlow.hit_tokens (the matched prompt-token count) as
cached_tokens, mirroring BatchedEngine:
- _cached_tokens_from_flow() maps a prefix flow to its hit-token count.
- non-streaming generate(): thread prefix_flow out of the executor and set it
  on the output.
- streaming: carry it on the final (usage) chunk's metrics so the server's
  per-chunk sum isn't inflated.

Tests: pure-mapping unit tests (hit/miss/None/missing/negative) run locally;
a CI-gated end-to-end test asserts generate() sets cached_tokens from a hit.

jundot commented Jun 10, 2026

Owner

Thanks for the focused fix. I verified that this keeps the change scoped to DFlash usage reporting: PrefixCacheFlow.hit_tokens maps to the matched prompt tokens, non-streaming now surfaces it on GenerationOutput.cached_tokens, and streaming only carries it on the final usage chunk so per-chunk aggregation does not overcount.
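
The overcounting concern can be illustrated with a small sketch. It assumes the server sums `cached_tokens` across every streamed chunk, as the PR description states; the `Chunk` dataclass, `stream_chunks`, and `total_cached_tokens` are hypothetical names used only for this illustration and are not the project's API.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Chunk:
    # Hypothetical stand-in for one streamed output chunk.
    text: str
    cached_tokens: int = 0
    finished: bool = False


def stream_chunks(pieces: List[str], hit_tokens: int, final_only: bool) -> Iterator[Chunk]:
    """Yield one chunk per token delta, then a final usage chunk."""
    for piece in pieces:
        # With final_only=True, token deltas report 0 cached tokens.
        yield Chunk(piece, cached_tokens=0 if final_only else hit_tokens)
    # The final (usage) chunk always carries the hit count.
    yield Chunk("", cached_tokens=hit_tokens, finished=True)


def total_cached_tokens(chunks: Iterable[Chunk]) -> int:
    """Mimic a per-chunk sum on the server side."""
    total = 0
    for chunk in chunks:
        total += chunk.cached_tokens
    return total


pieces = ["a", "b", "c"]
print(total_cached_tokens(stream_chunks(pieces, 4273, final_only=True)))   # 4273
print(total_cached_tokens(stream_chunks(pieces, 4273, final_only=False)))  # 17092
```

Attaching the count to every chunk would multiply it by the number of chunks under a per-chunk sum, which is why only the final chunk carries it.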

I also ran the focused DFlash tests and a local DFlash smoke check; repeated prompts now report cached tokens on the second request. This looks good to me, and I'm going to merge it.

@jundot jundot merged commit 223216f into jundot:main Jun 10, 2026
4 checks passed
@popfido popfido deleted the fix/dflash-cached-tokens-1441 branch June 10, 2026 02:40
Development

Successfully merging this pull request may close these issues.

DFlash engine breaks KV prefix cache — 0 cache hits when DFlash enabled, cache restored after disabling

2 participants