perf(dflash): gate top-5 logits diagnostic behind LLAMA_DFLASH_DEBUG#78

Merged
marksverdhei merged 1 commit into ht from perf/dflash-gate-debug-top5 on Jun 12, 2026
@marksverdhei

Epoch #73 task 5 (DFlash perf scout, in lieu of titan-gated bench).

Finding

common/speculative.cpp:935 has an unconditional top-5 logits selection inside the DFlash drafter's hot path:

if (i == 1) {
    std::vector<int> top;
    top.reserve(5);
    for (int j = 0; j < n_vocab; ++j) {
        if ((int) top.size() < 5) {
            top.push_back(j);
            std::sort(top.begin(), top.end(), [&](int a, int b) { return logits[a] > logits[b]; });
        } else if (logits[j] > logits[top.back()]) {
            top.back() = j;
            std::sort(top.begin(), top.end(), [&](int a, int b) { return logits[a] > logits[b]; });
        }
    }
    // build top_dbg string and LOG_INF
}

The if (i == 1) check controls when the selection runs (once per draft call), not whether it runs. The LOG_INF below it is verbosity-gated, so in production the log is suppressed, but the O(n_vocab) scan, with a 5-element std::sort on every insertion, still runs.

On gemma-class vocabs (~256k tokens), that's ~1ms per draft call. At Round-10's measured ~8% accept rate, every output token costs several draft calls — so this debug computation is in the steady-state hot path.

Fix

Extend the gate to if (i == 1 && dflash_debug). dflash_debug is the cached env-var probe already declared at line 883 (used by the features-debug block immediately above).
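
As a rough illustration of the gating pattern described above, here is a minimal, self-contained sketch. It is not the code from common/speculative.cpp: `dflash_debug_enabled`, `draft_step`, and `g_diag_runs` are hypothetical names invented for this example, and the real `dflash_debug` probe at line 883 may be declared differently. The sketch assumes the probe only checks whether LLAMA_DFLASH_DEBUG is present in the environment.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for the cached env-var probe. The function-local
// static means std::getenv is called once, on first use, and the result
// is reused on every later call.
bool dflash_debug_enabled() {
    static const bool enabled = std::getenv("LLAMA_DFLASH_DEBUG") != nullptr;
    return enabled;
}

// Illustration only: counts how many times the diagnostic body ran.
int g_diag_runs = 0;

// Hypothetical per-position step of a draft call. The diagnostic body is
// entered only when both conditions hold, so with debug off the cost is a
// single branch instead of a scan over the whole vocabulary.
void draft_step(int i, const std::vector<float> & logits, bool debug) {
    if (i == 1 && debug) {
        ++g_diag_runs;
        // The top-5 selection over `logits` and the log call would go here.
    }
    (void) logits; // unused in this sketch
}
```

A caller would pass `dflash_debug_enabled()` as the `debug` argument; taking it as a parameter here just keeps the sketch easy to exercise with both values.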

Verified

  • cmake --build build --target llama-server succeeds
  • ✅ Behavior change: the diagnostic now runs only when LLAMA_DFLASH_DEBUG is set (previously unconditional); identical code runs when enabled

Out of scope

Other DFlash hot-path observations from the scout that did NOT make this PR (no clear-win fix):

  • accumulated_ctx grows unboundedly across draft calls. With the default LLAMA_DFLASH_CTX_WINDOW=512 only the tail is used, but the buffer keeps growing. This is a memory-growth problem rather than a perf one. Worth a follow-up trim.
  • std::sort inside the j-loop is asymptotically fine but runs a 5-element sort on every insertion. It could be replaced by a hand-rolled 5-slot insertion sort (O(5) per insertion instead of O(5 log 5)). The speed-up is small and only matters when LLAMA_DFLASH_DEBUG is set.
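
The hand-rolled 5-slot insertion mentioned in the last bullet could look roughly like the sketch below. This is one possible shape, not code from the repository: `Top5` and `top5_indices` are hypothetical names, and the sketch returns indices in descending logit order, matching what the quoted std::sort comparator produces (ties may be ordered differently).

```cpp
#include <array>
#include <vector>

// Hypothetical result type: up to five vocabulary indices, best first.
struct Top5 {
    std::array<int, 5> idx{};
    int count = 0;
};

// Single pass over the logits, keeping the five largest seen so far in a
// fixed array sorted in descending order. No heap allocation, no std::sort.
Top5 top5_indices(const std::vector<float> & logits) {
    Top5 top;
    const int n = static_cast<int>(logits.size());
    for (int j = 0; j < n; ++j) {
        // Fast reject: the buffer is full and j does not beat the current minimum.
        if (top.count == 5 && !(logits[j] > logits[top.idx[4]])) {
            continue;
        }
        // Start at the first free slot, or at the last slot (dropping the
        // current minimum) when the buffer is already full.
        int pos = (top.count < 5) ? top.count : 4;
        // Shift smaller entries one slot to the right until j's place is found.
        while (pos > 0 && logits[j] > logits[top.idx[pos - 1]]) {
            top.idx[pos] = top.idx[pos - 1];
            --pos;
        }
        top.idx[pos] = j;
        if (top.count < 5) {
            ++top.count;
        }
    }
    return top;
}
```

For most vocabulary entries only the fast-reject comparison runs, so the per-element cost is one comparison; the shifting loop touches at most four slots when an insertion does happen.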

Epoch #73 task 5 (DFlash perf scout, in lieu of titan-gated bench).

The top-5 logits selection at common/speculative.cpp:935 was
unconditional — `if (i == 1)` gated when (once per draft call) but not
whether. The LOG_INF below it is verbosity-gated, so on production the
log is suppressed, but the O(n_vocab * log 5) selection still runs.

On gemma-class vocabs (~256k tokens) the selection burns ~1ms per draft
call. At Round-10's measured ~8% accept rate, every output token costs
several draft calls — so this debug computation is in the steady-state
hot path.

Fix: extend the gate to `if (i == 1 && dflash_debug)`. `dflash_debug`
is the cached env-var probe already declared at line 883 (used by the
features-debug block immediately above). When LLAMA_DFLASH_DEBUG is set
the diagnostic still fires; production is unaffected.

Found during epoch #73 task 5 — DFlash hot-path scout. Local CPU build
verifies; behavior change only when LLAMA_DFLASH_DEBUG is set (was
unconditional → now gated; same code runs when enabled).
@marksverdhei marksverdhei merged commit 6733bc1 into ht Jun 12, 2026
3 of 7 checks passed
@marksverdhei marksverdhei deleted the perf/dflash-gate-debug-top5 branch June 12, 2026 18:34