StreamingCard re-tokenizes full reply text on every chunk — estimate during streaming, exact only at done #562

@esengine

Summary

StreamingCard.tsx runs a full BPE tokenize on the accumulated
streaming reply on every render, not just on the 1Hz tick. With
flash streaming chunks at 30+/s, that's 30+ countTokens(card.text)
calls per second, each on a string that grows unbounded over the
course of the reply.

Same root family as #558 (unbounded BPE on hot paths) but a
different surface — the UI render loop, not the loop's preflight /
shrink path. Different shape of fix: live display only needs an
estimate; exact count is only meaningful when the card settles.

Verified

  • src/cli/ui/cards/StreamingCard.tsx:6 imports countTokens.
  • Two callsites:
    • Line 100: tokenRate(card.text, card.ts, Date.now()) on
      every render of the live (streaming) card.
    • Line 66: tokenRate(card.text, card.ts, card.endedAt ?? ...)
      on the card.done && !card.aborted branch (settles to Static,
      runs once).
  • tokenRate at lines 32-43 always calls countTokens(text).
  • useSlowTick() at line 58 forces a 1Hz re-render so the rate
    keeps updating when chunks stall — but normal chunk arrival
    (props change) re-renders much more frequently.

So the actual frequency is bounded below by 1Hz and above by chunk
rate (~30Hz on flash). Both bounds matter.

Why it can hurt

Most replies are <5000 chars; BPE on that takes on the order of
milliseconds and is unnoticeable. But the cost scales with reply length:

For a long reply with repetitive content (large code blocks, table
output, log dumps), each tokenize can run hundreds of ms. Repeated
tens of times per second on the main thread, the user notices
typing input lag and dropped frames in the streaming preview
animation.
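
To make the growth concrete, here is a rough cost model, not a measurement. It counts how many characters pass through the tokenizer over one reply, first when every chunk re-tokenizes the full text, then when the tokenizer only runs after the text has grown by a threshold. The function names and the inputs (reply length, chunk size, threshold) are hypothetical and chosen for illustration only.

```typescript
/** Every chunk arrival re-tokenizes the full accumulated text. */
function charsTokenizedPerChunk(replyChars: number, chunkChars: number): number {
  let total = 0;
  for (let len = chunkChars; len <= replyChars; len += chunkChars) {
    total += len; // one full-text tokenize per chunk
  }
  return total;
}

/** Tokenize only on the first chunk and whenever the text grew by thresholdChars. */
function charsTokenizedWithThreshold(
  replyChars: number,
  chunkChars: number,
  thresholdChars: number,
): number {
  let total = 0;
  let lastCalibrated = 0;
  let calibrated = false;
  for (let len = chunkChars; len <= replyChars; len += chunkChars) {
    if (!calibrated || len - lastCalibrated >= thresholdChars) {
      total += len; // one full-text tokenize at this calibration point
      lastCalibrated = len;
      calibrated = true;
    }
  }
  return total;
}

// Assumed scenario: a 20,000-char reply arriving in 4-char chunks.
console.log(charsTokenizedPerChunk(20000, 4));           // 50010000
console.log(charsTokenizedWithThreshold(20000, 4, 500)); // 390160
```

Under those assumed inputs the per-chunk policy feeds about 128 times more characters through the tokenizer, and the gap widens as the reply gets longer because the per-chunk total grows quadratically with reply length.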

Fix

Two branches, different policies.

done branch (line 66) — leave alone

Runs once when the card settles. Exact tokenize is correct here:
the result is what the user sees on the final pill, accuracy
matters, and the cost is paid once.

Live branch (line 100) — estimate, calibrate sparsely

Replace the per-render countTokens with a length-bucketed
estimate. Calibrate the char/token ratio with a real BPE call only
when the text has grown by a threshold (e.g. +500 chars) since the
last calibration.

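import { useMemo, useRef } from "react";

// Inside the live (streaming) branch of the component.
// Last exact measurement: text length and token count at calibration time.
const tokenEstimateRef = useRef({ chars: 0, tokens: 0 });
const liveTokens = useMemo(() => {
  const last = tokenEstimateRef.current;
  const sinceCalibrate = card.text.length - last.chars;
  // Under 500 new chars since the last exact count: extrapolate from the ratio.
  if (sinceCalibrate < 500 && last.tokens > 0) {
    const ratio = last.tokens / last.chars;
    return Math.ceil(card.text.length * ratio);
  }
  // Otherwise recalibrate with one real BPE pass and remember the result.
  const exact = countTokens(card.text);
  tokenEstimateRef.current = { chars: card.text.length, tokens: exact };
  return exact;
}, [card.text]);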

Net effect: one BPE call per +500 chars instead of one per chunk.
The reduction is roughly 500 divided by the average chunk size in
chars, so for a 30 chunks/s flash stream with chunks of a few chars
each that's ~2 orders of magnitude fewer tokenize calls. The
displayed t/s pill is approximate already; ±2% drift from the
estimate is invisible.
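
The same policy can be pulled out of React into a plain closure, which makes it easy to check how often the real tokenizer runs. This is a minimal sketch under assumptions: makeTokenEstimator and fakeCountTokens are hypothetical names, the stand-in tokenizer (about one token per 4 chars) is not the real BPE, and the extra check that recalibrates when the text gets shorter is an addition not present in the snippet above.

```typescript
type CountTokens = (text: string) => number;

/** Returns an estimator that calls countTokens only every thresholdChars of growth. */
function makeTokenEstimator(countTokens: CountTokens, thresholdChars = 500) {
  let calibratedChars = 0;
  let calibratedTokens = 0;
  return function estimate(text: string): number {
    const grown = text.length - calibratedChars;
    // Extrapolate from the last exact ratio while growth stays under the threshold.
    // grown >= 0 is an added guard: a shorter text forces a fresh exact count.
    if (calibratedTokens > 0 && grown >= 0 && grown < thresholdChars) {
      return Math.ceil(text.length * (calibratedTokens / calibratedChars));
    }
    const exact = countTokens(text);
    calibratedChars = text.length;
    calibratedTokens = exact;
    return exact;
  };
}

// Stand-in tokenizer for the demo only; it also counts how often it is called.
let calls = 0;
const fakeCountTokens: CountTokens = (text) => {
  calls += 1;
  return Math.ceil(text.length / 4);
};

const estimate = makeTokenEstimator(fakeCountTokens);
let text = "";
let last = 0;
for (let i = 0; i < 500; i++) {
  text += "abcd"; // simulate 500 chunks of 4 chars each
  last = estimate(text);
}
console.log(calls, last); // logs: 4 500
```

With these assumed inputs, 500 chunk arrivals trigger 4 exact counts (at 4, 504, 1004, and 1504 chars). A real tokenizer's ratio drifts between calibrations, so the estimate would not match exactly the way it does with this uniform stand-in.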

Same idea — sample-based ratio extrapolation — already lives in
src/mcp/registry.ts:180 for the truncation path. This is a
companion piece of the same pattern in a different layer.

Scope

  • Single file: src/cli/ui/cards/StreamingCard.tsx.
  • ~20 lines changed (refactor tokenRate callers in live path).
  • No new dependencies, no API changes, no test scaffolding beyond
    the existing snapshot tests.
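
One way the caller refactor could look is a rate helper that takes a precomputed token count, so the live path can pass an estimate while the done path keeps passing an exact count. This is a sketch under assumptions: the issue does not show the body of tokenRate, so the formula here (tokens divided by elapsed seconds, with timestamps in milliseconds as Date.now() returns) is a guess, and tokenRateFromCount is a hypothetical name.

```typescript
/**
 * Tokens per second from an already-known token count.
 * Hypothetical helper; timestamps are assumed to be epoch milliseconds.
 */
function tokenRateFromCount(tokens: number, startedAtMs: number, nowMs: number): number {
  const elapsedSec = (nowMs - startedAtMs) / 1000;
  if (elapsedSec <= 0) return 0; // avoid dividing by zero on the first render
  return tokens / elapsedSec;
}

// 500 tokens over 2 seconds.
console.log(tokenRateFromCount(500, 1000, 3000)); // 250
```

Keeping the count as a parameter means the choice between estimate and exact count stays with the caller, and the rate math itself never touches the tokenizer.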

Out of scope

Relates to

#558 — same family ("don't pay BPE cost when an estimate suffices"),
different layer. The fixes don't share code but share the principle.

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), rendering (Terminal rendering / flicker / repaint issues)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests