Summary
StreamingCard.tsx runs a full BPE tokenize on the accumulated
streaming reply on every render, not just on the 1Hz tick. With
flash streaming chunks at 30+/s, that's 30+ countTokens(card.text)
calls per second, each on a string that grows unbounded over the
course of the reply.
Same root family as #558 (unbounded BPE on hot paths) but a
different surface — the UI render loop, not the loop's preflight /
shrink path. Different shape of fix: live display only needs an
estimate; exact count is only meaningful when the card settles.
Verified
- src/cli/ui/cards/StreamingCard.tsx:6 imports countTokens.
- Two callsites:
  - Line 100: tokenRate(card.text, card.ts, Date.now()) on every
    render of the live (streaming) card.
  - Line 66: tokenRate(card.text, card.ts, card.endedAt ?? ...) on
    the card.done && !card.aborted branch (settles to Static, runs
    once).
- tokenRate at lines 32-43 always calls countTokens(text).
- useSlowTick() at line 58 forces a 1Hz re-render so the rate keeps
  updating when chunks stall, but normal chunk arrival (props
  change) re-renders much more frequently.
So the actual frequency is bounded below by 1Hz and above by chunk
rate (~30Hz on flash). Both bounds matter.
Why it can hurt
Most replies are <5000 chars; BPE on that is ~ms and unnoticeable.
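But the cost scales with card.text.length (worst case worse; see
#558 on the pathological repetitive shape), and it is paid again on
every render.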
For a long reply with repetitive content (large code blocks, table
output, log dumps), each tokenize can run hundreds of ms. Repeated
tens of times per second on the main thread, the user notices
typing input lag and dropped frames in the streaming preview
animation.
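A rough cost model makes the growth concrete. The sketch below assumes each render tokenizes the whole accumulated text and that tokenize cost is at least proportional to the character count; the chunk count and chunk size are illustrative values, not measurements from StreamingCard.tsx.

```typescript
// Model: the reply arrives as `chunks` pieces of `chunkChars` characters,
// and every arrival triggers one tokenize of the whole accumulated text.
function totalCharsTokenized(chunks: number, chunkChars: number): number {
  // Render k sees k * chunkChars characters; sum over k = 1..chunks.
  return (chunkChars * chunks * (chunks + 1)) / 2;
}

// Same stream, tokenized only once when the reply settles.
function charsTokenizedOnce(chunks: number, chunkChars: number): number {
  return chunks * chunkChars;
}

// Illustrative numbers: 1000 chunks of 20 chars (a 20,000-char reply).
const perRender = totalCharsTokenized(1000, 20); // 10,010,000 chars
const once = charsTokenizedOnce(1000, 20);       // 20,000 chars
console.log(perRender / once);                   // 500.5
```

Under this model the total work is quadratic in the number of chunks, so a reply ten times longer costs about a hundred times more tokenizer work over its lifetime.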
Fix
Two branches, different policies.
done branch (line 66) — leave alone
Runs once when the card settles. Exact tokenize is correct here:
the result is what the user sees on the final pill, accuracy
matters, and the cost is paid once.
Live branch (line 100) — estimate, calibrate sparsely
Replace the per-render countTokens with a length-bucketed
estimate. Calibrate the char/token ratio with a real BPE call only
when the text has grown by a threshold (e.g. +500 chars) since the
last calibration.
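    // Inside StreamingCard; needs useMemo and useRef from 'react'.
    // Last calibration point: text length and its exact token count.
    const tokenEstimateRef = useRef({ chars: 0, tokens: 0 });
    const liveTokens = useMemo(() => {
      const last = tokenEstimateRef.current;
      const sinceCalibrate = card.text.length - last.chars;
      // Reuse the last measured ratio until the text grows by 500 chars.
      // A negative delta means the text was replaced, so recalibrate.
      if (last.tokens > 0 && sinceCalibrate >= 0 && sinceCalibrate < 500) {
        return Math.ceil(card.text.length * (last.tokens / last.chars));
      }
      // Recalibrate with one real BPE call.
      const exact = countTokens(card.text);
      tokenEstimateRef.current = { chars: card.text.length, tokens: exact };
      return exact;
    }, [card.text]);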
Net effect: one BPE call per +500 chars instead of one per chunk.
The saving is roughly 500 / (average chunk size in chars): for a
30 chunks/s flash stream with chunks of a few chars each, that is
up to about two orders of magnitude fewer tokenize calls. The
displayed t/s pill is approximate already; ±2% drift from the
estimate is invisible.
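To check the call-count claim outside React, the same policy can be sketched as a plain class. Everything here is hypothetical scaffolding: CalibratedTokenEstimator is an illustrative name, and fakeCountTokens is a stand-in that assumes about one token per 4 chars; it is not the project's countTokens.

```typescript
type Tokenizer = (text: string) => number;

// Hypothetical stand-in for countTokens: about one token per 4 chars.
const fakeCountTokens: Tokenizer = (text) => Math.ceil(text.length / 4);

class CalibratedTokenEstimator {
  private chars = 0;
  private tokens = 0;
  private readonly tokenize: Tokenizer;
  private readonly threshold: number;
  calibrations = 0; // number of real tokenize calls made so far

  constructor(tokenize: Tokenizer, threshold = 500) {
    this.tokenize = tokenize;
    this.threshold = threshold;
  }

  estimate(text: string): number {
    const grown = text.length - this.chars;
    // Reuse the last measured ratio while growth stays under the threshold.
    // A negative `grown` means the text was replaced, so recalibrate.
    if (this.tokens > 0 && grown >= 0 && grown < this.threshold) {
      return Math.ceil(text.length * (this.tokens / this.chars));
    }
    this.calibrations += 1;
    this.chars = text.length;
    this.tokens = this.tokenize(text);
    return this.tokens;
  }
}

// Simulate 300 chunks of 20 chars each (6000 chars total).
const estimator = new CalibratedTokenEstimator(fakeCountTokens);
let streamed = "";
let last = 0;
for (let i = 0; i < 300; i++) {
  streamed += "x".repeat(20);
  last = estimator.estimate(streamed);
}
console.log(estimator.calibrations, last); // 12 1500
```

In this simulation 300 renders trigger 12 tokenize calls. With a real tokenizer the ratio drifts between calibrations, so the estimate will not match the exact count the way it does with the uniform stand-in.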
Same idea — sample-based ratio extrapolation — already lives in
src/mcp/registry.ts:180 for the truncation path. This is a
companion piece of the same pattern in a different layer.
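For readers unfamiliar with the pattern, here is a minimal sketch of sample-based ratio extrapolation in general. It is not the code in src/mcp/registry.ts, which is not reproduced in this issue; estimateBySample, the 2000-char sample size, and the stand-in tokenizer are all assumptions made for illustration.

```typescript
type Tokenizer = (text: string) => number;

// Hypothetical stand-in tokenizer: about one token per 4 chars.
const fakeCountTokens: Tokenizer = (text) => Math.ceil(text.length / 4);

// Tokenize at most `sampleChars` characters, then scale the measured
// tokens-per-char ratio up to the full length.
function estimateBySample(
  text: string,
  tokenize: Tokenizer,
  sampleChars = 2000,
): number {
  // Small inputs are cheap enough to count exactly.
  if (text.length <= sampleChars) return tokenize(text);
  // Note: slice works on UTF-16 code units and may split a surrogate pair
  // at the boundary; acceptable for an estimate.
  const sample = text.slice(0, sampleChars);
  const ratio = tokenize(sample) / sample.length;
  return Math.ceil(text.length * ratio);
}

console.log(estimateBySample("a".repeat(100), fakeCountTokens));    // 25
console.log(estimateBySample("a".repeat(10_000), fakeCountTokens)); // 2500
```

The tokenizer cost is capped by the sample size regardless of input length, at the price of assuming the prefix is representative of the rest.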
Scope
- Single file: src/cli/ui/cards/StreamingCard.tsx.
- ~20 lines changed (refactor tokenRate callers in live path).
- No new dependencies, no API changes, no test scaffolding beyond
  the existing snapshot tests.
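Out of scope
- countTokens cache or memoization across components: that belongs
  in the #558 conversation about countTokensBounded.
- useSlowTick / chunk batching: orthogonal; current re-render
  frequency is fine if each render is cheap.
- done branch: exact tokenize stays.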
Relates to
#558 — same family ("don't pay BPE cost when an estimate suffices"),
different layer. The fixes don't share code but share the principle.