StreamingCard re-tokenizes full reply text on every chunk — estimate during streaming, exact only at done #562

## Summary

`StreamingCard.tsx` runs a full BPE tokenize on the accumulated
streaming reply on **every render**, not just on the 1Hz tick. With
flash streaming at 30+ chunks/s, that's 30+ `countTokens(card.text)`
calls per second, each on a string that keeps growing for the whole
duration of the reply.

Same root family as #558 (unbounded BPE on hot paths) but a
different surface — the UI render loop, not the loop's preflight /
shrink path. Different shape of fix: live display only needs an
estimate; exact count is only meaningful when the card settles.

## Verified

- `src/cli/ui/cards/StreamingCard.tsx:6` imports `countTokens`.
- Two callsites:
  - **Line 100** — `tokenRate(card.text, card.ts, Date.now())` on
    every render of the live (streaming) card.
  - **Line 66** — `tokenRate(card.text, card.ts, card.endedAt ?? ...)`
    on the `card.done && !card.aborted` branch (settles to Static,
    runs once).
- `tokenRate` at lines 32-43 always calls `countTokens(text)`.
- `useSlowTick()` at line 58 forces a 1Hz re-render so the rate
  keeps updating when chunks stall — but normal chunk arrival
  (props change) re-renders much more frequently.

So the actual frequency is bounded below by 1Hz and above by chunk
rate (~30Hz on flash). Both bounds matter.

## Why it can hurt

Most replies are <5000 chars; BPE on that takes on the order of a
millisecond and is unnoticeable. But the cost is:

- Linear in `card.text.length` (worst case worse — see #558 on the
  pathological repetitive shape).
- Multiplied by render count.

For a long reply with repetitive content (large code blocks, table
output, log dumps), each tokenize can run hundreds of ms. Repeated
tens of times per second on the main thread, the user notices
typing input lag and dropped frames in the streaming preview
animation.
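
To make the "linear cost times render count" point concrete, here is a small arithmetic sketch. It assumes evenly sized chunks and one full re-tokenize per chunk; the function name and the 20-char chunk size are illustrative, not taken from the codebase.

```typescript
// Total characters fed to the tokenizer over a whole reply when every chunk
// triggers a full re-tokenize of the text accumulated so far.
// Assumes evenly sized chunks (illustrative only).
function totalCharsTokenized(replyChars: number, chunkChars: number): number {
  let total = 0;
  for (let len = chunkChars; len <= replyChars; len += chunkChars) {
    total += len; // each render re-scans everything received so far
  }
  return total;
}

// A 20,000-char reply arriving in 20-char chunks is 1,000 renders.
console.log(totalCharsTokenized(20_000, 20)); // 10010000
```

Under these assumptions the tokenizer scans about 10 million characters to display a 20,000-char reply, so total work grows quadratically with reply length even when a single tokenize is linear.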

## Fix

Two branches, different policies.

### `done` branch (line 66) — leave alone

Runs once when the card settles. Exact tokenize is correct here:
the result is what the user sees on the final pill, accuracy
matters, and the cost is paid once.

### Live branch (line 100) — estimate, calibrate sparsely

Replace the per-render `countTokens` with an estimate extrapolated
from the last measured tokens-per-char ratio. Recalibrate that ratio
with a real BPE call only when the text has grown by a threshold
(e.g. +500 chars) since the last calibration.

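```tsx
import { useMemo, useRef } from "react";

// Inside the StreamingCard component body.
// Last calibration point: text length and its exact token count.
const tokenEstimateRef = useRef({ chars: 0, tokens: 0 });
const liveTokens = useMemo(() => {
  const { chars, tokens } = tokenEstimateRef.current;
  const sinceCalibrate = card.text.length - chars;
  // Extrapolate from the last measured ratio while growth stays under the threshold.
  if (tokens > 0 && sinceCalibrate >= 0 && sinceCalibrate < 500) {
    return Math.ceil(card.text.length * (tokens / chars));
  }
  // First render, text shrank, or +500 chars since calibration: run real BPE.
  const exact = countTokens(card.text);
  tokenEstimateRef.current = { chars: card.text.length, tokens: exact };
  return exact;
}, [card.text]);
```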

Net effect: one BPE call per +500 chars instead of one per chunk.
How much that saves depends on chunk size: for a 30 chunks/s flash
stream, assuming 5-20 chars per chunk, that's roughly 25-100x fewer
tokenize calls. The displayed `t/s` pill is approximate already;
±2% drift from the estimate is invisible.
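
The same calibration logic can be exercised outside React to check how many real tokenize calls it makes. The sketch below is hedged: `makeLiveTokenEstimator` and `simulateStream` are hypothetical names, and the tokenizer is a stand-in (about 4 chars per token), not the real BPE `countTokens`.

```typescript
type Tokenizer = (text: string) => number;

// Hypothetical helper, not part of the codebase: estimates token counts and
// calls the real tokenizer only when the text has grown by `threshold` chars
// since the last calibration (or has shrunk, or was never calibrated).
function makeLiveTokenEstimator(countTokens: Tokenizer, threshold = 500) {
  let calibrated = { chars: 0, tokens: 0 };
  return (text: string): number => {
    const grown = text.length - calibrated.chars;
    if (calibrated.tokens > 0 && grown >= 0 && grown < threshold) {
      return Math.ceil(text.length * (calibrated.tokens / calibrated.chars));
    }
    const exact = countTokens(text);
    calibrated = { chars: text.length, tokens: exact };
    return exact;
  };
}

// Feeds evenly sized chunks through the estimator and counts tokenizer calls.
function simulateStream(chunks: number, chunkChars: number) {
  let calls = 0;
  // Stand-in tokenizer (NOT real BPE): roughly 4 chars per token.
  const fakeCountTokens: Tokenizer = (text) => {
    calls += 1;
    return Math.ceil(text.length / 4);
  };
  const estimate = makeLiveTokenEstimator(fakeCountTokens);
  let text = "";
  let last = 0;
  for (let i = 0; i < chunks; i++) {
    text += "x".repeat(chunkChars);
    last = estimate(text);
  }
  return { calls, last };
}

console.log(simulateStream(1000, 20)); // { calls: 40, last: 5000 }
```

With 1,000 chunks of 20 chars, the stand-in tokenizer runs 40 times instead of 1,000. Injecting the tokenizer as a parameter is a design choice for the sketch only; it keeps the calibration logic testable without pulling in the real BPE tables.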

Same idea — sample-based ratio extrapolation — already lives in
`src/mcp/registry.ts:180` for the truncation path. This is a
companion piece of the same pattern in a different layer.

## Scope

- Single file: `src/cli/ui/cards/StreamingCard.tsx`.
- ~20 lines changed (refactor `tokenRate` callers in live path).
- No new dependencies, no API changes, no test scaffolding beyond
  the existing snapshot tests.
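
One possible shape for the caller refactor in the live path is to separate the rate math from the token counting. This is a sketch under an assumption: the body of `tokenRate` is not shown in this issue, so the formula below (tokens divided by elapsed seconds) and the name `rateFromCount` are hypothetical.

```typescript
// Hypothetical helper: rate math only, with the token count passed in.
// The live path would pass an estimate, the done path an exact count.
function rateFromCount(tokens: number, startMs: number, endMs: number): number {
  const elapsedSec = (endMs - startMs) / 1000;
  if (elapsedSec <= 0) return 0; // avoid divide-by-zero on the very first render
  return tokens / elapsedSec;
}

// 300 tokens over 3 seconds.
console.log(rateFromCount(300, 1_000, 4_000)); // 100
```

With a split like this, the `done` branch could keep calling the exact tokenizer and only the live branch would swap in the estimate.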

## Out of scope

- Global `countTokens` cache or memoization across components — that
  belongs in the #558 conversation about `countTokensBounded`.
- Rebuilding `useSlowTick` / chunk batching — orthogonal; current
  re-render frequency is fine if each render is cheap.
- The `done` branch — exact tokenize stays.

## Relates to

#558 — same family ("don't pay BPE cost when an estimate suffices"),
different layer. The fixes don't share code but share the principle.

