perf(tokenizer): cache BPE + bounded counts, fast-path truncate (-57% CPU, -22% RSS) #1741

Merged: esengine merged 1 commit into main from perf/tokenizer-cache-and-truncate-fastpath on May 25, 2026

@esengine

Owner

Summary

Three small changes that together cut per-turn CPU ~57% and steady-state RSS ~22% in the 200-turn fakeFetch probe (RSS 256MB → 181MB at log.len=800).

Users reported Node memory growing quickly and high CPU in CLI sessions; profiling on main showed the tokenizer was the dominant cost:

| Metric | Before | After |
| --- | --- | --- |
| 200-turn probe, CPU profile total | 6129 ms | 2667 ms |
| 300-turn steady-state RSS | 244 MB | 191 MB |
| bpeEncode self | 29.5% | 1.6% |
| estimateTurnStart total | 16.4% | <0.5% |
| truncateForModelByTokens total | 37.2% | <0.1% |

Why this regressed after v0.48.0

#1642 / #1646 collapsed the conditional preflight into an unconditional estimateTurnStart that runs every turn. That surfaced an underlying cost: the pure-TS BPE port has no caching and was being called on growing logs every turn. Pre-0.48 the preflight only ran when the model said context was hot, so the cost stayed invisible.

What changed

  • bpeEncode — in-place splice instead of slice/spread rebuild on every merge, plus an 8K-entry LRU cache. Repetitive tool output (padded payloads, repeated identifiers in code) was re-encoding the same byte-level chunks thousands of times per session. Cache caps at ~400KB.

  • estimateConversationTokens — drop the full formatDeepSeekPrompt rebuild. Sum per-message bounded counts with a fixed template overhead, gated by a content-string-keyed 4K LRU. Same conversation entry now tokenizes once over its lifetime instead of once per turn. The estimate drives fold thresholds (50% / 75% of ctx) where ±5% slop is harmless.

  • truncateForModelByTokens — sample-based fast path. For inputs in the [maxTokens, maxTokens*4] range the old code unconditionally tokenized the full string. Now we use a 2KB-sample estimate with a 1.15× safety margin; only borderline cases fall through to a precise tokenize.
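
To make the sample-based fast path concrete, here is a minimal sketch of the idea, not the code in this PR. It assumes the estimate extrapolates the token density of the first 2KB to the whole string and that the 1.15× margin is applied on both sides of the limit; `classifyBySample`, `Decision`, and the word-counting `countWords` stand-in for the real tokenizer are hypothetical names.

```typescript
// Hypothetical tokenizer signature; the real BPE encoder is not reproduced here.
type CountTokens = (text: string) => number;

// Stand-in tokenizer for illustration only: one token per whitespace-separated word.
const countWords: CountTokens = (t) => t.split(/\s+/).filter(Boolean).length;

const SAMPLE_CHARS = 2048;  // "2KB sample" from the description
const SAFETY_MARGIN = 1.15; // "1.15x safety margin" from the description

interface Decision {
  verdict: "fits" | "truncate" | "borderline";
  estimate: number;
}

function classifyBySample(
  text: string,
  maxTokens: number,
  countTokens: CountTokens,
): Decision {
  // Short inputs are cheap to tokenize exactly, so no estimate is needed.
  if (text.length <= SAMPLE_CHARS) {
    const exact = countTokens(text);
    return { verdict: exact <= maxTokens ? "fits" : "truncate", estimate: exact };
  }

  // Tokenize only the sample and extrapolate its density to the full length.
  const sample = text.slice(0, SAMPLE_CHARS);
  const tokensPerChar = countTokens(sample) / sample.length;
  const estimate = Math.ceil(tokensPerChar * text.length);

  // Assumption: the margin guards both directions. Clearly under or clearly
  // over the limit is decided from the estimate; the rest is borderline.
  if (estimate * SAFETY_MARGIN <= maxTokens) return { verdict: "fits", estimate };
  if (estimate >= maxTokens * SAFETY_MARGIN) return { verdict: "truncate", estimate };
  return { verdict: "borderline", estimate };
}
```

In this sketch only a `borderline` verdict would need the precise full-string tokenize; the other two verdicts cost one 2KB tokenize regardless of input size.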

Probes added

So this is reproducible going forward:

  • scripts/probe-mem-leak.mts — drives CacheFirstLoop through N turns with a fakeFetch, samples RSS / heap / log size every K turns. No API key needed.
  • scripts/probe-jobs-leak.mts — confirms JobRegistry.MAX_COMPLETED_JOBS cap evicts (it does).
  • scripts/analyze-cpuprofile.mjs — flat self/total time roll-up for any .cpuprofile from --cpu-prof or reasonix code --profile.
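
For readers unfamiliar with the `.cpuprofile` layout, the following is a rough sketch of what a flat self/total roll-up can look like. It is not the contents of scripts/analyze-cpuprofile.mjs. It relies on the standard V8 profile fields (`nodes`, `samples`, `timeDeltas`, times in microseconds), attributes each `timeDeltas[i]` to the node in `samples[i]` as an approximation, and groups by function name; `rollUp` and `Row` are hypothetical names.

```typescript
interface CallFrame { functionName: string; }
interface ProfileNode { id: number; callFrame: CallFrame; children?: number[]; }
interface CpuProfile { nodes: ProfileNode[]; samples?: number[]; timeDeltas?: number[]; }
interface Row { name: string; selfMs: number; totalMs: number; }

function rollUp(profile: CpuProfile): Row[] {
  const byId = new Map<number, ProfileNode>();
  for (const n of profile.nodes) byId.set(n.id, n);

  // Self time per node: sum the deltas of the samples that landed on it.
  const selfUs = new Map<number, number>();
  const samples = profile.samples ?? [];
  const deltas = profile.timeDeltas ?? [];
  for (let i = 0; i < samples.length; i++) {
    const id = samples[i];
    if (id === undefined) continue;
    selfUs.set(id, (selfUs.get(id) ?? 0) + (deltas[i] ?? 0));
  }

  const rows = new Map<string, Row>();
  const rowFor = (name: string): Row => {
    let row = rows.get(name);
    if (!row) {
      row = { name, selfMs: 0, totalMs: 0 };
      rows.set(name, row);
    }
    return row;
  };

  // Depth-first walk. A function already on the current path does not get its
  // subtree added to "total" again, so recursion is not double-counted.
  const onPath = new Set<string>();
  const visit = (id: number): number => {
    const node = byId.get(id);
    if (!node) return 0;
    const name = node.callFrame.functionName || "(anonymous)";
    const alreadyOnPath = onPath.has(name);
    onPath.add(name);
    const own = selfUs.get(id) ?? 0;
    let subtreeUs = own;
    for (const child of node.children ?? []) subtreeUs += visit(child);
    if (!alreadyOnPath) onPath.delete(name);
    const row = rowFor(name);
    row.selfMs += own / 1000;
    if (!alreadyOnPath) row.totalMs += subtreeUs / 1000;
    return subtreeUs;
  };

  // Roots are the nodes that never appear as someone's child.
  const childIds = new Set<number>();
  for (const n of profile.nodes) for (const c of n.children ?? []) childIds.add(c);
  for (const n of profile.nodes) if (!childIds.has(n.id)) visit(n.id);

  return [...rows.values()].sort((a, b) => b.selfMs - a.selfMs);
}
```

Feeding it `JSON.parse` of a profile written by `--cpu-prof` should give one row per function name, sorted by self time; dividing a row by the root's total gives percentages like those in the summary table.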

Test plan

  • 3625/3637 existing tests pass (12 pre-existing skips)
  • Comment-policy gate clean
  • Re-ran probe before + after — numbers above
  • Verified bounded LRU caps under 500-turn cache-defeating stress
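
As an illustration of the bounded-cache property checked in the last item, here is a small sketch of a Map-backed LRU gating per-message token counts. It is an assumption about the general shape, not the code in this PR: `LruCache`, `makeEstimator`, and the `PER_MESSAGE_OVERHEAD` value are hypothetical, and the caller supplies the token counter.

```typescript
// Minimal LRU built on Map's insertion order: the first key is the oldest.
class LruCache<K, V> {
  private readonly map = new Map<K, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // Evict the least recently used entry to keep the cache bounded.
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }

  get size(): number {
    return this.map.size;
  }
}

// Hypothetical fixed per-message template overhead, in tokens.
const PER_MESSAGE_OVERHEAD = 4;

function makeEstimator(countTokens: (s: string) => number, capacity: number) {
  const cache = new LruCache<string, number>(capacity);
  let misses = 0;

  // Sum cached per-message counts plus a fixed overhead per message.
  const estimate = (messages: string[]): number => {
    let total = 0;
    for (const content of messages) {
      let n = cache.get(content);
      if (n === undefined) {
        n = countTokens(content);
        cache.set(content, n);
        misses++;
      }
      total += n + PER_MESSAGE_OVERHEAD;
    }
    return total;
  };

  return { estimate, cacheSize: () => cache.size, misses: () => misses };
}
```

Under cache-defeating input (every message unique) the cache size in this sketch never exceeds `capacity`, while a conversation whose earlier entries repeat from turn to turn only pays for the newly appended messages.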

…uncate

Three changes that together cut per-turn CPU ~57% and steady-state RSS
~22% in the 200-turn fakeFetch probe (rss=256MB→181MB at log.len=800).

- bpeEncode: in-place splice instead of slice/spread rebuild on every
  merge, plus 8K-entry LRU cache. Repetitive tool output (padded
  payloads, identifiers in code) re-encodes the same byte-level chunks
  thousands of times per session; the cache caps that at ~400KB.

- estimateConversationTokens: drop the full formatDeepSeekPrompt
  rebuild + single bounded tokenize. Sum per-message bounded counts
  with a fixed template overhead, gated by a content-string-keyed
  4K-entry LRU. Same entry tokenizes once over its lifetime instead of
  once per turn. The estimate drives fold thresholds (50%/75% of ctx)
  where ±5% slop is harmless.

- truncateForModelByTokens: sample-based fast path. For inputs in the
  [maxTokens, maxTokens*4] range the old code unconditionally tokenized
  the full string (37% total CPU on the 200-turn probe). Now we use a
  2KB-sample estimate with a 1.15x safety margin; only borderline cases
  fall through to a precise tokenize.

Regression origin: #1642/#1646 collapsed the conditional preflight into
an unconditional estimateTurnStart that runs every turn, surfacing the
underlying tokenizer cost. The tokenizer itself has always been a
pure-TS BPE port without caching — fine when called rarely, expensive
when called on every turn against growing logs.

Also adds three probes that reproduce + measure:
- scripts/probe-mem-leak.mts — drives CacheFirstLoop through N turns
  with fakeFetch, samples RSS/heap/log
- scripts/probe-jobs-leak.mts — confirms JobRegistry's MAX_COMPLETED_JOBS
  cap actually evicts
- scripts/analyze-cpuprofile.mjs — flat self/total roll-up for any
  .cpuprofile produced by --cpu-prof or `reasonix code --profile`

@esengine merged commit d6aeafa into main on May 25, 2026
3 checks passed
@esengine deleted the perf/tokenizer-cache-and-truncate-fastpath branch on May 25, 2026 07:11