perf(tokenizer): cache BPE + bounded counts, fast-path truncate (-57% CPU, -22% RSS) by esengine · Pull Request #1741 · esengine/DeepSeek-Reasonix

esengine · 2026-05-25T07:11:12Z

Summary

Three small changes that together cut per-turn CPU ~57% and steady-state RSS ~22% in the 200-turn fakeFetch probe (RSS 256MB → 181MB at log.len=800).

Users reported fast-growing Node memory and high CPU in CLI sessions; profiling on main showed the tokenizer was the dominant cost:

| Metric | Before | After |
|---|---|---|
| 200-turn probe CPU profile total | 6129 ms | 2667 ms |
| 300-turn steady-state RSS | 244 MB | 191 MB |
| `bpeEncode` self | 29.5% | 1.6% |
| `estimateTurnStart` total | 16.4% | <0.5% |
| `truncateForModelByTokens` total | 37.2% | <0.1% |

Why this regressed after v0.48.0

#1642 / #1646 collapsed the conditional preflight into an unconditional estimateTurnStart that runs every turn. That surfaced an underlying cost: the pure-TS BPE port has no caching and was being called on growing logs every turn. Pre-0.48 the preflight only ran when the model said context was hot, so the cost stayed invisible.
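
To make the cost argument concrete, here is a toy model (not the project's code) that counts how many characters a tokenizer touches when the whole log is re-estimated every turn, compared with a variant that remembers per-message counts. `fakeTokenize`, the 4-chars-per-token guess, and the unbounded `Map` memo are all illustrative assumptions.

```typescript
// Toy cost model: characters "tokenized" over N turns, with and without a per-message memo.
// fakeTokenize is a hypothetical stand-in for the real BPE encoder.
let charsProcessed = 0;

function fakeTokenize(text: string): number {
  charsProcessed += text.length; // work is proportional to input size
  return Math.ceil(text.length / 4); // rough chars-per-token guess, illustration only
}

// Re-tokenizes every message in the log on every call.
function estimateUncached(log: string[]): number {
  let sum = 0;
  for (const msg of log) sum += fakeTokenize(msg);
  return sum;
}

// Unbounded memo for simplicity; a real cache would need a size cap.
const memo = new Map<string, number>();

// Tokenizes each distinct message once, then reuses the stored count.
function estimateMemoized(log: string[]): number {
  let sum = 0;
  for (const msg of log) {
    let n = memo.get(msg);
    if (n === undefined) {
      n = fakeTokenize(msg);
      memo.set(msg, n);
    }
    sum += n;
  }
  return sum;
}

// Appends one unique message per turn and estimates the full log each turn.
function simulate(turns: number, estimate: (log: string[]) => number): number {
  charsProcessed = 0;
  memo.clear();
  const log: string[] = [];
  for (let t = 0; t < turns; t++) {
    log.push(`turn ${t}: ` + "x".repeat(100));
    estimate(log);
  }
  return charsProcessed;
}

const uncachedWork = simulate(200, estimateUncached);
const memoizedWork = simulate(200, estimateMemoized);
console.log({ uncachedWork, memoizedWork });
```

In this model the uncached variant grows quadratically with the number of turns (about 100x the memoized work at 200 similar-sized turns), which is the shape of cost that stays invisible when the estimate only runs occasionally.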

What changed

- `bpeEncode`: in-place splice instead of slice/spread rebuild on every merge, plus an 8K-entry LRU cache. Repetitive tool output (padded payloads, repeated identifiers in code) was re-encoding the same byte-level chunks thousands of times per session. Cache caps at ~400KB.
- `estimateConversationTokens`: drop the full `formatDeepSeekPrompt` rebuild. Sum per-message bounded counts with a fixed template overhead, gated by a content-string-keyed 4K LRU. Same conversation entry now tokenizes once over its lifetime instead of once per turn. The estimate drives fold thresholds (50% / 75% of ctx) where ±5% slop is harmless.
- `truncateForModelByTokens`: sample-based fast path. For inputs in the [maxTokens, maxTokens*4] range the old code unconditionally tokenized the full string. Now we use a 2KB-sample estimate with a 1.15× safety margin; only borderline cases fall through to a precise tokenize.
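
As a rough illustration of the sample-based fast path described in the last item, the sketch below extrapolates a token count from a 2 KB prefix and only runs a precise tokenize when the estimate is within the 1.15× margin of the limit. It is one possible reading of that description, not the code in this PR: `CountTokens`, `truncateByTokens`, the three-way split, and the prefix binary search (which assumes the count never decreases as the prefix grows) are assumptions, and the sketch does not model the [maxTokens, maxTokens*4] length gate.

```typescript
// Hypothetical token counter type; the real encoder is not shown here.
type CountTokens = (text: string) => number;

const SAMPLE_CHARS = 2048; // "2KB sample" from the description
const SAFETY_MARGIN = 1.15; // safety margin from the description

interface TruncateResult {
  text: string;
  path: "fast-keep" | "fast-cut" | "precise";
}

// Extrapolate a token count for the whole string from its first SAMPLE_CHARS characters.
function estimateFromSample(text: string, countTokens: CountTokens): number {
  if (text.length <= SAMPLE_CHARS) return countTokens(text);
  const sample = text.slice(0, SAMPLE_CHARS);
  const tokensPerChar = countTokens(sample) / sample.length;
  return Math.ceil(tokensPerChar * text.length);
}

function truncateByTokens(
  text: string,
  maxTokens: number,
  countTokens: CountTokens,
): TruncateResult {
  const estimate = estimateFromSample(text, countTokens);

  // Comfortably under the limit even after padding the estimate: keep as-is.
  if (estimate * SAFETY_MARGIN <= maxTokens) return { text, path: "fast-keep" };

  // Comfortably over the limit: cut by the padded tokens-per-char ratio, no full tokenize.
  if (estimate > maxTokens * SAFETY_MARGIN) {
    const keepChars = Math.floor((text.length * maxTokens) / (estimate * SAFETY_MARGIN));
    return { text: text.slice(0, keepChars), path: "fast-cut" };
  }

  // Borderline: pay for a precise count, then binary-search the longest fitting prefix.
  if (countTokens(text) <= maxTokens) return { text, path: "precise" };
  let lo = 0;
  let hi = text.length;
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2);
    if (countTokens(text.slice(0, mid)) <= maxTokens) lo = mid;
    else hi = mid - 1;
  }
  return { text: text.slice(0, lo), path: "precise" };
}

// Toy counter for demonstration only: one token per whitespace-separated word.
const countWords: CountTokens = (t) => t.split(/\s+/).filter(Boolean).length;

const input = "word ".repeat(10000);
console.log(truncateByTokens(input, 20000, countWords).path);
console.log(truncateByTokens(input, 1000, countWords).path);
console.log(truncateByTokens(input, 9500, countWords).path);
```

The fast paths trade exactness for speed: a sample taken from the start of the string can misjudge text whose density changes later, which is why the margin exists and why near-limit inputs still take the precise route.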

Probes added

So this is reproducible going forward:

- `scripts/probe-mem-leak.mts`: drives `CacheFirstLoop` through N turns with a fakeFetch, samples RSS / heap / log size every K turns. No API key needed.
- `scripts/probe-jobs-leak.mts`: confirms the `JobRegistry.MAX_COMPLETED_JOBS` cap evicts (it does).
- `scripts/analyze-cpuprofile.mjs`: flat self/total time roll-up for any .cpuprofile from `--cpu-prof` or `reasonix code --profile`.
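
For readers unfamiliar with what a flat self/total roll-up involves, here is a minimal sketch over the V8 `.cpuprofile` JSON shape (`nodes`, `samples`, `timeDeltas` in microseconds). It is not the script added in this PR: the `rollUp` name, the choice to attribute each delta to the sample at the same index, and the recursion guard are assumptions, and the fixture profile is made up.

```typescript
// Subset of the V8 .cpuprofile shape that this sketch reads.
interface ProfileNode {
  id: number;
  callFrame: { functionName: string };
  children?: number[];
}
interface CpuProfile {
  nodes: ProfileNode[];
  samples: number[]; // node id observed at each sample
  timeDeltas: number[]; // microseconds between consecutive samples
}
interface Row {
  name: string;
  selfMs: number;
  totalMs: number;
}

function rollUp(profile: CpuProfile): Row[] {
  const byId = new Map<number, ProfileNode>();
  const childIds = new Set<number>();
  for (const node of profile.nodes) {
    byId.set(node.id, node);
    for (const child of node.children ?? []) childIds.add(child);
  }

  // Self time per node: sum the deltas of the samples that landed on it.
  const selfUs = new Map<number, number>();
  profile.samples.forEach((id, i) => {
    selfUs.set(id, (selfUs.get(id) ?? 0) + (profile.timeDeltas[i] ?? 0));
  });

  const rows = new Map<string, Row>();
  function rowFor(name: string): Row {
    let row = rows.get(name);
    if (!row) {
      row = { name, selfMs: 0, totalMs: 0 };
      rows.set(name, row);
    }
    return row;
  }

  // Depth-first walk returning subtree time; total is credited only to the
  // outermost frame of a given name so recursion is not double counted.
  function visit(id: number, onStack: Set<string>): number {
    const node = byId.get(id);
    if (!node) return 0;
    const name = node.callFrame.functionName || "(anonymous)";
    const outermost = !onStack.has(name);
    if (outermost) onStack.add(name);
    const own = selfUs.get(id) ?? 0;
    let subtree = own;
    for (const child of node.children ?? []) subtree += visit(child, onStack);
    if (outermost) onStack.delete(name);
    const row = rowFor(name);
    row.selfMs += own / 1000;
    if (outermost) row.totalMs += subtree / 1000;
    return subtree;
  }

  for (const node of profile.nodes) {
    if (!childIds.has(node.id)) visit(node.id, new Set<string>());
  }
  return Array.from(rows.values()).sort((a, b) => b.selfMs - a.selfMs);
}

// Made-up fixture: (root) -> outer -> inner.
const fixture: CpuProfile = {
  nodes: [
    { id: 1, callFrame: { functionName: "(root)" }, children: [2] },
    { id: 2, callFrame: { functionName: "outer" }, children: [3] },
    { id: 3, callFrame: { functionName: "inner" } },
  ],
  samples: [2, 3, 3, 2],
  timeDeltas: [1000, 2000, 3000, 500],
};
const rows = rollUp(fixture);
console.log(rows);
```

Self time answers "where is the CPU actually spinning", while total time answers "which entry point is responsible", so a table of percentages like the one above is easier to interpret with both columns.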

Test plan

- 3625/3637 existing tests pass (12 pre-existing skips)
- Comment-policy gate clean
- Re-ran probe before + after (numbers above)
- Verified bounded LRU caps under 500-turn cache-defeating stress
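
The last check can be pictured with a small sketch: a `Map`-backed LRU whose size is sampled while every key is unique, so nothing ever hits. This is not the cache implementation from this PR; the `LruCache` class, the 4096 capacity (echoing the "4K" figure), and the 20-keys-per-turn stress shape are illustrative assumptions.

```typescript
// Minimal LRU built on Map, which iterates keys in insertion order.
class LruCache<V> {
  private readonly map = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.map.get(key);
    if (value === undefined) return undefined;
    // Re-insert so this key becomes the most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    if (this.map.has(key)) {
      this.map.delete(key);
    } else if (this.map.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.map.keys().next().value;
      if (oldest !== undefined) this.map.delete(oldest);
    }
    this.map.set(key, value);
  }

  get size(): number {
    return this.map.size;
  }
}

// Cache-defeating stress: every key is unique, so the cache never hits.
function stress(turns: number, keysPerTurn: number, capacity: number): number {
  const cache = new LruCache<number>(capacity);
  let peak = 0;
  for (let t = 0; t < turns; t++) {
    for (let k = 0; k < keysPerTurn; k++) {
      cache.set(`turn-${t}-chunk-${k}`, k);
    }
    peak = Math.max(peak, cache.size);
  }
  return peak;
}

const peakSize = stress(500, 20, 4096);
console.log(peakSize);
```

Because eviction happens on insert, the entry count never exceeds the capacity regardless of how adversarial the key stream is; the memory bound then depends only on capacity and per-entry size.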

…uncate

Three changes that together cut per-turn CPU ~57% and steady-state RSS ~22% in the 200-turn fakeFetch probe (rss=256MB→181MB at log.len=800).

- bpeEncode: in-place splice instead of slice/spread rebuild on every merge, plus 8K-entry LRU cache. Repetitive tool output (padded payloads, identifiers in code) re-encodes the same byte-level chunks thousands of times per session; the cache caps that at ~400KB.
- estimateConversationTokens: drop the full formatDeepSeekPrompt rebuild + single bounded tokenize. Sum per-message bounded counts with a fixed template overhead, gated by a content-string-keyed 4K-entry LRU. Same entry tokenizes once over its lifetime instead of once per turn. The estimate drives fold thresholds (50%/75% of ctx) where ±5% slop is harmless.
- truncateForModelByTokens: sample-based fast path. For inputs in the [maxTokens, maxTokens*4] range the old code unconditionally tokenized the full string (37% total CPU on the 200-turn probe). Now we use a 2KB-sample estimate with a 1.15x safety margin; only borderline cases fall through to a precise tokenize.

Regression origin: #1642/#1646 collapsed the conditional preflight into an unconditional estimateTurnStart that runs every turn, surfacing the underlying tokenizer cost. The tokenizer itself has always been a pure-TS BPE port without caching: fine when called rarely, expensive when called on every turn against growing logs.

Also adds three probes that reproduce + measure:

- scripts/probe-mem-leak.mts: drives CacheFirstLoop through N turns with fakeFetch, samples RSS/heap/log
- scripts/probe-jobs-leak.mts: confirms JobRegistry's MAX_COMPLETED_JOBS cap actually evicts
- scripts/analyze-cpuprofile.mjs: flat self/total roll-up for any .cpuprofile produced by --cpu-prof or `reasonix code --profile`

esengine merged commit d6aeafa into main May 25, 2026
3 checks passed

esengine deleted the perf/tokenizer-cache-and-truncate-fastpath branch May 25, 2026 07:11

This was referenced May 27, 2026

Typing is too laggy, and it gets laggier the longer I use it #1880

Closed

Some complaints after keeping my local reasonix updated every day #1845

Closed
