perf(tokenizer): cache BPE + bounded counts, fast-path truncate (-57% CPU, -22% RSS) #1741
Merged
…uncate

Three changes that together cut per-turn CPU ~57% and steady-state RSS ~22% in the 200-turn fakeFetch probe (rss=256MB→181MB at log.len=800).

- bpeEncode: in-place splice instead of slice/spread rebuild on every merge, plus 8K-entry LRU cache. Repetitive tool output (padded payloads, identifiers in code) re-encodes the same byte-level chunks thousands of times per session; the cache caps that at ~400KB.
- estimateConversationTokens: drop the full formatDeepSeekPrompt rebuild + single bounded tokenize. Sum per-message bounded counts with a fixed template overhead, gated by a content-string-keyed 4K-entry LRU. Same entry tokenizes once over its lifetime instead of once per turn. The estimate drives fold thresholds (50%/75% of ctx) where ±5% slop is harmless.
- truncateForModelByTokens: sample-based fast path. For inputs in the [maxTokens, maxTokens*4] range the old code unconditionally tokenized the full string (37% total CPU on the 200-turn probe). Now we use a 2KB-sample estimate with a 1.15x safety margin; only borderline cases fall through to a precise tokenize.

Regression origin: #1642/#1646 collapsed the conditional preflight into an unconditional estimateTurnStart that runs every turn, surfacing the underlying tokenizer cost. The tokenizer itself has always been a pure-TS BPE port without caching: fine when called rarely, expensive when called on every turn against growing logs.

Also adds three probes that reproduce + measure:

- scripts/probe-mem-leak.mts: drives CacheFirstLoop through N turns with fakeFetch, samples RSS/heap/log
- scripts/probe-jobs-leak.mts: confirms JobRegistry's MAX_COMPLETED_JOBS cap actually evicts
- scripts/analyze-cpuprofile.mjs: flat self/total roll-up for any .cpuprofile produced by --cpu-prof or `reasonix code --profile`
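To make the bpeEncode change above concrete, here is a minimal sketch of the two ideas it names: merging symbol pairs in place with `splice`, and putting a bounded LRU cache in front of the encoder. This is not the code from the PR. `LruCache`, `mergePairInPlace`, `slowEncode`, and `cachedEncode` are hypothetical names, the two-rule merge table is a toy, and only the 8192-entry capacity is taken from the description.

```typescript
// Minimal LRU cache built on Map, which iterates keys in insertion order.
class LruCache<K, V> {
  private readonly map = new Map<K, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    if (maxEntries < 1) throw new RangeError("maxEntries must be >= 1");
    this.maxEntries = maxEntries;
  }

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key) as V;
    // Delete and re-insert so the key moves to the most-recently-used end.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }

  get size(): number {
    return this.map.size;
  }
}

// Apply one merge rule in place: each adjacent (left, right) pair becomes a
// single symbol via splice, instead of rebuilding the array with slice/spread.
function mergePairInPlace(symbols: string[], left: string, right: string): void {
  for (let i = 0; i < symbols.length - 1; i++) {
    if (symbols[i] === left && symbols[i + 1] === right) {
      symbols.splice(i, 2, left + right);
    }
  }
}

// Hypothetical stand-in for the expensive per-chunk encode.
let encodeCalls = 0;
function slowEncode(chunk: string): string[] {
  encodeCalls++;
  const symbols = Array.from(chunk);
  mergePairInPlace(symbols, "a", "b"); // toy merge table with two rules
  mergePairInPlace(symbols, "ab", "c");
  return symbols;
}

// Capacity follows the "8K-entry" figure in the description.
const chunkCache = new LruCache<string, string[]>(8192);

function cachedEncode(chunk: string): string[] {
  const hit = chunkCache.get(chunk);
  if (hit !== undefined) return hit;
  const symbols = slowEncode(chunk);
  chunkCache.set(chunk, symbols);
  return symbols;
}
```

In this sketch the cache returns the stored array by reference, so callers must treat results as read-only. Bounding the entry count is what keeps the cache's memory footprint capped regardless of session length.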
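The sample-based fast path described above for truncateForModelByTokens can be sketched roughly as follows. The description does not show how inputs are classified, so the three-way split here (clearly under budget, clearly over, borderline) is one plausible reading, not the PR's logic. `countTokens` is a hypothetical whitespace tokenizer standing in for the real BPE encoder, the sample is assumed to be the first 2048 characters, and all function names are illustrative. Only the 2KB sample size and the 1.15 margin come from the description.

```typescript
// Hypothetical stand-in for a precise tokenizer: one token per whitespace-separated word.
function countTokens(text: string): number {
  const words = text.match(/\S+/g);
  return words ? words.length : 0;
}

const SAMPLE_CHARS = 2048; // "2KB sample" per the description; characters approximate bytes here
const SAFETY_MARGIN = 1.15; // safety margin per the description

let preciseFallbacks = 0; // counts how often the slow path runs

// Extrapolate a whole-string token count from the first SAMPLE_CHARS characters.
function estimateTokens(text: string): number {
  if (text.length <= SAMPLE_CHARS) return countTokens(text);
  const sample = text.slice(0, SAMPLE_CHARS);
  return Math.ceil((countTokens(sample) / sample.length) * text.length);
}

// Slow path: binary-search the longest prefix whose exact token count fits.
function truncatePrecisely(text: string, maxTokens: number): string {
  preciseFallbacks++;
  if (countTokens(text) <= maxTokens) return text;
  let lo = 0;
  let hi = text.length;
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2);
    if (countTokens(text.slice(0, mid)) <= maxTokens) lo = mid;
    else hi = mid - 1;
  }
  return text.slice(0, lo);
}

function truncateByTokens(text: string, maxTokens: number): string {
  const estimate = estimateTokens(text);
  // Clearly under budget, even after padding the estimate: keep everything.
  if (estimate * SAFETY_MARGIN <= maxTokens) return text;
  // Clearly over budget: cut proportionally, padding the estimate so the cut errs short.
  if (estimate > maxTokens * SAFETY_MARGIN) {
    const keepChars = Math.floor((text.length * maxTokens) / (estimate * SAFETY_MARGIN));
    return text.slice(0, keepChars);
  }
  // Borderline: the estimate is too close to the limit to trust.
  return truncatePrecisely(text, maxTokens);
}
```

The proportional cut is only as good as the sample: it assumes token density is roughly uniform across the string, which is why the estimate is padded by the margin before cutting. Inputs whose estimate sits within the margin of the limit still pay for exact tokenization.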
Summary
Three small changes that together cut per-turn CPU ~57% and steady-state RSS ~22% in the 200-turn fakeFetch probe (RSS 256MB → 181MB at log.len=800).
Users were reporting node memory growing fast + high CPU on CLI sessions; profiling on `main` showed the tokenizer was the dominant cost:

- `bpeEncode` (self)
- `estimateTurnStart` (total)
- `truncateForModelByTokens` (total)

Why this regressed after v0.48.0
#1642 / #1646 collapsed the conditional preflight into an unconditional `estimateTurnStart` that runs every turn. That surfaced an underlying cost: the pure-TS BPE port has no caching and was being called on growing logs every turn. Pre-0.48 the preflight only ran when the model said context was hot, so the cost stayed invisible.

What changed
- `bpeEncode`: in-place `splice` instead of slice/spread rebuild on every merge, plus an 8K-entry LRU cache. Repetitive tool output (padded payloads, repeated identifiers in code) was re-encoding the same byte-level chunks thousands of times per session. Cache caps at ~400KB.
- `estimateConversationTokens`: drop the full `formatDeepSeekPrompt` rebuild. Sum per-message bounded counts with a fixed template overhead, gated by a content-string-keyed 4K LRU. Same conversation entry now tokenizes once over its lifetime instead of once per turn. The estimate drives fold thresholds (50% / 75% of ctx) where ±5% slop is harmless.
- `truncateForModelByTokens`: sample-based fast path. For inputs in the `[maxTokens, maxTokens*4]` range the old code unconditionally tokenized the full string. Now we use a 2KB-sample estimate with a 1.15× safety margin; only borderline cases fall through to a precise tokenize.

Probes added
So this is reproducible going forward:
- `scripts/probe-mem-leak.mts`: drives `CacheFirstLoop` through N turns with a fakeFetch, samples RSS / heap / log size every K turns. No API key needed.
- `scripts/probe-jobs-leak.mts`: confirms `JobRegistry.MAX_COMPLETED_JOBS` cap evicts (it does).
- `scripts/analyze-cpuprofile.mjs`: flat self/total time roll-up for any `.cpuprofile` from `--cpu-prof` or `reasonix code --profile`.

Test plan