refactor(tools): compress tool spec descriptions (-28%, ~2.7k tokens/request) #1321
Merged
Every byte of a tool's description and JSON-schema parameter blob ships in every request. The spec list had grown to 39,377 bytes (≈ 10k tokens) across 35 tools, much of it teaching guidance that was already covered in the system prompt, plus verbose schema property descriptions that restated what the parameter name already conveys.

Compress the heaviest twelve tools while preserving the rules that are actually load-bearing (don't-use-for-A/B/C-menus on submit_plan, plan-mode gate behavior on run_command, prefix-stable warning on remember):

- submit_plan 3517 → 1648 (-53%)
- revise_plan 2554 → 1482 (-42%)
- create_skill 2379 → 1367 (-43%)
- install_skill 2293 → 1510 (-34%)
- search_content 2167 → 1263 (-42%)
- mark_step_complete 1988 → 1037 (-48%)
- add_mcp_server 1898 → 1213 (-36%)
- run_command 1862 → 1376 (-26%)
- todo_write 1675 → 872 (-48%)
- ask_choice 1807 → 1313 (-27%)
- remember 1812 → 1306 (-28%)
- run_background 1611 → 1186 (-26%)

Total: 39,377 → 28,412 bytes (-28%, ≈ 2,740 tokens per request).

Behaviour unchanged: guidance that lived only in tool descriptions (plan-mode interaction, ChoiceRequestedError stop semantics, prefix re-load timing) is preserved verbatim. The system prompt and tool error messages already carry the longer-form teaching content.

scripts/measure-tool-sizes.mts added so future audits don't have to re-derive per-tool byte counts by hand.
esengine added a commit that referenced this pull request May 19, 2026

…st) (#1323)

The system prompt was 24,387 bytes (≈ 6,100 tokens), much of it overlapping with the tool descriptions sitting right next to it in the cache prefix. Sections like "When to propose a plan", "When to ask the user to pick", and "When to track multi-step intent" each recited rules that the tool's own description already carried.

Aggressive dedup pass:

- Drop the redundant "you have these filesystem tools" opening sentence; the API ships the tool list separately.
- Merge the three independent submit_plan / ask_choice / todo_write sections into one short "Picking the right tool" block.
- Fold "Exploration", "Trust what you already know", and "When the user wants to switch project" into shorter equivalents: same rules, no narrative.
- Collapse the foreground/background section. The full how-to lives in the run_command / run_background tool descriptions; the prompt only needs the picking rule.
- Compress the audit-mode rails (#610) prose around the six rails themselves. Every rail's load-bearing phrase is preserved verbatim so tests/code-prompt.test.ts still asserts on them.

Result: 24,387 → 11,956 bytes (-51%, ≈ 3,100 tokens per request). Combined with PR #1320 / #1321 the cache-prefix tax per request is now ~16k tokens instead of ~36k.

Behaviour unchanged: every rail / gate / mode constraint is still asserted by the existing prompt tests.

Co-authored-by: reasonix <reasonix@deepseek.com>
Summary
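Every byte of a tool's description and JSON-schema parameter blob ships in every request, so the size of `tools.specs()` is a per-request cache-prefix tax. The list had grown to 39,377 bytes (~10k tokens) across 35 tools, with much of the bulk being:
- teaching guidance already covered in the system prompt (`submit_plan` vs `ask_choice`, when to spawn a subagent);
- verbose schema property descriptions that restated what the parameter name already conveys.

This PR tightens the heaviest twelve descriptions:

| Tool | Before (bytes) | After (bytes) | Change |
| --- | --- | --- | --- |
| `submit_plan` | 3517 | 1648 | -53% |
| `revise_plan` | 2554 | 1482 | -42% |
| `create_skill` | 2379 | 1367 | -43% |
| `install_skill` | 2293 | 1510 | -34% |
| `search_content` | 2167 | 1263 | -42% |
| `mark_step_complete` | 1988 | 1037 | -48% |
| `add_mcp_server` | 1898 | 1213 | -36% |
| `run_command` | 1862 | 1376 | -26% |
| `todo_write` | 1675 | 872 | -48% |
| `ask_choice` | 1807 | 1313 | -27% |
| `remember` | 1812 | 1306 | -28% |
| `run_background` | 1611 | 1186 | -26% |

Total: 39,377 → 28,412 bytes (-28%, ≈ 2,740 tokens per request).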
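For anyone re-checking the figures, the percentages and the token estimate can be reproduced with a few lines of arithmetic. This is only a sketch: the byte counts are the per-tool numbers reported for this PR, while the 4-bytes-per-token ratio is an assumption inferred from "39,377 bytes ≈ 10k tokens" and is not a measured tokenizer result.

```typescript
// Per-tool byte counts as reported for this PR: [tool, before, after].
const rows: Array<[string, number, number]> = [
  ["submit_plan", 3517, 1648],
  ["revise_plan", 2554, 1482],
  ["create_skill", 2379, 1367],
  ["install_skill", 2293, 1510],
  ["search_content", 2167, 1263],
  ["mark_step_complete", 1988, 1037],
  ["add_mcp_server", 1898, 1213],
  ["run_command", 1862, 1376],
  ["todo_write", 1675, 872],
  ["ask_choice", 1807, 1313],
  ["remember", 1812, 1306],
  ["run_background", 1611, 1186],
];

// Assumption: roughly 4 bytes per token (implied by the PR text, not measured).
const BYTES_PER_TOKEN = 4;

/** Percentage change from `before` to `after`, rounded to a whole percent. */
function pctChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 100);
}

for (const [name, before, after] of rows) {
  console.log(`${name.padEnd(20)} ${before} → ${after} (${pctChange(before, after)}%)`);
}

// Totals cover all 35 tools, so they are taken from the PR rather than summed from `rows`.
const totalBefore = 39377;
const totalAfter = 28412;
const savedBytes = totalBefore - totalAfter;
const savedTokens = Math.round(savedBytes / BYTES_PER_TOKEN);

console.log(
  `total ${totalBefore} → ${totalAfter} (${pctChange(totalBefore, totalAfter)}%), ~${savedTokens} tokens saved`,
);
```

Under that assumed ratio the saving comes out at about 2,741 tokens, which matches the ≈ 2,740 quoted above.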
What is preserved
The behaviourally load-bearing rules stay verbatim:
- `submit_plan`: don't-use-for-A/B/C-menus / use `ask_choice` for branches.
- `run_command`: the chain operator + redirect support list, the "`cd` doesn't persist" warning, the filter-at-source hint.
- `remember`: the "won't re-load until next /new" warning.
- `ChoiceRequestedError`'s stop-calling-tools semantics (this lives in the error object, not the description).

The longer-form teaching content already lives in the system prompt (`# When to propose a plan`, `# When to ask the user to pick`, `# When to track multi-step intent`, `# Foreground vs. background commands`), so trimming the duplication in tool descriptions is pure dedup.

This is PR #2 of a four-PR token-optimization series. PR #1 (#1320) added a regression net at `tests/prompt-budget.test.ts` that locks current values; the new floor here will be tightened in a follow-up commit once both land.

Diagnostic script
`scripts/measure-tool-sizes.mts` is added so future audits can run `npx tsx scripts/measure-tool-sizes.mts` and get a per-tool byte breakdown without re-deriving the table by hand.

Test plan
- `npm run verify`: all 230 test files / 3,237 tests pass
- `npm run lint` clean
- `npm run typecheck` clean
- `tests/tools.test.ts`, `tests/plan.test.ts`, `tests/skills.test.ts`, `tests/choice.test.ts` all green
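
For readers who have not opened the diagnostic script, the kind of per-tool breakdown it reports could be computed roughly as follows. This is a sketch, not the contents of `scripts/measure-tool-sizes.mts`: the `ToolSpec` shape, the `specBytes` and `measure` helpers, and the sample specs are all hypothetical, and it assumes a tool's size is the UTF-8 byte length of its JSON-serialized spec.

```typescript
// Hypothetical spec shape; the real registry types are not shown in this PR.
interface ToolSpec {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON-schema object
}

interface SizeRow {
  name: string;
  bytes: number;
}

/** UTF-8 byte length of one spec once serialized to JSON (assumed size metric). */
function specBytes(spec: ToolSpec): number {
  return new TextEncoder().encode(JSON.stringify(spec)).length;
}

/** Per-tool sizes, heaviest first. */
function measure(specs: ToolSpec[]): SizeRow[] {
  return specs
    .map((s) => ({ name: s.name, bytes: specBytes(s) }))
    .sort((a, b) => b.bytes - a.bytes);
}

// Stand-in data; a real run would read the registered tool specs instead.
const sampleSpecs: ToolSpec[] = [
  {
    name: "todo_write",
    description: "Track multi-step intent.",
    parameters: { type: "object", properties: { items: { type: "array" } } },
  },
  {
    name: "submit_plan",
    description: "Propose a plan for review. Not for A/B/C menus; use ask_choice for branches.",
    parameters: { type: "object", properties: { plan: { type: "string" } }, required: ["plan"] },
  },
];

const sizeRows = measure(sampleSpecs);
const totalBytes = sizeRows.reduce((sum, r) => sum + r.bytes, 0);

for (const r of sizeRows) {
  console.log(`${r.name.padEnd(24)} ${String(r.bytes).padStart(6)}`);
}
console.log(`${"TOTAL".padEnd(24)} ${String(totalBytes).padStart(6)}`);
```

Measuring bytes rather than characters matters once a description contains non-ASCII text, since each such character takes more than one byte on the wire.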