**Problem or Use Case:**
When many skills are installed, all skill names + descriptions are injected into <available_skills> in the system prompt at every session start — regardless of whether those skills will ever be used in that session.
With 87 bundled skills, the skills block alone consumes ~1,500-2,000 tokens of the system prompt. For messaging platform users (Discord/Telegram) who pay per-token through API proxies without prompt caching support, this adds up fast — every single message re-sends the full skill listing.
The SKILL.md content is already lazy-loaded (only read when skill_view() is called). The bottleneck is the skill listing itself (names and descriptions), which is always present in the prompt.
**Proposed Solution:**
Add a skills.loading config option (default: "eager" for backward compatibility):
skills:
  loading: lazy  # "eager" (current behavior) or "lazy"
When lazy:
The <available_skills> block is NOT injected into the system prompt by build_skills_system_prompt()
Instead, the system prompt includes a single line: "A skills catalog is available — call list_skills() when you think a specialized skill might help."
A lightweight built-in list_skills() tool returns the same <available_skills> block on demand
Agent calls skill_view(name) as normal after discovering the relevant skill
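
The lazy/eager switch described above could look roughly like the sketch below. The real signature of build_skills_system_prompt() is not shown in this issue, so the parameters, the `render_skills_block` helper, and the listing format are all hypothetical; the hint text follows the proposal, with a plain hyphen.

```python
from typing import Dict

# Hint shown instead of the full listing when loading is "lazy".
LAZY_HINT = (
    "A skills catalog is available - call list_skills() when you think "
    "a specialized skill might help."
)


def render_skills_block(skills: Dict[str, str]) -> str:
    """Render name/description pairs as an <available_skills> block (hypothetical format)."""
    lines = ["<available_skills>"]
    for name, description in sorted(skills.items()):
        lines.append(f"- {name}: {description}")
    lines.append("</available_skills>")
    return "\n".join(lines)


def build_skills_system_prompt(skills: Dict[str, str], loading: str = "eager") -> str:
    """Return the skills section of the system prompt for the given loading mode."""
    if loading == "lazy":
        # Only the one-line hint goes into the prompt.
        return LAZY_HINT
    if loading != "eager":
        raise ValueError(f"unknown skills.loading value: {loading!r}")
    # Eager mode keeps the current behavior: the full listing is injected.
    return render_skills_block(skills)


def list_skills(skills: Dict[str, str]) -> str:
    """Tool handler: return the same block that eager mode would have injected."""
    return render_skills_block(skills)
```

Keeping one renderer for both paths means the agent sees an identical listing whether it arrives in the prompt or as a tool result.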
Impact estimate:
| Setup | System prompt tokens | Working headroom (200k) |
|---|---|---|
| 87 skills always loaded | ~8,500 | ~191k |
| Lazy loading | ~6,500 (-2,000) | ~193k |
The token savings scale with the number of installed skills. Users with 50+ skills benefit most.
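
As a rough illustration of the per-message cost without prompt caching, the sketch below multiplies the listing size by the number of messages. The 100-message session length and the 25-token size of the lazy hint are assumptions chosen for illustration, not measurements.

```python
def eager_listing_cost(listing_tokens: int, messages: int) -> int:
    """Tokens spent re-sending the full skill listing on every message."""
    return listing_tokens * messages


def lazy_hint_cost(hint_tokens: int, messages: int) -> int:
    """Tokens spent re-sending only the one-line hint (session never calls list_skills())."""
    return hint_tokens * messages


if __name__ == "__main__":
    messages = 100  # assumed session length
    for listing_tokens in (1_500, 2_000):  # range quoted in the issue
        eager = eager_listing_cost(listing_tokens, messages)
        lazy = lazy_hint_cost(25, messages)  # 25 tokens is an assumed hint size
        print(f"{listing_tokens} listing tokens: eager={eager}, lazy={lazy}")
```

If the agent does call list_skills(), the returned block becomes part of the conversation history and is re-sent from that point on, so the savings apply mainly to sessions that never need the catalog.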
**Alternatives Considered:**
Per-agent skills allowlist (current workaround) — works but requires manual curation, and agents lose awareness of skills they only occasionally need.
Skill namespacing with scoped loading — more complex, requires changes to skill discovery. Lazy loading is simpler and achieves the same goal.
Removing skills entirely from prompt — too aggressive, agents would never discover skills without explicit user instruction.
## ISSUE #2 — Observation Masking
**Title:**
Observation masking for older turns: reduce conversation history tokens without LLM summarization
**Problem or Use Case:**
In multi-turn agent conversations, tool outputs (file reads, terminal output, web search results) from earlier turns are re-sent in full with every API call. This causes quadratic token growth — by turn 10, the context can contain thousands of tokens of stale tool output that the agent will never reference again.
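
A rough worked estimate of that growth, assuming (for illustration only) that every turn adds one tool output of $s$ tokens and that API call $t$ re-sends all $t$ outputs produced so far:

```latex
\[
T(n) \;=\; \sum_{t=1}^{n} t\,s \;=\; \frac{s\,n(n+1)}{2}
\]
\[
s = 500,\quad n = 10:\qquad T(10) \;=\; \frac{500 \cdot 10 \cdot 11}{2} \;=\; 27{,}500
\]
```

Under these assumptions, ten turns produce only 5,000 tokens of tool output but send 27,500 tokens of it over the wire.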
The current context_compressor.py uses LLM-based summarization when context exceeds limits. This works but:
- Costs extra tokens (the summarization call itself)
- Adds latency (extra API round-trip)
- Can lose important details
JetBrains Research published "The Complexity Trap" (NeurIPS 2025 DL4Code workshop, arXiv:2508.21433) showing that simple observation masking — replacing old tool outputs with a short placeholder like [output hidden, N chars] — achieves the same performance as LLM summarization at 52.7% cost reduction, with zero extra API calls.
**Proposed Solution:**
Add observation masking as a lightweight first-pass in context_compressor.py, before the existing LLM summarization kicks in.
Config option:
context:
  observation_masking: true        # default: true
  observation_masking_window: 4    # keep last N turns unmasked
Behavior:
For turns older than the masking window, replace tool/function result content with: [observation masked — {N} chars, use session_search to recall if needed]
Keep the agent's own reasoning and actions (assistant messages) intact — only mask environment observations
The existing LLM summarization remains as a second-pass safety net for when context still exceeds limits after masking
This is complementary to the existing compression — masking handles the common case cheaply, summarization handles edge cases.
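
The masking pass described above might be sketched as follows. This is a minimal illustration, not the actual context_compressor.py code: it assumes OpenAI-style message dicts with a `role` key, treats each user message as the start of a turn, and uses a plain hyphen in the placeholder. The function name and signature are hypothetical.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]

# Placeholder wording follows the proposal; {n} is the original length in characters.
PLACEHOLDER = "[observation masked - {n} chars, use session_search to recall if needed]"
MASK_PREFIX = "[observation masked"


def mask_observations(messages: List[Message], window: int = 4) -> List[Message]:
    """Return a copy of `messages` with tool outputs older than `window` turns masked."""
    if window < 1:
        raise ValueError("window must be at least 1")

    # Assumption: a turn begins at each user message.
    user_indices = [i for i, m in enumerate(messages) if m.get("role") == "user"]
    if len(user_indices) <= window:
        return [dict(m) for m in messages]  # nothing is old enough to mask

    cutoff = user_indices[-window]  # first message index inside the unmasked window
    masked: List[Message] = []
    for i, message in enumerate(messages):
        copy = dict(message)  # shallow copy so the caller's history is not mutated
        content = copy.get("content")
        if (
            i < cutoff
            and copy.get("role") == "tool"
            and isinstance(content, str)
            and not content.startswith(MASK_PREFIX)  # keep the pass idempotent
        ):
            copy["content"] = PLACEHOLDER.format(n=len(content))
        masked.append(copy)
    return masked
```

Assistant and user messages pass through untouched, so the agent's reasoning chain stays intact. The existing summarization step would then run on the returned list only if it still exceeds the context limit.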
Reference: https://blog.jetbrains.com/research/2025/12/efficient-context-management/
Paper: https://arxiv.org/abs/2508.21433
### Alternatives Considered
LLM summarization only (current approach) — works but costs extra tokens and adds latency. The JetBrains research shows masking is equally effective.
Aggressive context truncation — risks losing important context. Masking is more surgical — it only hides verbose tool outputs while preserving the agent's reasoning chain.
Hybrid masking + summarization — the JetBrains paper shows this can be even more effective. The proposed implementation supports this naturally since masking runs first and summarization remains as fallback.
### Feature Type
Performance / reliability
### Scope
Medium (few files, < 300 lines)
### Contribution
- [x] I'd like to implement this myself and submit a PR