[Feature]: Lazy skill loading: remove skill listing from system prompt, use on-demand tool instead

### Problem or Use Case

When many skills are installed, all skill names + descriptions are injected into `<available_skills>` in the system prompt at every session start — regardless of whether those skills will ever be used in that session.

With 87 bundled skills, the skills block alone consumes ~1,500-2,000 tokens of the system prompt. For messaging platform users (Discord/Telegram) who pay per-token through API proxies without prompt caching support, this adds up fast — every single message re-sends the full skill listing.

The `SKILL.md` content is already lazy-loaded (only read when `skill_view()` is called). The bottleneck is the skill listing itself (names + descriptions), which is always in the prompt.

### Proposed Solution

Add a `skills.loading` config option (default: `"eager"` for backward compatibility):

```yaml
skills:
  loading: lazy    # "eager" (current behavior) or "lazy"
```

When `loading` is `lazy`:

- The `<available_skills>` block is NOT injected into the system prompt by `build_skills_system_prompt()`.
- Instead, the system prompt includes a single line: "A skills catalog is available; call list_skills() when you think a specialized skill might help."
- A lightweight built-in `list_skills()` tool returns the same `<available_skills>` block on demand.
- The agent calls `skill_view(name)` as normal after discovering the relevant skill.

Impact estimate:

| Setup | System prompt tokens | Working headroom (200k) |
|---|---|---|
| 87 skills always loaded | ~8,500 | ~191k |
| Lazy loading | ~6,500 (-2,000) | ~193k |

The token savings scale with the number of installed skills. Users with 50+ skills benefit most.
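A minimal sketch of how the two modes could fit together, assuming skills are available as simple name/description pairs. The `Skill` dataclass, the `render_available_skills()` helper, and the function signatures are hypothetical and chosen for illustration; the real `build_skills_system_prompt()` may take different arguments.

```python
from dataclasses import dataclass

# One-line hint injected in lazy mode (wording follows the proposal above).
LAZY_HINT = (
    "A skills catalog is available; call list_skills() when you think "
    "a specialized skill might help."
)


@dataclass(frozen=True)
class Skill:
    """Hypothetical stand-in for an installed skill's listing entry."""
    name: str
    description: str


def render_available_skills(skills: list[Skill]) -> str:
    """Render the full <available_skills> listing (names + descriptions)."""
    lines = ["<available_skills>"]
    lines += [f"- {s.name}: {s.description}" for s in skills]
    lines.append("</available_skills>")
    return "\n".join(lines)


def build_skills_system_prompt(skills: list[Skill], loading: str = "eager") -> str:
    """Return the skills portion of the system prompt for the given mode."""
    if loading == "eager":
        return render_available_skills(skills)  # current behavior
    if loading == "lazy":
        return LAZY_HINT  # listing is deferred to the list_skills() tool
    raise ValueError(f"unknown skills.loading value: {loading!r}")


def list_skills(skills: list[Skill]) -> str:
    """Tool handler: return the same listing that eager mode would inject."""
    return render_available_skills(skills)
```

Sharing one renderer between the prompt builder and the tool keeps the on-demand listing identical to what eager mode would have injected.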


### Alternatives Considered

- Per-agent skills allowlist (current workaround): works but requires manual curation, and agents lose awareness of skills they only occasionally need.
- Skill namespacing with scoped loading: more complex, requires changes to skill discovery. Lazy loading is simpler and achieves the same goal.
- Removing skills entirely from the prompt: too aggressive, since agents would never discover skills without explicit user instruction.


## ISSUE #2 — Observation Masking

**Title:**
Observation masking for older turns: reduce conversation history tokens without LLM summarization


### Problem or Use Case

In multi-turn agent conversations, tool outputs (file reads, terminal output, web search results) from earlier turns are re-sent in full with every API call. Because each call carries all earlier outputs, cumulative input tokens grow quadratically with the number of turns: by turn 10, the context can contain thousands of tokens of stale tool output that the agent will never reference again.
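As a rough illustration of that growth, the arithmetic below assumes every turn produces a tool output of the same size and that the k-th API call carries k such outputs. The 500-token figure is an illustrative assumption, not a measurement.

```python
def cumulative_input_tokens(turns: int, tokens_per_observation: int) -> int:
    """Total observation tokens sent across all API calls when nothing is dropped.

    Simplifying assumption: call k carries k equally sized observations,
    so the total is tokens_per_observation * (1 + 2 + ... + turns).
    """
    return tokens_per_observation * turns * (turns + 1) // 2


# Illustrative assumption: every turn produces a 500-token tool output.
in_context_at_turn_10 = 10 * 500                        # 5,000 tokens in one call
total_over_10_turns = cumulative_input_tokens(10, 500)  # 27,500 tokens overall
```

Doubling the number of turns roughly quadruples the cumulative total, which is why per-token billing without prompt caching is hit hardest.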

The current `context_compressor.py` uses LLM-based summarization when context exceeds limits. This works, but it:

- Costs extra tokens (the summarization call itself)
- Adds latency (extra API round-trip)
- Can lose important details

JetBrains Research published "The Complexity Trap" (NeurIPS 2025 DL4Code workshop, arXiv:2508.21433), showing that simple observation masking (replacing old tool outputs with a short placeholder like `[output hidden, N chars]`) achieves the same performance as LLM summarization at a 52.7% cost reduction, with zero extra API calls.


### Proposed Solution

Add observation masking as a lightweight first pass in `context_compressor.py`, before the existing LLM summarization kicks in.

Config option:

```yaml
context:
  observation_masking: true          # default: true
  observation_masking_window: 4      # keep last N turns unmasked
```

Behavior:

- For turns older than the masking window, replace tool/function result content with: `[observation masked: {N} chars, use session_search to recall if needed]`
- Keep the agent's own reasoning and actions (assistant messages) intact; only mask environment observations.
- The existing LLM summarization remains as a second-pass safety net for when context still exceeds limits after masking.

This is complementary to the existing compression: masking handles the common case cheaply, and summarization handles edge cases.
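The masking pass described above could look roughly like the sketch below. It assumes chat-style message dicts where tool results use `role == "tool"` with string content, and it treats each assistant message as the start of a turn. Both are assumptions made for illustration, as are the `mask_old_observations()` name and the `PLACEHOLDER` constant; the actual message schema in `context_compressor.py` may differ.

```python
from typing import Any

# Placeholder wording follows the proposal; {n} is the original length in chars.
PLACEHOLDER = "[observation masked: {n} chars, use session_search to recall if needed]"
_MASK_PREFIX = "[observation masked"


def mask_old_observations(
    messages: list[dict[str, Any]], window: int = 4
) -> list[dict[str, Any]]:
    """Return a copy of `messages` with tool outputs older than `window` turns masked.

    Assumptions: a turn starts at each assistant message, and tool results
    are dicts with role == "tool" and string content.
    """
    if window < 0:
        raise ValueError("window must be >= 0")

    # Indices of assistant messages mark where each turn begins.
    turn_starts = [i for i, m in enumerate(messages) if m.get("role") == "assistant"]
    if window == 0:
        cutoff = len(messages)            # no turn is protected
    elif len(turn_starts) > window:
        cutoff = turn_starts[-window]     # first message of the oldest kept turn
    else:
        cutoff = 0                        # history is shorter than the window

    masked = []
    for i, message in enumerate(messages):
        message = dict(message)           # shallow copy; the input is not mutated
        content = message.get("content")
        if (
            i < cutoff
            and message.get("role") == "tool"
            and isinstance(content, str)
            and not content.startswith(_MASK_PREFIX)  # skip already-masked output
        ):
            message["content"] = PLACEHOLDER.format(n=len(content))
        masked.append(message)
    return masked
```

Assistant and user messages pass through unchanged, so the agent's reasoning chain stays intact. Skipping content that is already masked keeps the pass idempotent if it runs on every request.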

Reference: https://blog.jetbrains.com/research/2025/12/efficient-context-management/

Paper: https://arxiv.org/abs/2508.21433


### Alternatives Considered

- LLM summarization only (current approach): works but costs extra tokens and adds latency. The JetBrains research shows masking is equally effective.
- Aggressive context truncation: risks losing important context. Masking is more surgical, since it only hides verbose tool outputs while preserving the agent's reasoning chain.
- Hybrid masking + summarization: the JetBrains paper shows this can be even more effective. The proposed implementation supports this naturally, since masking runs first and summarization remains as a fallback.
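The hybrid ordering (masking first, summarization as fallback) can be sketched as a small two-pass function. The `compress_context()` name and its injected callables are hypothetical; the sketch only shows the control flow and says nothing about how the existing compressor counts tokens or summarizes.

```python
from typing import Callable

Messages = list[dict]


def compress_context(
    messages: Messages,
    token_limit: int,
    count_tokens: Callable[[Messages], int],
    mask: Callable[[Messages], Messages],
    summarize: Callable[[Messages], Messages],
) -> Messages:
    """Two-pass compression: cheap masking first, LLM summarization only if needed."""
    masked = mask(messages)                  # first pass: no extra API call
    if count_tokens(masked) <= token_limit:
        return masked                        # common case: masking was enough
    return summarize(masked)                 # second pass: existing safety net
```

Passing the collaborators in as callables keeps the sketch independent of any particular tokenizer or summarization backend.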

### Feature Type

Performance / reliability

### Scope

Medium (few files, < 300 lines)

### Contribution

- [x] I'd like to implement this myself and submit a PR

[Feature]: Lazy skill loading: remove skill listing from system prompt, use on-demand tool instead #2045
