Compress a folder of markdown files into a small, structured summary that fits in an LLM's context window — without losing the ability to look up specifics.
You have a knowledge base: maybe 40 markdown files about your goals, health, career, projects, whatever. You want an AI to use all of it when answering questions. But dumping 80 raw text chunks into every prompt wastes tokens, and most of those chunks aren't relevant to the question.
memorypack reads your markdown files and produces a three-tier summary:
Tier 0 — Overview (~200 tokens) One paragraph covering everything.
Tier 1 — Topic summaries (~100 tokens each) One paragraph per cluster of related content.
Tier 2 — Key facts (bullet points) Specific details extracted from each cluster.
The original files stay untouched. The summaries are an additional layer — you can still search the raw text when you need a specific detail.
There are 6 steps: chunk, embed, deduplicate, cluster, summarize, and extract facts. Here's what each one does in plain English:
Split every markdown file into small pieces (~512 tokens each). Break at headings and paragraph boundaries so each piece is about one thing.
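A minimal sketch of this kind of splitting, not memorypack's actual chunker. It assumes tokens can be approximated by whitespace-separated words, and the function name `chunk_markdown` is made up for illustration.

```python
import re


def chunk_markdown(text: str, chunk_size: int = 512) -> list[str]:
    """Split markdown into chunks of roughly chunk_size tokens.

    Assumption: one whitespace-separated word counts as one token.
    """
    # Start a new section at every markdown heading line.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        if not section.strip():
            continue
        current: list[str] = []
        current_len = 0
        # Inside a section, pack whole paragraphs until the budget is hit.
        for para in re.split(r"\n\s*\n", section.strip()):
            n = len(para.split())
            if current and current_len + n > chunk_size:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            current.append(para)
            current_len += n
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```

A paragraph longer than `chunk_size` is kept whole in this sketch rather than cut mid-sentence, so chunk sizes are a target, not a hard cap.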
Turn each chunk into a 384-number fingerprint (a vector) using a small local model (all-MiniLM-L6-v2). Chunks about similar topics get similar fingerprints.
Compare every pair of chunks by the cosine similarity of their fingerprints. If two chunks score 0.92 or higher, they're almost certainly saying the same thing: keep the longer one, drop the other.
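Here is one way the dedup step could look, using union-find so that chains of near-duplicates collapse into a single group. It is a sketch under assumptions: the names `cosine` and `dedup` are illustrative, vectors are plain Python lists, and "longest" is measured in characters.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def dedup(chunks, vectors, threshold=0.92):
    """Group chunks whose similarity is >= threshold; keep the longest per group."""
    parent = list(range(len(chunks)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Union every pair that clears the threshold.
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    # For each group, remember the index of its longest chunk.
    best = {}
    for i, text in enumerate(chunks):
        root = find(i)
        if root not in best or len(text) > len(chunks[best[root]]):
            best[root] = i
    return [chunks[i] for i in sorted(best.values())]
```

The pairwise loop is quadratic in the number of chunks, which is fine for a few hundred chunks but would need an index for much larger collections.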
Group the remaining chunks by similarity using k-means clustering. The algorithm automatically picks how many groups to make (2–10) by testing different values and keeping the one where groups are most internally coherent (silhouette score).
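To make "most internally coherent" concrete, here is a small pure-Python silhouette score and a helper that picks the best of several candidate groupings. It is a sketch, not the project's code: the k-means step itself is left out, and the candidate labelings are assumed to come from running k-means once per value of k. The names `silhouette` and `pick_best` are hypothetical.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient: higher means tighter, better-separated groups."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        if not others:
            return 0.0
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, lab in enumerate(labels):
        if len(clusters[lab]) == 1:
            continue  # singleton clusters score 0 by convention
        a = mean_dist(i, clusters[lab])  # distance to own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)


def pick_best(points, candidate_labelings):
    """Return the labeling (one per candidate k) with the highest silhouette."""
    return max(candidate_labelings, key=lambda labels: silhouette(points, labels))
```

In practice the same score is available as `silhouette_score` in scikit-learn, which is already in the dependency list; the hand-rolled version here only shows what is being measured.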
Each cluster gets a label pulled from the markdown headings of its chunks.
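How the label is derived from headings isn't spelled out here, so the following is only one plausible approach: take the most frequent headings among a cluster's chunks and join them. The function name, the `" & "` separator, and the `"Untitled"` fallback are all assumptions.

```python
from collections import Counter


def label_cluster(headings, max_parts=2):
    """Build a cluster label from the most common chunk headings (illustrative)."""
    counts = Counter(h.strip() for h in headings if h and h.strip())
    if not counts:
        return "Untitled"  # hypothetical fallback when no chunk has a heading
    top = [h for h, _ in counts.most_common(max_parts)]
    return " & ".join(top)
```

For example, a cluster whose chunks sit mostly under "Health" and "Fitness" headings would come out as "Health & Fitness".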
Send each cluster to a summarizer to get a ~100-token paragraph. Then summarize all the summaries into one ~200-token overview.
Scan each cluster for sentences that contain specific, useful information — things with proper nouns, numbers, or actionable language ("always", "must", "prefers"). Filter out vague transitions ("However, this...", "It should be noted..."). Keep up to 10 facts per cluster.
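The rules above can be sketched roughly like this. The word lists contain only the examples quoted in the text, the regex sentence splitter stands in for nltk's tokenizer, and "proper noun" is approximated as a capitalized word that isn't the first word of the sentence. All names are illustrative.

```python
import re

ACTION_WORDS = {"always", "must", "prefers"}  # examples from the text above
VAGUE_OPENERS = ("however, this", "it should be noted")


def extract_facts(text, max_facts=10):
    """Keep sentences with numbers, proper nouns, or actionable language."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    facts = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence or sentence.lower().startswith(VAGUE_OPENERS):
            continue
        words = re.findall(r"[A-Za-z']+", sentence)
        has_number = bool(re.search(r"\d", sentence))
        # Rough proper-noun signal: a capitalized word after the first one.
        has_proper = any(w[0].isupper() and w != "I" for w in words[1:])
        has_action = any(w.lower() in ACTION_WORDS for w in words)
        if has_number or has_proper or has_action:
            facts.append(sentence)
        if len(facts) == max_facts:
            break
    return facts
```

Rule-based filters like this are cheap and predictable, but they will let through some noise (any capitalized word counts) and miss facts phrased without numbers or names.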
A single markdown file (or multi-file format) with this structure:
# Knowledge Base: My Life
> Compressed by memorypack | 40 files | 8,820 → 1,400 tokens (6.3:1)
## Overview
One paragraph covering all major themes...
## Topics
### Health & Fitness
Summary paragraph about health-related content...
### Career & Projects
Summary paragraph about career-related content...
## Facts
### Health & Fitness
- Specific fact 1
- Specific fact 2
### Career & Projects
- Specific fact 1
- Specific fact 2
pip install -e .
memorypack compress ~/path/to/markdown/files -o compressed/
Options:
-o, --output DIR Output directory (default: "compressed")
--format [single|multi] One file or separate files
--device [cpu|cuda|mps] GPU acceleration
--compression-target FLOAT Target compression ratio (default: 6.0)
--chunk-size INT Target tokens per chunk (default: 512)
--topic STR Name for the output header
Remove low-value clusters, merge near-duplicates, or enforce a token budget on a previously compressed knowledge_base.md.
# Enforce a 1500-token budget
memorypack prune compressed/knowledge_base.md --max-tokens 1500
# Drop clusters scoring below 0.3 importance, write to a new file
memorypack prune compressed/knowledge_base.md --min-importance 0.3 -o pruned.md
# Preview what would be removed without writing
memorypack prune compressed/knowledge_base.md --max-tokens 1000 --dry-run
# Skip near-duplicate merging
memorypack prune compressed/knowledge_base.md --max-tokens 1200 --no-merge
Options:
--max-tokens INT Token budget (0 = no limit)
--min-importance FLOAT Drop clusters below this score [0-1]
--similarity-threshold FLOAT Cosine threshold for duplicate detection (default: 0.80)
--no-merge Disable near-duplicate merging
--dry-run Print plan without writing
--device [cpu|cuda|mps] GPU acceleration
-o, --output PATH Output file (default: overwrite input)
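The exact pruning order isn't documented here, so treat the following as a guess at the general idea rather than what `memorypack prune` does: drop clusters under the importance floor, then drop the least important remaining clusters until the budget fits. `Cluster` and `plan_prune` are hypothetical names, and cluster names are assumed to be unique.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    tokens: int
    importance: float  # assumed to be in [0, 1]


def plan_prune(clusters, max_tokens=0, min_importance=0.0):
    """Return (kept, removed_names) without modifying anything, like a dry run."""
    kept = [c for c in clusters if c.importance >= min_importance]
    if max_tokens > 0:  # 0 means no limit, matching --max-tokens
        kept.sort(key=lambda c: c.importance, reverse=True)
        while kept and sum(c.tokens for c in kept) > max_tokens:
            kept.pop()  # drop the least important remaining cluster
    kept_names = {c.name for c in kept}
    removed = [c.name for c in clusters if c.name not in kept_names]
    # Preserve the original cluster order in the output.
    return [c for c in clusters if c.name in kept_names], removed
```

Returning a plan instead of writing a file makes it easy to print what would be removed before committing to it.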
Poll a directory for .md changes and re-run compression automatically. Optionally auto-prune if output exceeds a token budget.
# Watch with 30-second interval, auto-prune at 2000 tokens
memorypack watch ~/knowledge/ --interval 30 --token-budget 2000 -o compressed/
# Watch with defaults (60s interval, no budget)
memorypack watch ~/knowledge/
Options:
--interval INT Polling interval in seconds (default: 60)
--token-budget INT Auto-prune threshold (0 = no limit)
--device [cpu|cuda|mps] GPU acceleration
--topic STR Name for the output header
-o, --output DIR Output directory (default: "compressed")
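Polling for changes can be done with nothing more than file modification times. This sketch assumes that approach; the function names and the `max_polls` parameter (there so the loop can be stopped) are not part of the CLI.

```python
import time
from pathlib import Path


def snapshot(root):
    """Map every .md file under root to its last-modified time."""
    return {str(p): p.stat().st_mtime_ns for p in Path(root).rglob("*.md")}


def changed_paths(old, new):
    """Paths that were added, removed, or modified between two snapshots."""
    added_or_removed = set(old) ^ set(new)
    modified = {p for p in set(old) & set(new) if old[p] != new[p]}
    return sorted(added_or_removed | modified)


def watch(root, on_change, interval=60, max_polls=None):
    """Call on_change(paths) whenever a poll finds a difference."""
    seen = snapshot(root)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = snapshot(root)
        changed = changed_paths(seen, current)
        if changed:
            on_change(changed)  # e.g. re-run compression here
            seen = current
        polls += 1
```

Polling is simpler and more portable than OS file-event APIs; the trade-off is that changes are noticed up to one interval late.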
memorypack produces static summaries. In a real app, you'd combine them with hybrid search, both semantic (vector) and full-text (BM25), so each method covers the other's blind spots:
You ask: "How's my health progress?"
Four things happen:
1. COMPRESSED SUMMARIES (memorypack output)
→ Overview paragraph (always loaded, ~200 tokens)
→ Health cluster summary + facts (~150 tokens)
2. HYBRID SEARCH (Vector + BM25)
→ Vector: embed your question, find semantically similar chunks
→ BM25: rank chunks by term frequency × inverse document frequency
→ Merge both result lists via Reciprocal Rank Fusion (RRF, k=60)
→ Skip chunks that overlap with the summaries above
3. CLAUDE CODE SESSION INDEX
→ 968+ JSONL session files parsed and chunked
→ Past decisions, conversations, and context are searchable
→ Sessions indexed alongside memory files in both vector and BM25
4. SCORED MEMORIES (separate system)
→ Short extracted facts from past conversations
→ Matched by keyword/topic relevance
All four get stuffed into the LLM prompt.
The summaries cover the broad picture cheaply.
Hybrid search fills in specifics — BM25 catches exact terms that
vector search misses, vector search catches meaning that BM25 misses.
Session history surfaces past decisions and context.
Nothing is lost — the originals are always one search away.
Each search method has blind spots:
| Method | Good at | Misses |
|---|---|---|
| Vector (cosine) | Semantic similarity, paraphrases, conceptual matches | Exact terms, rare proper nouns, specific numbers |
| BM25 (TF-IDF) | Exact keyword matches, rare terms, specific names | Synonyms, paraphrases, conceptual queries |
| Hybrid (RRF) | Both; documents appearing in both lists get boosted scores | Only what both methods miss |
RRF merging is simple and effective: score(doc) = Σ 1/(60 + rank_i(doc)). A document ranked #1 in both lists scores higher than one ranked #1 in only one.
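The formula translates almost directly into code. A minimal sketch, assuming each result list is already sorted best-first and contains document IDs; `rrf_merge` is an illustrative name.

```python
def rrf_merge(ranked_lists, k=60):
    """Merge ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["a", "b", "c"]
bm25_hits = ["a", "d", "b"]
merged = rrf_merge([vector_hits, bm25_hits])
# "a" is #1 in both lists, so it scores 2/61 and comes first.
```

Because only ranks are used, the raw cosine and BM25 scores never need to be put on a common scale.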
When summaries match the question well (high cosine similarity between question and cluster centroids), fewer raw chunks are needed:
| Summary match quality | Raw chunks fetched |
|---|---|
| Strong (sim ≥ 0.6) | 5 chunks |
| Medium (sim ≥ 0.45) | 8 chunks |
| Weak (sim < 0.45) | 12-15 chunks |
This saves 500-1,500 tokens per query without losing coverage.
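The table maps onto a small lookup. In this sketch the input is assumed to be the best cosine similarity between the question and any cluster centroid; since the weak tier is given as a range, the exact count is left as a parameter (`weak_count`, a made-up name).

```python
def chunks_to_fetch(best_similarity, weak_count=12):
    """Pick how many raw chunks to retrieve from the summary match quality."""
    if best_similarity >= 0.6:
        return 5  # strong match: summaries already cover most of it
    if best_similarity >= 0.45:
        return 8  # medium match
    return weak_count  # weak match: 12-15 in the table above
```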
~/.claude/memory/
├── domains/
│ ├── health/
│ │ ├── profile.md
│ │ ├── goals.md
│ │ └── current_state.md
│ ├── career/
│ └── financial/
└── ...
│
▼
┌──────────────┐
│ CHUNKER │ Split by headings, ~512 tokens each
└──────┬───────┘
▼
┌──────────────┐
│ EMBEDDER │ all-MiniLM-L6-v2, 384-dim vectors
└──────┬───────┘
▼
┌──────────────┐
│ DEDUP │ Union-find, cosine ≥ 0.92 → keep longest
└──────┬───────┘
▼
┌──────────────┐
│ CLUSTER │ K-means, auto-select k via silhouette
└──────┬───────┘
▼
┌──────────────┐
│ SUMMARIZE │ ~100 tokens per cluster + ~200 token overview
└──────┬───────┘
▼
┌──────────────┐
│ FACTS │ Rule-based extraction, max 10 per cluster
└──────┬───────┘
▼
compressed-knowledge.json
(or knowledge_base.md)
~/.claude/memory/ ─── chunker ────┐
(293 markdown files) │
├── Chunk[] ──┬── LanceDB (vector, 384-dim)
~/.claude/projects/ ─── session │ │
(968+ JSONL sessions) parser ─────┘ └── MiniSearch (BM25, TF-IDF)
│
search(query) ────────────┤
│ vector results ────────┤
│ BM25 results ──────────┤
│ RRF merge (k=60) ──────┘
└── SearchResult[] (unified interface)
| Parameter | Default | What it does |
|---|---|---|
| `chunk_size` | 512 | Target tokens per chunk |
| `dedup_threshold` | 0.92 | How similar two chunks must be to count as duplicates |
| `min_clusters` | 2 | Minimum number of topic groups |
| `max_clusters` | 10-20 | Maximum number of topic groups |
| `summary_max_tokens` | 150 | Length of each cluster summary |
| `overview_max_tokens` | 250 | Length of the overall overview |
- sentence-transformers — local embeddings (no API calls)
- transformers + torch — BART summarization (local) or swap for an API-based summarizer
- scikit-learn — clustering algorithms
- nltk — sentence tokenization
- click + rich — CLI interface