memorypack

Compress a folder of markdown files into a small, structured summary that fits in an LLM's context window — without losing the ability to look up specifics.

The Problem

You have a knowledge base: maybe 40 markdown files about your goals, health, career, projects, whatever. You want an AI to use all of it when answering questions. But dumping 80 raw text chunks into every prompt wastes tokens, and most of those chunks aren't relevant to the question.

What This Does

memorypack reads your markdown files and produces a three-tier summary:

Tier 0 — Overview       (~200 tokens)   One paragraph covering everything.
Tier 1 — Topic summaries (~100 tokens each)   One paragraph per cluster of related content.
Tier 2 — Key facts      (bullet points)   Specific details extracted from each cluster.

The original files stay untouched. The summaries are an additional layer — you can still search the raw text when you need a specific detail.

How It Works

There are 6 steps. Here's what each one does in plain English:

Step 1: Chunk

Split every markdown file into small pieces (~512 tokens each). Break at headings and paragraph boundaries so each piece is about one thing.
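
A minimal sketch of heading- and paragraph-aware chunking, not memorypack's actual chunker. The names `approx_tokens` and `chunk_markdown` are hypothetical, and the whitespace word count is only a rough stand-in for a real tokenizer:

```python
import re

def approx_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())

def chunk_markdown(text: str, chunk_size: int = 512) -> list[str]:
    """Split markdown into chunks, breaking at headings and blank lines."""
    # Start a new section in front of every heading line (#, ##, ...).
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        current, count = [], 0
        for para in re.split(r"\n\s*\n", section):
            para = para.strip()
            if not para:
                continue
            n = approx_tokens(para)
            # Close the chunk if this paragraph would push it past the target.
            if current and count + n > chunk_size:
                chunks.append("\n\n".join(current))
                current, count = [], 0
            current.append(para)
            count += n
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```

Because a chunk never crosses a heading, each piece stays on one topic, at the cost of some chunks being much shorter than the target.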

Step 2: Embed

Turn each chunk into a 384-number fingerprint (a vector) using a small local model (all-MiniLM-L6-v2). Chunks about similar topics get similar fingerprints.

Step 3: Deduplicate

Compare every pair of chunks by the cosine similarity of their fingerprints. If two chunks score 0.92 or higher, they're saying the same thing: keep the longer one, drop the other.
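
The pairwise comparison can be sketched with a small union-find, so that chains of near-duplicates collapse into one group. This is an illustration under assumptions rather than the project's code: `cosine` and `deduplicate` are hypothetical names, and the vectors are plain Python lists instead of real 384-dimensional embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(chunks, vectors, threshold=0.92):
    """Group chunks whose similarity meets the threshold; keep the longest of each group."""
    parent = list(range(len(chunks)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Union every pair that is similar enough.
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    # For each group, remember the index of its longest chunk.
    best = {}
    for i, chunk in enumerate(chunks):
        root = find(i)
        if root not in best or len(chunk) > len(chunks[best[root]]):
            best[root] = i
    return [chunks[i] for i in sorted(best.values())]
```

Comparing every pair is quadratic in the number of chunks, which is fine for a few hundred chunks but would need an approximate index for much larger collections.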

Step 4: Cluster

Group the remaining chunks by similarity using k-means clustering. The algorithm automatically picks how many groups to make (2–10) by testing each value and keeping the one with the highest silhouette score, a measure of how much closer each chunk sits to its own group than to the nearest other group.

Each cluster gets a label pulled from the markdown headings of its chunks.
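
One plausible way to derive such a label is to take the most frequent headings among a cluster's chunks. The exact rule memorypack uses is not stated here, so treat `label_cluster`, `max_parts`, and the `" & "` separator as assumptions made for illustration:

```python
from collections import Counter

def label_cluster(headings: list[str], max_parts: int = 2) -> str:
    """Build a label from the most common headings in a cluster."""
    # Strip leading '#' markers and surrounding whitespace.
    cleaned = [h.lstrip("#").strip() for h in headings]
    cleaned = [h for h in cleaned if h]
    if not cleaned:
        return "Untitled"
    # most_common keeps first-seen order among equal counts.
    common = Counter(cleaned).most_common(max_parts)
    return " & ".join(name for name, _ in common)
```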

Step 5: Summarize

Send each cluster to a summarizer to get a ~100 token paragraph. Then summarize all the summaries into one ~200 token overview.

Step 6: Extract Facts

Scan each cluster for sentences that contain specific, useful information — things with proper nouns, numbers, or actionable language ("always", "must", "prefers"). Filter out vague transitions ("However, this...", "It should be noted..."). Keep up to 10 facts per cluster.
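
A rough sketch of this kind of rule-based filter. The regular expressions, the word lists, and the name `extract_facts` are illustrative assumptions, and a regex sentence splitter stands in for the nltk tokenizer listed under Dependencies:

```python
import re

ACTION_WORDS = re.compile(r"\b(always|must|prefers)\b", re.IGNORECASE)
VAGUE_OPENERS = re.compile(r"^(however, this|it should be noted)", re.IGNORECASE)
HAS_NUMBER = re.compile(r"\d")
# Heuristic: a capitalized word that is not the first word of the sentence.
PROPER_NOUN = re.compile(r"(?<=\s)[A-Z][a-z]+")

def extract_facts(text: str, max_facts: int = 10) -> list[str]:
    """Keep sentences with numbers, proper nouns, or actionable language."""
    # Naive split on sentence-ending punctuation; abbreviations will confuse it.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    facts = []
    for s in sentences:
        if not s or VAGUE_OPENERS.match(s):
            continue
        if HAS_NUMBER.search(s) or ACTION_WORDS.search(s) or PROPER_NOUN.search(s):
            facts.append(s)
        if len(facts) == max_facts:
            break
    return facts
```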

Output

A single markdown file (or multi-file format) with this structure:

# Knowledge Base: My Life
> Compressed by memorypack | 40 files | 8,820 → 1,400 tokens (6.3:1)

## Overview
One paragraph covering all major themes...

## Topics
### Health & Fitness
Summary paragraph about health-related content...

### Career & Projects
Summary paragraph about career-related content...

## Facts
### Health & Fitness
- Specific fact 1
- Specific fact 2

### Career & Projects
- Specific fact 1
- Specific fact 2

Usage (Python CLI)

pip install -e .

`compress` — Build a knowledge base

memorypack compress ~/path/to/markdown/files -o compressed/

Options:

-o, --output DIR             Output directory (default: "compressed")
--format [single|multi]      One file or separate files
--device [cpu|cuda|mps]      GPU acceleration
--compression-target FLOAT   Target compression ratio (default: 6.0)
--chunk-size INT             Target tokens per chunk (default: 512)
--topic STR                  Name for the output header

`prune` — Shrink an existing knowledge base

Remove low-value clusters, merge near-duplicates, or enforce a token budget on a previously compressed knowledge_base.md.

# Enforce a 1500-token budget
memorypack prune compressed/knowledge_base.md --max-tokens 1500

# Drop clusters scoring below 0.3 importance, write to a new file
memorypack prune compressed/knowledge_base.md --min-importance 0.3 -o pruned.md

# Preview what would be removed without writing
memorypack prune compressed/knowledge_base.md --max-tokens 1000 --dry-run

# Skip near-duplicate merging
memorypack prune compressed/knowledge_base.md --max-tokens 1200 --no-merge

Options:

--max-tokens INT             Token budget (0 = no limit)
--min-importance FLOAT       Drop clusters below this score [0-1]
--similarity-threshold FLOAT Cosine threshold for duplicate detection (default: 0.80)
--no-merge                   Disable near-duplicate merging
--dry-run                    Print plan without writing
--device [cpu|cuda|mps]      GPU acceleration
-o, --output PATH            Output file (default: overwrite input)
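
To make the two main knobs concrete, here is a sketch of how a token budget and an importance floor could interact. It is not the actual `prune` implementation: the `Cluster` fields, the way importance is scored, and the drop order are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    label: str
    tokens: int        # size of summary plus facts
    importance: float  # hypothetical score in [0, 1]

def prune(clusters, max_tokens=0, min_importance=0.0):
    """Drop clusters below the importance floor, then trim to the token budget."""
    kept = [c for c in clusters if c.importance >= min_importance]
    if max_tokens > 0:  # 0 means no limit, mirroring --max-tokens
        kept.sort(key=lambda c: c.importance, reverse=True)
        # Remove the least important cluster until the total fits.
        while kept and sum(c.tokens for c in kept) > max_tokens:
            kept.pop()
    return kept
```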

`watch` — Auto-recompress on file changes

Poll a directory for .md changes and re-run compression automatically. Optionally auto-prune if output exceeds a token budget.

# Watch with 30-second interval, auto-prune at 2000 tokens
memorypack watch ~/knowledge/ --interval 30 --token-budget 2000 -o compressed/

# Watch with defaults (60s interval, no budget)
memorypack watch ~/knowledge/

Options:

--interval INT               Polling interval in seconds (default: 60)
--token-budget INT           Auto-prune threshold (0 = no limit)
--device [cpu|cuda|mps]      GPU acceleration
--topic STR                  Name for the output header
-o, --output DIR             Output directory (default: "compressed")
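
Polling of this kind is usually built on modification-time snapshots. The sketch below shows the idea with the standard library only; `snapshot`, `changed_paths`, `watch`, and the `max_polls` parameter are hypothetical names, and the real command may detect changes differently.

```python
import time
from pathlib import Path

def snapshot(root) -> dict[str, float]:
    """Map every .md file under root to its last-modified time."""
    return {str(p): p.stat().st_mtime for p in Path(root).rglob("*.md")}

def changed_paths(before: dict, after: dict) -> list[str]:
    """Paths that were added, removed, or modified between two snapshots."""
    return sorted(p for p in before.keys() | after.keys()
                  if before.get(p) != after.get(p))

def watch(root, on_change, interval=60, max_polls=None):
    """Call on_change(paths) whenever a poll sees different files."""
    previous = snapshot(root)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = snapshot(root)
        changed = changed_paths(previous, current)
        if changed:
            on_change(changed)  # e.g. re-run compression here
        previous = current
        polls += 1
```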

How It Fits Into a Larger System

memorypack produces static summaries. In a real app, you'd combine it with hybrid search — both semantic (vector) and full-text (BM25) — to get the best of both worlds:

You ask: "How's my health progress?"

Four things happen:

1. COMPRESSED SUMMARIES (memorypack output)
   → Overview paragraph (always loaded, ~200 tokens)
   → Health cluster summary + facts (~150 tokens)

2. HYBRID SEARCH (Vector + BM25)
   → Vector: embed your question, find semantically similar chunks
   → BM25: rank chunks by a TF-IDF-style score (term frequency weighted by inverse document frequency, with length normalization)
   → Merge both result lists via Reciprocal Rank Fusion (RRF, k=60)
   → Skip chunks that overlap with the summaries above

3. CLAUDE CODE SESSION INDEX
   → 968+ JSONL session files parsed and chunked
   → Past decisions, conversations, and context are searchable
   → Sessions indexed alongside memory files in both vector and BM25

4. SCORED MEMORIES (separate system)
   → Short extracted facts from past conversations
   → Matched by keyword/topic relevance

All four get stuffed into the LLM prompt.
The summaries cover the broad picture cheaply.
Hybrid search fills in specifics — BM25 catches exact terms that
  vector search misses, vector search catches meaning that BM25 misses.
Session history surfaces past decisions and context.
Nothing is lost — the originals are always one search away.

Why Hybrid Search?

Each search method has blind spots:

Method	Good at	Misses
Vector (cosine)	Semantic similarity, paraphrases, conceptual matches	Exact terms, rare proper nouns, specific numbers
BM25 (TF-IDF)	Exact keyword matches, rare terms, specific names	Synonyms, paraphrases, conceptual queries
Hybrid (RRF)	Both of the above; documents appearing in both lists get boosted scores	Only what both methods miss

RRF merging is simple and effective: score(doc) = Σ 1/(60 + rank_i(doc)). A document ranked #1 in both lists scores higher than one ranked #1 in only one.
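
The formula translates almost directly into code. In this sketch `rrf_merge` is a hypothetical name, each ranking is a list of document ids ordered best first, and ties are broken alphabetically only to keep the output deterministic:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

vector_hits = ["a", "b", "c"]
bm25_hits = ["a", "d", "b"]
merged = rrf_merge([vector_hits, bm25_hits])
```

Here "a" is ranked first in both lists and scores 2/61 ≈ 0.0328, while "d", ranked second in one list only, scores 1/62 ≈ 0.0161.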

Adaptive RAG Budget

When summaries match the question well (high cosine similarity between question and cluster centroids), fewer raw chunks are needed:

Summary match quality	Raw chunks fetched
Strong (sim ≥ 0.6)	5 chunks
Medium (sim ≥ 0.45)	8 chunks
Weak (sim < 0.45)	12-15 chunks

This saves 500-1,500 tokens per query without losing coverage.
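
The table above amounts to a small lookup on the best question-to-centroid similarity. A sketch, assuming that similarity has already been computed; `chunk_budget` and `weak_budget` are hypothetical names, and using 15 for the weak tier is an assumption (the table gives a range of 12-15):

```python
def chunk_budget(similarities, weak_budget=15):
    """Return how many raw chunks to fetch, given similarities to each cluster centroid."""
    best = max(similarities, default=0.0)
    if best >= 0.6:    # strong summary match
        return 5
    if best >= 0.45:   # medium match
        return 8
    return weak_budget  # weak match: lean on raw chunks
```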

Architecture

Compression Pipeline (memorypack)

~/.claude/memory/
├── domains/
│   ├── health/
│   │   ├── profile.md
│   │   ├── goals.md
│   │   └── current_state.md
│   ├── career/
│   └── financial/
└── ...
        │
        ▼
┌──────────────┐
│   CHUNKER    │  Split by headings, ~512 tokens each
└──────┬───────┘
       ▼
┌──────────────┐
│  EMBEDDER    │  all-MiniLM-L6-v2, 384-dim vectors
└──────┬───────┘
       ▼
┌──────────────┐
│   DEDUP      │  Union-find, cosine ≥ 0.92 → keep longest
└──────┬───────┘
       ▼
┌──────────────┐
│  CLUSTER     │  K-means, auto-select k via silhouette
└──────┬───────┘
       ▼
┌──────────────┐
│ SUMMARIZE    │  ~100 tokens per cluster + ~200 token overview
└──────┬───────┘
       ▼
┌──────────────┐
│   FACTS      │  Rule-based extraction, max 10 per cluster
└──────┬───────┘
       ▼
  compressed-knowledge.json
  (or knowledge_base.md)

Search Pipeline (runtime integration)

~/.claude/memory/      ─── chunker ────┐
(293 markdown files)                    │
                                        ├── Chunk[] ──┬── LanceDB (vector, 384-dim)
~/.claude/projects/    ─── session     │              │
(968+ JSONL sessions)      parser ─────┘              └── MiniSearch (BM25, TF-IDF)
                                                             │
                                   search(query) ────────────┤
                                   │  vector results ────────┤
                                   │  BM25 results ──────────┤
                                   │  RRF merge (k=60) ──────┘
                                   └── SearchResult[] (unified interface)

Key Parameters

Parameter	Default	What it does
`chunk_size`	512	Target tokens per chunk
`dedup_threshold`	0.92	How similar two chunks must be to count as duplicates
`min_clusters`	2	Minimum number of topic groups
`max_clusters`	10-20	Maximum number of topic groups
`summary_max_tokens`	150	Length of each cluster summary
`overview_max_tokens`	250	Length of the overall overview

Dependencies

sentence-transformers — local embeddings (no API calls)
transformers + torch — BART summarization (local) or swap for an API-based summarizer
scikit-learn — clustering algorithms
nltk — sentence tokenization
click + rich — CLI interface
