Zero-dependency auto memory compression for long-running AI agent sessions.
Inspired by AgentScope's auto memory compression mechanism.
When an agent's conversation history grows too large, agent-compress automatically summarises the older messages into a structured schema, keeps only recent context, and injects the summary as a system message — enabling infinite-context-like behaviour with any LLM.
Every LLM has a finite context window. Long agent sessions hit limits fast.
Common solutions: truncate (lose context) or sliding window (lose coherence).
agent-compress does better: summarise the past, keep the present.
| Approach | Tokens used | Context quality |
|---|---|---|
| No compression | 100 % | ✅ Full |
| Truncation | 20–30 % | ❌ Lost context |
| Sliding window | 20–30 % | ❌ Lost coherence |
| agent-compress | 10–20 % | ✅ Structured summary + recent |
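The core idea fits in a few lines of plain Python. This is a toy sketch for intuition only, not the library's implementation — the real summary is structured, not a one-line placeholder:

```python
def compress_history(messages, keep_recent=3):
    """Toy illustration: fold old messages into a summary, keep the recent tail."""
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system",
               "content": f"[COMPRESSED MEMORY SUMMARY] {len(old)} earlier messages"}
    return [summary] + list(recent)

history = [{"role": "user", "content": f"message {i}"} for i in range(10)]
context = compress_history(history)
# 10 messages shrink to 1 summary + 3 recent = 4 entries
```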
```bash
pip install agent-compress
```

Or from source:

```bash
git clone https://github.com/darshjme/agent-compress
cd agent-compress
pip install -e .
```

No external dependencies. Pure Python 3.9+.
```python
from agent_compress import MemoryCompressor, CompressionConfig

# 1. Configure
config = CompressionConfig(
    trigger_threshold=50_000,   # compress when estimated tokens exceed this
    keep_recent=3,              # keep last N messages uncompressed
    summary_max_chars=2_000,    # max summary length
    compression_marker="__compressed__",
)
compressor = MemoryCompressor(config)

# 2. Add messages (OpenAI-style dicts, plain strings, or any object)
for msg in conversation_history:
    compressor.add(msg)

# 3. Compress when needed
if compressor.needs_compression():
    summary = compressor.compress()
    print(f"Compressed {summary.compressed_count} messages")
    print(f"Saved ~{summary.token_estimate} tokens")
    print(summary.task_overview)

# 4. Get context to send to your LLM
context = compressor.get_context()
# → [{"role": "system", "content": "[COMPRESSED MEMORY SUMMARY]..."}, ...recent_msgs]
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=context,
)
```

| Parameter | Default | Description |
|---|---|---|
| `trigger_threshold` | `50_000` | Token estimate that triggers compression |
| `keep_recent` | `3` | Number of recent messages to keep uncompressed |
| `summary_max_chars` | `2_000` | Max chars in the generated summary |
| `compression_marker` | `"__compressed__"` | Key added to compressed message dicts |
| `chars_per_token` | `4.0` | Char-to-token ratio for estimation |
| `min_messages_to_compress` | `4` | Minimum messages before compression runs |
| `summarizer` | `None` | Optional custom summarizer callable |
| Method / Property | Description |
|---|---|
| `.add(message)` | Add a message to the buffer |
| `.add_many(messages)` | Add multiple messages at once |
| `.needs_compression()` | Returns `True` if compression is needed |
| `.compress()` | Compresses old messages, returns `CompressedSummary` |
| `.get_context()` | Returns `[system_summary_msg, ...recent_msgs]` |
| `.total_tokens` | Estimated token count of uncompressed messages |
| `.compression_history` | List of `CompressionRecord` audit entries |
| `.latest_summary` | The most recent `CompressedSummary` |
| `.stats()` | Dict snapshot of compressor state |
| `.reset()` | Clear all state |
```python
summary.task_overview        # what was being done
summary.current_state        # what has been completed
summary.discoveries          # key findings, errors, decisions
summary.next_steps           # outstanding work
summary.context_to_preserve  # constraints, preferences, settings
summary.compressed_count     # number of messages compressed
summary.token_estimate       # tokens saved
summary.to_text()            # render as system message string
summary.to_dict()            # serialise to dict
```

The default summarizer is fully rule-based (no LLM required). For higher-quality summaries, plug in your own:
```python
import openai

from agent_compress import CompressedSummary, CompressionConfig, MemoryCompressor

def llm_summarizer(messages, max_chars=2000):
    prompt = f"Summarise these {len(messages)} messages into a structured summary..."
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt + str(messages)}],
        max_tokens=500,
    )
    text = response.choices[0].message.content
    return CompressedSummary(
        task_overview=text,
        compressed_count=len(messages),
    )

config = CompressionConfig(summarizer=llm_summarizer)
compressor = MemoryCompressor(config)
```

- Before compression: 8 messages, ~4,800 chars, ~1,200 tokens
- After compression: 1 system summary + 3 recent messages
- Summary: ~400 chars (~100 tokens)
- Recent: ~1,800 chars (~450 tokens)
- Compression ratio: ~54 % token reduction
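The ratio above follows directly from the token estimates:

```python
before = 1200      # ~1,200 tokens across 8 uncompressed messages
after = 100 + 450  # summary (~100 tokens) + recent messages (~450 tokens)
reduction = 1 - after / before  # ≈ 0.54, i.e. ~54 % fewer tokens sent to the LLM
```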
agent-compress uses character-based estimation: tokens ≈ chars / `chars_per_token`.
The default ratio of 4.0 matches GPT-4 tokenisation for typical English prose.
Adjust via `CompressionConfig(chars_per_token=3.5)` for code-heavy sessions.
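The heuristic is simple enough to sketch in full (`estimate_tokens` is an illustrative name, not necessarily the library's internal one):

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Approximate token count from character length alone."""
    return int(len(text) / chars_per_token)

estimate_tokens("x" * 4_000)        # → 1000 with the default English-prose ratio
estimate_tokens("x" * 4_000, 3.5)   # lower ratio for code-heavy text yields more tokens
```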
Works with any message format:

- OpenAI `{"role": "...", "content": "..."}` dicts
- Anthropic-style messages
- Plain strings
- Custom objects with `.content` or `.text` attributes
- LangChain `BaseMessage` objects
- AutoGen / CrewAI message objects
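One way such format-agnostic handling can work — a hedged sketch; `extract_content` is a hypothetical helper, not part of the public API:

```python
def extract_content(msg):
    """Pull text out of a message, whatever shape it arrives in."""
    if isinstance(msg, str):
        return msg                            # plain string
    if isinstance(msg, dict):                 # OpenAI / Anthropic-style dicts
        return str(msg.get("content", ""))
    for attr in ("content", "text"):          # LangChain BaseMessage, custom objects
        if hasattr(msg, attr):
            return str(getattr(msg, attr))
    return str(msg)                           # last resort: stringify the object
```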
```bash
git clone https://github.com/darshjme/agent-compress
cd agent-compress
pip install -e ".[dev]"
pytest tests/ -v
```

MIT © Darshankumar Joshi