Morph Compact
Drop filler from chat history and code context at 33,000 tok/s. 50-70% token reduction, with every surviving line byte-for-byte identical to the input.
Compaction works by deleting entire lines from the input — it never rewrites or paraphrases. This means if more than ~10% of the context you feed in lives on a single line, compaction cannot selectively trim within that line and results will be poor. Split long single-line payloads (e.g., minified code or giant JSON blobs) into multiple lines before compacting.
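One way to follow that advice: pretty-print single-line JSON payloads before compacting, so the model can drop individual keys instead of the whole blob. This is an illustrative sketch (`splitForCompaction` is not part of the SDK; the 10% threshold mirrors the guidance above):

```typescript
// Pre-split single-line JSON payloads so compaction can trim inside them.
function splitForCompaction(text: string): string {
  const total = text.length;
  return text
    .split("\n")
    .map((line) => {
      if (line.length <= total * 0.1) return line; // already fine-grained
      try {
        // Pretty-print JSON: one key per line, so lines can be dropped individually
        return JSON.stringify(JSON.parse(line), null, 2);
      } catch {
        return line; // not JSON; split minified code by other means
      }
    })
    .join("\n");
}
```

Minified code needs a different splitter (e.g. a formatter), but the same principle applies: more lines means finer-grained deletion.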
| Spec | Value |
| --- | --- |
| Model | morph-compactor |
| Speed | 33,000 tok/s |
| Context window | 1M tokens |
| Typical reduction | 50-70% fewer tokens |
| Output | Verbatim lines from input (no rewriting) |

Quick Start

Logged in? Your API key auto-fills in the code blocks below. Otherwise, get it from your dashboard.
import { MorphClient } from '@morphllm/morphsdk';

const morph = new MorphClient({ apiKey: "YOUR_API_KEY" });

const result = await morph.compact({
  input: chatHistory,
  query: "How do I validate JWT tokens?",
});

// Pass compressed history to your LLM
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5-20250929",
  messages: [
    { role: "user", content: result.output },
    { role: "user", content: "How do I validate JWT tokens?" },
  ],
});
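`result.usage` reports token counts (see the API reference), so you can log the savings on each call. A small helper for that, assuming only the documented `input_tokens` / `output_tokens` fields:

```typescript
// Percent of tokens removed, computed from the usage fields on the result.
function savingsPercent(usage: { input_tokens: number; output_tokens: number }): number {
  return Math.round((1 - usage.output_tokens / usage.input_tokens) * 100);
}

// e.g. console.log(`saved ${savingsPercent(result.usage)}% of tokens`);
```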

Query-Conditioned Compression

The query parameter tells the model what matters. The model scores every line’s relevance to that query, then drops lines below the threshold.
// Same chat history, different queries, different output
const forAuth = await morph.compact({
  input: chatHistory,
  query: "JWT token validation",
});
// DB setup and CSS discussion dropped, auth code kept

const forDB = await morph.compact({
  input: chatHistory,
  query: "database connection pooling",
});
// Auth code dropped, DB setup kept
Without a query, the model auto-detects one from the last user message. Explicit queries give tighter compression.
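If you want the auto-detection behavior but stated explicitly, you can pull the query from the conversation yourself. A sketch (the `Msg` shape matches the `{role, content}` arrays the API accepts; `lastUserQuery` is illustrative):

```typescript
type Msg = { role: string; content: string };

// Use the most recent user turn as the explicit focus query.
function lastUserQuery(messages: Msg[]): string | undefined {
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "user") return messages[i].content;
  }
  return undefined; // no user turn found; fall back to auto-detection
}
```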

Line Ranges and Markers

By default, each message includes compacted_line_ranges (which lines were removed) and (filtered N lines) markers in the text. Both are configurable:
// Default: markers + ranges
const result = await morph.compact({
  input: codeFile,
  query: "auth middleware",
  compressionRatio: 0.5,
  preserveRecent: 0,
});

console.log(result.output);
// def authenticate():
//     ...
// (filtered 12 lines)
// def handle_request():
//     ...

for (const r of result.messages[0].compacted_line_ranges) {
  console.log(`lines ${r.start}-${r.end} removed`);
}

// No markers: empty lines instead of "(filtered N lines)"
await morph.compact({ input: codeFile, includeMarkers: false });

// No line ranges: skip tracking removed ranges
await morph.compact({ input: codeFile, includeLineRanges: false });
// result.messages[0].compacted_line_ranges === []
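Per the sample response in the API reference, ranges are 1-indexed and inclusive: a `{ start: 5, end: 10 }` range corresponds to a `(filtered 6 lines)` marker. That makes the total removed-line count easy to recover:

```typescript
type Range = { start: number; end: number };

// Sum the lines removed across all compacted ranges.
// Ranges are inclusive: {start: 5, end: 10} covers 6 lines.
function removedLineCount(ranges: Range[]): number {
  return ranges.reduce((sum, r) => sum + (r.end - r.start + 1), 0);
}
```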

Preserving Critical Context

Wrap sections you never want compressed in <keepContext> / </keepContext> tags. Tagged content survives compression verbatim regardless of the compression ratio.
const input = `
// Database connection setup
const pool = new Pool({ host: 'localhost', port: 5432 });

<keepContext>
// CRITICAL: Auth middleware - do not compress
function authenticate(req, res, next) {
  const token = req.headers.authorization?.split(' ')[1];
  if (!token) return res.status(401).json({ error: 'No token' });
  const decoded = jwt.verify(token, process.env.JWT_SECRET);
  req.user = decoded;
  next();
}
</keepContext>

// Logging utilities
function logRequest(req) { console.log(req.method, req.path); }
function logError(err) { console.error(err.stack); }
// ... 200 more lines of helpers
`;

const result = await morph.compact({
  input,
  query: "authentication",
  compressionRatio: 0.3,
});

// The authenticate() function is fully preserved.
// DB setup and logging helpers are compressed.
// The <keepContext> tags themselves are stripped from output.
Rules:
  • Tags must each be on their own line — an inline tag such as `code() <keepContext>` is not recognized
  • Tags must open and close within the same message
  • Kept content counts against the compression_ratio budget. If you keep 40% and request 0.5, the remaining 60% compresses harder to hit the target.
  • Unclosed <keepContext> preserves everything from the tag to the end of the message
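A quick pre-flight check for the own-line and same-message rules can catch unclosed tags before they silently preserve everything to the end of the message. This is an illustrative helper, not part of the SDK:

```typescript
// Returns true when every <keepContext> opens and closes on its own line
// within the message, per the rules above.
function keepContextBalanced(message: string): boolean {
  let open = 0;
  for (const line of message.split("\n")) {
    const t = line.trim();
    if (t === "<keepContext>") open++;
    else if (t === "</keepContext>") {
      if (open === 0) return false; // close without a matching open
      open--;
    }
  }
  return open === 0; // unclosed tags fail the check
}
```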
The response includes kept_line_ranges showing which lines were force-preserved:
for (const r of result.messages[0].kept_line_ranges) {
  console.log(`lines ${r.start}-${r.end} preserved via keepContext`);
}

API Reference

POST /v1/compact

The primary endpoint. Accepts string input or message arrays.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input | string or array | - | Text or {role, content} array. One of input/messages required. |
| messages | array | - | {role, content} messages. Takes priority over input. |
| query | string | auto-detected | Focus query for relevance-based pruning |
| compression_ratio | float | 0.5 | Fraction to keep. 0.3 = aggressive, 0.7 = light |
| preserve_recent | int | 2 | Keep last N messages uncompressed |
| compress_system_messages | bool | false | When true, system messages are also compressed. By default they are preserved verbatim. |
| include_line_ranges | bool | true | Include compacted_line_ranges in response |
| include_markers | bool | true | Include (filtered N lines) text markers. When false, gaps become empty lines |
| model | string | morph-compactor | Model ID |
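Conceptually, `preserve_recent` splits the conversation into a compressible head and an untouched tail. A sketch of that split (illustrative only — the API does this server-side):

```typescript
// The last `preserveRecent` messages bypass compression entirely.
function splitPreserveRecent<T>(
  messages: T[],
  preserveRecent = 2, // matches the documented default
): { compress: T[]; keep: T[] } {
  const cut = Math.max(0, messages.length - preserveRecent);
  return { compress: messages.slice(0, cut), keep: messages.slice(cut) };
}
```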
Response
{
  "id": "cmpr-7373faf8af65",
  "object": "compact",
  "model": "morph-compactor",
  "output": "def hello():\n    print(\"hello world\")\n(filtered 6 lines)\ndef world():\n    return 42",
  "messages": [
    {
      "role": "user",
      "content": "def hello():\n    print(\"hello world\")\n(filtered 6 lines)\ndef world():\n    return 42",
      "compacted_line_ranges": [{ "start": 5, "end": 10 }],
      "kept_line_ranges": []
    }
  ],
  "usage": {
    "input_tokens": 101,
    "output_tokens": 65,
    "compression_ratio": 0.644,
    "processing_time_ms": 109
  }
}
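In the usage block, `compression_ratio` is the kept fraction, i.e. output_tokens / input_tokens. Checking that against the sample response above:

```typescript
// Values from the sample response: 65 / 101 ≈ 0.644, the reported compression_ratio.
const usage = { input_tokens: 101, output_tokens: 65, compression_ratio: 0.644 };
const keptFraction = usage.output_tokens / usage.input_tokens;
console.log(keptFraction.toFixed(3)); // "0.644"
```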

POST /v1/chat/completions

OpenAI Chat Completions format. Drop-in replacement for any OpenAI-compatible client pointed at https://api.morphllm.com/v1. Supports streaming via stream: true.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | morph-compactor |
| messages | array | Yes | {role, content} message array |
| compression_ratio | float | No | Fraction to keep (default 0.5) |
| query | string | No | Focus query for relevance-based pruning |
| stream | bool | No | Enable SSE streaming |
{
  "id": "cmpr-def456",
  "object": "chat.completion",
  "model": "morph-compactor",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "compressed text..." },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 4200, "completion_tokens": 1800, "total_tokens": 6000 }
}

POST /v1/responses

OpenAI Responses API format. Works with OpenAI SDK v5+ (TS) or v1.66+ (Python) pointed at https://api.morphllm.com/v1.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | morph-compactor |
| input | string or array | Yes | Text or {role, content} array |
| query | string | No | Focus query for relevance-based pruning |
{
  "id": "cmpr-abc123",
  "object": "response",
  "model": "morph-compactor",
  "output": [{
    "type": "message",
    "role": "assistant",
    "content": [{ "type": "output_text", "text": "compressed text..." }]
  }],
  "usage": { "input_tokens": 4200, "output_tokens": 1800 }
}
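Extracting the compressed text from the Responses-format `output` array takes a little unwrapping. A sketch against the shape shown above (type names are illustrative):

```typescript
type OutputItem = {
  type: string;
  role: string;
  content: { type: string; text: string }[];
};

// Join all output_text parts from a Responses-format output array.
function responseText(output: OutputItem[]): string {
  return output
    .filter((item) => item.type === "message")
    .flatMap((item) => item.content)
    .filter((c) => c.type === "output_text")
    .map((c) => c.text)
    .join("");
}
```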

Errors

| Status | Meaning |
| --- | --- |
| 400 | Malformed request or input too large |
| 401 | Invalid API key |
| 503 | Model not loaded |
| 504 | Request timed out |
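503 and 504 are transient and worth retrying (the SDK's `retryConfig` handles this for you); 400 and 401 need a fixed request or key. A sketch of that triage for raw-HTTP callers:

```typescript
// Map the status codes above to a client action.
function errorAction(status: number): "fix_request" | "fix_key" | "retry" | "ok" {
  if (status === 400) return "fix_request"; // malformed request or input too large
  if (status === 401) return "fix_key";     // invalid API key
  if (status === 503 || status === 504) return "retry"; // transient: retry with backoff
  return "ok";
}
```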

SDK Reference

CompactInput
{
  input?: string | Array<{ role: string, content: string }>,
  messages?: Array<{ role: string, content: string }>,
  query?: string,
  compressionRatio?: number,    // 0.05-1.0, default 0.5
  preserveRecent?: number,      // default 2
  includeLineRanges?: boolean,  // default true
  includeMarkers?: boolean,     // default true
  model?: string,
}
CompactResult
{
  id: string,
  output: string,              // all messages joined
  messages: Array<{
    role: string,
    content: string,
    compacted_line_ranges: Array<{ start: number, end: number }>,
    kept_line_ranges: Array<{ start: number, end: number }>,  // force-preserved via <keepContext>
  }>,
  usage: { input_tokens, output_tokens, compression_ratio, processing_time_ms },
  model: string,
}
CompactConfig
{
  morphApiKey?: string,     // defaults to MORPH_API_KEY env
  morphApiUrl?: string,
  timeout?: number,         // defaults to 120000 (2 min)
  retryConfig?: RetryConfig,
  debug?: boolean,
}

Edge / Cloudflare Workers

import { CompactClient } from '@morphllm/morphsdk/edge';

export default {
  async fetch(request: Request, env: Env) {
    const compact = new CompactClient({ morphApiKey: env.MORPH_API_KEY });
    const { input, query } = await request.json();

    const result = await compact.compact({ input, query });
    return Response.json({ output: result.output, usage: result.usage });
  }
};

Best Practices

Keep recent messages verbatim

Set preserve_recent to at least 3. Recent turns contain the user’s active intent and the assistant’s latest reasoning. Compacting them risks dropping context the LLM needs right now.

Always pass a query

Without it, the model falls back to auto-detection from the last user message. An explicit query gives tighter, more relevant compression because the model knows exactly which lines to score.

Compact before the LLM call

The value is in reducing what you send to your LLM. Compacting a response after generation saves storage but doesn’t cut inference cost.

Tune compression_ratio

The default of 0.5 is a good starting point. For agent loops past 100 turns, try 0.3; for shorter conversations where nuance matters, try 0.7.
const result = await morph.compact({
  input: chatHistory,
  query: "rate limiting",
  preserveRecent: 3,  // last 3 messages pass through untouched
});
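The tuning guidance above can be folded into a starting-point heuristic. The turn-count thresholds here are illustrative, not prescribed by the API:

```typescript
// Pick a compression_ratio from conversation length, per the best practices.
function pickCompressionRatio(turnCount: number): number {
  if (turnCount >= 100) return 0.3; // long agent loops: aggressive
  if (turnCount <= 20) return 0.7;  // short, nuance-sensitive conversations: light
  return 0.5;                       // documented default
}
```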

FAQ

How is this different from summarization?
Summarization rewrites. Compaction deletes. Every sentence that survives compaction is character-for-character identical to your original. No drift, no hallucinated context.

When should I use it?
Agent loops past 50 turns. Multi-session chat where you carry history forward. Any pipeline where you pay for tokens that are greetings and filler.

Should I pass a string or a message array?
Both work. A conversation array preserves role structure through compaction.

Does it work with the OpenAI SDK?
Yes. Point baseURL at https://api.morphllm.com/v1 and call client.chat.completions.create() with model: "morph-compactor". The Responses API (client.responses.create()) also works.

How fast is it?
33,000 tok/s. At that rate, 100K tokens compresses in about 3 seconds and 180K in about 5.5 seconds.

Notes for AI Coding Agents

If you are an AI coding agent integrating Morph Compact into a project, follow these steps:
1. Understand the existing compaction logic. Before writing any code, read the project's current context management. This might be in pre-commit hooks, middleware, a context window manager, or a custom prompt builder. Search for keywords like compact, compress, truncate, summarize, or context in the codebase.
2. Choose the right integration method.
  • If the project uses TypeScript/JavaScript, use the @morphllm/morphsdk package — it handles retries, types, and edge runtime support out of the box.
  • If the project uses Python with OpenAI SDK, point the client at https://api.morphllm.com/v1 and call chat.completions.create() with model: "morph-compactor".
  • For any other language, call POST /v1/compact directly with a JSON body containing input (or messages) and optionally query.
3. Follow best practices.
  • Always pass query — set it to the user’s current question or task description for much better compression.
  • Set preserve_recent to at least 3 to keep the latest turns intact.
  • Compact before sending to the LLM, not after — the goal is to reduce inference cost.
  • Use compression_ratio: 0.3 for long agent loops (100+ turns), 0.7 for shorter conversations.
  • Wrap critical context in <keepContext> tags if certain sections must never be removed.
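For the raw-HTTP path in step 2, these practices combine into a request body like the following. Field names follow the API reference's snake_case wire parameters; the helper itself is illustrative:

```typescript
type Msg = { role: string; content: string };

// Assemble a POST /v1/compact request body per the best practices above.
function buildCompactBody(messages: Msg[], query: string) {
  return {
    messages,                // {role, content} array preserves role structure
    query,                   // always pass an explicit query
    compression_ratio: 0.5,  // tune: 0.3 for long loops, 0.7 for short chats
    preserve_recent: 3,      // keep the latest turns intact
  };
}
```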