## What is MemScore?
MemScore is a composite metric that captures three dimensions of memory provider performance in a single line:
```
accuracy% / latencyMs / contextTok
```
For example:

```
85% / 120ms / 1500tok
```

This tells you the provider achieved 85% accuracy, with an average search latency of 120ms, sending 1,500 tokens of context to the answering model per question.
## Components
| Component | What it measures | Source |
|---|---|---|
| Quality | Answer accuracy as a percentage | `(correct / total) * 100` from judge evaluations |
| Latency | Average search response time in milliseconds | Mean of all search-phase durations |
| Tokens | Average context tokens sent to the answering model | Client-side token count of retrieved context per question |
MemScore is not a single number — it’s a triple. This is intentional. Collapsing quality, latency, and cost into one score hides important tradeoffs. A provider with 90% accuracy at 5,000 tokens is very different from one with 90% accuracy at 500 tokens.
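To make the shape of the triple concrete, here is a minimal TypeScript sketch of assembling the display string from its components. The `MemScoreComponents` shape and `formatMemScore` name are illustrative, not MemoryBench's actual API:

```typescript
// Illustrative only: combines the three components into the display string.
interface MemScoreComponents {
  quality: number;       // answer accuracy, percent
  latencyMs: number;     // mean search latency, milliseconds
  contextTokens: number; // mean context tokens sent per question
}

function formatMemScore(c: MemScoreComponents): string {
  return `${Math.round(c.quality)}% / ${Math.round(c.latencyMs)}ms / ${Math.round(c.contextTokens)}tok`;
}

console.log(formatMemScore({ quality: 86, latencyMs: 145, contextTokens: 1823 }));
// → "86% / 145ms / 1823tok"
```

Note that formatting is one-way by design: the structured components are the source of truth, and the string exists only for display.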
## How token counting works
MemoryBench counts tokens client-side using provider-specific tokenizers:
| Model provider | Tokenizer | Method |
|---|---|---|
| OpenAI | `js-tiktoken` | Exact count using the `o200k_base` or `cl100k_base` encoding |
| Anthropic | `@anthropic-ai/tokenizer` | Exact count using Anthropic’s tokenizer |
| Google | Approximation | `Math.ceil(text.length / 4)` |
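The approximation row is simple enough to sketch directly; the exact `js-tiktoken` and `@anthropic-ai/tokenizer` paths are omitted here so the snippet stays dependency-free:

```typescript
// Approximate token count for providers without a client-side tokenizer:
// roughly 4 characters per token, rounded up.
function approxTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

console.log(approxTokenCount("a".repeat(6000))); // → 1500
```

The 4-characters-per-token heuristic is coarse, so counts for Google models should be treated as estimates rather than exact figures when comparing against the tokenizer-backed providers.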
Three token values are tracked per question:

- `promptTokens` — total tokens in the full prompt (instructions + context + question)
- `basePromptTokens` — tokens in the prompt without any retrieved context
- `contextTokens` — tokens in just the retrieved context string
MemScore uses `contextTokens` because it isolates what the memory provider actually contributed.
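A toy example of the three values for a single question, using the 4-characters-per-token approximation in place of the real tokenizers (the strings and the `countTokens` helper are made up for illustration):

```typescript
// Stand-in for the provider-specific tokenizers described above.
const countTokens = (text: string): number => Math.ceil(text.length / 4);

const instructions = "Answer using only the supplied context.";
const question = "Where did the user say they went on holiday?";
const context = "User: I spent two weeks hiking in Patagonia last March.";

// Full prompt: instructions + retrieved context + question.
const promptTokens = countTokens(`${instructions}\n\n${context}\n\n${question}`);
// Same prompt with no retrieved context.
const basePromptTokens = countTokens(`${instructions}\n\n${question}`);
// Just the retrieved context string: what the memory provider contributed.
const contextTokens = countTokens(context);

console.log({ promptTokens, basePromptTokens, contextTokens });
```

Note that `promptTokens` is not exactly `basePromptTokens + contextTokens`, because prompt formatting around the context adds its own tokens; this is another reason `contextTokens` is counted on the context string directly.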
## Where MemScore appears
### CLI output
After a benchmark run completes, MemScore is printed in the summary:
```
SUMMARY:
Total Questions: 50
Correct: 43
Accuracy: 86.00%
Quality: 86%
Latency: 145ms (avg)
Tokens: 1,823 (avg context sent to answering model)
MemScore: 86% / 145ms / 1823tok
```
### Web UI
The MemScore card appears at the top of the run overview page. Per-question token counts are shown next to each model answer in both the question list and detail views.
### Report JSON
The report.json file includes both a display string and structured components:
```json
{
  "memscore": "86% / 145ms / 1823tok",
  "memscoreComponents": {
    "quality": 86,
    "latencyMs": 145,
    "contextTokens": 1823
  },
  "tokens": {
    "totalTokens": 142500,
    "basePromptTokens": 21000,
    "contextTokens": 91150,
    "avgTokensPerQuestion": 2850,
    "avgBasePromptTokens": 420,
    "avgContextTokens": 1823
  }
}
```
Use `memscoreComponents` for programmatic comparisons — it avoids parsing the display string.
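For instance, a consumer script might read the structured components like this (a sketch; the report is abbreviated to the relevant fields):

```typescript
// Parse the structured components instead of the "86% / 145ms / 1823tok" string.
const report = JSON.parse(`{
  "memscore": "86% / 145ms / 1823tok",
  "memscoreComponents": { "quality": 86, "latencyMs": 145, "contextTokens": 1823 }
}`);

const c: { quality: number; latencyMs: number; contextTokens: number } =
  report.memscoreComponents;

// Example gate: require at least 85% accuracy under 2,000 context tokens.
console.log(c.quality >= 85 && c.contextTokens <= 2000); // → true
```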
## Comparing providers
MemScore is most useful when comparing providers on the same benchmark:
```bash
bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -j gpt-4o
```
Each provider’s report will include its own MemScore, making it easy to see tradeoffs at a glance:
| Provider | MemScore |
|---|---|
| Provider A | 88% / 145ms / 1200tok |
| Provider B | 82% / 80ms / 2400tok |
| Provider C | 85% / 110ms / 1800tok |
In this example, Provider A has the highest accuracy but the slowest search. Provider B is the fastest but sends the most context without achieving the best accuracy — suggesting its retrieval may be less precise. Provider C lands in the middle on all three axes. There’s no single “winner” — the right choice depends on whether you prioritize quality, speed, or token efficiency.
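One way to surface those tradeoffs programmatically is to rank providers per axis rather than forcing a single winner. This sketch uses the example table's numbers; `best` is an illustrative helper, not part of MemoryBench:

```typescript
interface ProviderScore {
  name: string;
  quality: number;       // higher is better
  latencyMs: number;     // lower is better
  contextTokens: number; // lower is better
}

const scores: ProviderScore[] = [
  { name: "Provider A", quality: 88, latencyMs: 145, contextTokens: 1200 },
  { name: "Provider B", quality: 82, latencyMs: 80, contextTokens: 2400 },
  { name: "Provider C", quality: 85, latencyMs: 110, contextTokens: 1800 },
];

// Return the best provider on one axis; sort a copy so `scores` is untouched.
const best = (key: keyof Omit<ProviderScore, "name">, lowerIsBetter = false) =>
  [...scores].sort((a, b) =>
    lowerIsBetter ? a[key] - b[key] : b[key] - a[key]
  )[0].name;

console.log(best("quality"));             // → "Provider A"
console.log(best("latencyMs", true));     // → "Provider B"
console.log(best("contextTokens", true)); // → "Provider A"
```

Per-axis ranking keeps the tradeoff visible: Provider A wins two axes here, but a latency-sensitive deployment might still pick Provider B.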
## Backward compatibility
Runs from before MemScore was added will still work. If token data is not present in the checkpoint, the `memscore`, `memscoreComponents`, and `tokens` fields will be `undefined` in the report. The CLI and web UI gracefully skip the MemScore display when data is unavailable.
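Consumers of older reports can do the same with a defensive read (a small sketch; the fallback string is illustrative):

```typescript
// Pre-MemScore reports omit the memscore fields entirely, so fall back
// instead of assuming they exist.
const legacyReport: { memscore?: string } = JSON.parse('{"totalQuestions": 50}');
const display = legacyReport.memscore ?? "MemScore unavailable";
console.log(display); // → "MemScore unavailable"
```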