research(memory): OCR-Memory — visual trajectory encoding for scalable long-horizon agent memory without lossy summarization #3571

@bug-ops

Description

OCR-Memory (arXiv:2604.26622, April 29, 2026) proposes a memory framework that encodes historical agent trajectories as images rather than text, using the visual modality as a high-density representation of agent experience.

Key Technical Approach

Current agent memory systems face a fundamental constraint: token budgets. Storing raw trajectories is prohibitively expensive; summarization loses information; text-only retrieval returns fragmented evidence.

OCR-Memory addresses this by:

  1. Rendering historical trajectories (tool calls, observations, reasoning chains) into annotated images with unique visual identifiers
  2. Retrieval via a locate-and-transcribe paradigm: visual anchors select relevant image regions; retrieval becomes explicit index selection rather than free-form generation
  3. Adaptive resolution and active-recall up-sampling: look far with manageable token cost while preserving high fidelity for salient memories

Key property: encoding into visual tokens avoids the trade-off between memory capacity and completeness — arbitrarily long histories can be stored without lossy summarization or truncation.
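The locate-and-transcribe paradigm above can be illustrated with a toy sketch. This is not the paper's implementation; all names (`MemoryRegion`, `render_trajectory`, `locate`, `transcribe`) are hypothetical, and string matching stands in for the visual-embedding match a real system would use. The point it demonstrates is the paper's framing: retrieval selects explicit anchor indices rather than generating recalled text free-form, and transcription recovers content only from the selected regions.

```python
from dataclasses import dataclass

@dataclass
class MemoryRegion:
    anchor_id: str   # unique visual identifier stamped on the image region
    step_kind: str   # "tool_call", "observation", or "reasoning"
    content: str     # what transcribing this region would recover

def render_trajectory(steps):
    """Assign a unique anchor ID to each step, as the rendered image would."""
    return [
        MemoryRegion(anchor_id=f"A{i:03d}", step_kind=kind, content=text)
        for i, (kind, text) in enumerate(steps)
    ]

def locate(regions, query):
    """'Locate': select anchor IDs whose content matches the query.
    (A real system would match against visual embeddings, not substrings.)"""
    return [r.anchor_id for r in regions if query in r.content]

def transcribe(regions, anchor_ids):
    """'Transcribe': recover text only from the selected regions."""
    index = {r.anchor_id: r.content for r in regions}
    return [index[a] for a in anchor_ids]

steps = [
    ("tool_call", "run pytest on module foo"),
    ("observation", "3 tests failed in foo/test_bar.py"),
    ("reasoning", "failures caused by renamed fixture"),
]
regions = render_trajectory(steps)
anchors = locate(regions, "failed")  # explicit index selection, not generation
print(transcribe(regions, anchors))  # ['3 tests failed in foo/test_bar.py']
```

Because retrieval returns indices into a fixed store, the recalled content is exact rather than paraphrased, which is what lets arbitrarily long histories survive without lossy summarization.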

Relevance to Zeph

Zeph's current memory pipeline:

  • Short-term: sliding context window (lost on compact)
  • Long-term: MAGMA graph (structured entities/relations), SYNAPSE spreading activation
  • Episodic: scene storage (SQLite)
  • Compaction: microcompact + hard compact (lossy summarization)

OCR-Memory is not a replacement but a retrieval complement: visual trajectory snapshots could serve as a dense, lossless episodic store for complex multi-step tool workflows (e.g., a 40-turn code refactor session), where text summarization loses critical intermediate state.

Potential integration point: zeph-memory episodic store — for sessions exceeding a token threshold, render the trajectory to a compact visual artifact (JSON → canvas → PNG) and store as a Qdrant vector point with the image embedding. Retrieval identifies the relevant session snapshot; transcription extracts the needed context region.
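A minimal sketch of that pipeline, under loudly stated assumptions: `TOKEN_THRESHOLD`, `render_png`, `embed_image`, and `maybe_snapshot` are all hypothetical names, the JSON serialization is a stand-in for real canvas-to-PNG rendering, and an in-memory dict stands in for the Qdrant collection so the sketch is self-contained.

```python
import hashlib
import json

TOKEN_THRESHOLD = 8_000  # assumed cutoff for "long" sessions

def render_png(trajectory: list[dict]) -> bytes:
    """Placeholder for the JSON -> canvas -> PNG rendering step."""
    return json.dumps(trajectory).encode()  # stand-in for real PNG bytes

def embed_image(png: bytes) -> list[float]:
    """Placeholder image embedding; a real system would use a VLM encoder."""
    digest = hashlib.sha256(png).digest()
    return [b / 255.0 for b in digest[:8]]

episodic_store: dict[str, dict] = {}  # stands in for a Qdrant collection

def maybe_snapshot(session_id: str, trajectory: list[dict], token_count: int) -> bool:
    """Snapshot a session only when it exceeds the token threshold."""
    if token_count < TOKEN_THRESHOLD:
        return False  # session fits the normal compaction path
    png = render_png(trajectory)
    episodic_store[session_id] = {
        "vector": embed_image(png),    # image embedding for similarity search
        "payload": {"artifact": png},  # snapshot kept losslessly for transcription
    }
    return True

trajectory = [{"turn": i, "tool": "edit_file"} for i in range(40)]
maybe_snapshot("refactor-session", trajectory, token_count=12_000)
```

In a real integration the dict would be replaced by an upsert into a Qdrant collection, with the PNG artifact (or a pointer to it) in the point payload so the transcription step can recover the exact context region later.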

Practical concerns: retrieval transcription requires a VLM; per-session storage footprint; rendering infrastructure. Likely P3 research until Zeph has a VLM integration path.

Metadata

Labels

  • P3: Research, medium-high complexity
  • memory: zeph-memory crate (SQLite)
  • research: Research-driven improvement
