MRAgent

This repository contains the code for the paper "Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents" (arXiv:2606.06036).

A retrieval-augmented question-answering system that builds a graph-structured episodic memory from long multi-session dialogues and answers questions through an LLM tool-calling reasoning loop. The system is evaluated on the LoCoMo and LongMemEval (LM) benchmarks.

The pipeline has two phases:

Phase 1 — Build the graph memory (once per conversation sample):

rewrite — rewrite each dialogue turn into a self-contained sentence: resolve pronouns to explicit entities, convert relative times to absolute YYYY-MM-DD dates, attach a topic tag, and extract topics and person-level facts.
extract_keyword — extract salient keywords for each rewritten sentence.
store — build the in-memory graph from the above: key nodes, episode / topic / personal events, and the links between them.

Phase 2 — Answer questions (per question):

answer — run a tool-calling reasoning loop (keyword / topic / personal / temporal / context tools) to produce a short final answer.

1. Repository Structure

run.py                    # main entry point
common/
    config.py             # CLI args, models, paths
    utils.py              # JSON + similarity helpers
    logging_utils.py      # per-sample logging
memory/
    system.py             # in-memory graph store
    controller.py         # graph query tools
llm/
    controller.py         # LLM tool-calling wrapper
    embeddings.py         # text-embedding client
    rag_utils.py          # batched embedding helper
agent/
    agent.py              # pipeline orchestration
    tools.py              # tool schemas + dispatch
prompts/
    prompts.py            # all LLM prompts
    schema.py             # output JSON validators
data/
    get_data.py           # load benchmark dataset
    embed_rewrite.py      # build embedding files
    dataset_locomo.json   # LoCoMo benchmark data
    dataset_LM.json       # LongMemEval benchmark data
eval/
    judge.py              # LLM-as-judge scorer
    evaluation.py         # F1 / EM helpers
    evaluate_reasoning.py # eval entry (F1 + judge)

2. Installation

Python 3.9+. The LongMemEval dataset (data/dataset_LM.json) is stored with Git LFS, so install Git LFS before cloning, otherwise that file arrives as a small pointer stub.

# install Git LFS once: https://git-lfs.com   (e.g. `apt install git-lfs` / `brew install git-lfs`)
git lfs install
git clone https://github.com/Ji-shuo/MRAgent.git
cd MRAgent

pip install -r requirements.txt

torch is used only for embedding tensor ops (L2 normalization); a CPU build is sufficient. nltk is required only by the eval/ scripts.

3. Configuration

All components — the chat LLM, the text-embedding model, and the LLM-as-judge evaluator — are accessed through a single OpenRouter key (OpenAI-compatible API). The key is read from a .env file at the repository root; no key is hard-coded.

Copy the template and fill in your key:

cp .env.example .env
# then edit .env:
# OPENROUTER_API_KEY=sk-or-v1-xxxxxxxx

.env is git-ignored. The same OPENROUTER_API_KEY is used everywhere:

Component	File	Model
Chat / reasoning	`common/config.py` (`--model` → `MODEL`)	e.g. `gemini` → `google/gemini-2.5-flash`
Embedding	`llm/embeddings.py`	`text-embedding-3-large` (3072-d)
LLM-as-judge	`eval/judge.py`	`openai/gpt-4o-mini`

4. Data Layout

The two benchmarks shipped in data/ come from:

LoCoMo (dataset_locomo.json) — Maharana et al., Evaluating Very Long-Term Conversational Memory of LLM Agents, ACL 2024. arXiv:2402.17753
LongMemEval (dataset_LM.json) — Wu et al., LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory, ICLR 2025. arXiv:2410.10813

Place the benchmark file at data/dataset_<name>.json. Generated intermediate artifacts and results are written under per-dataset subfolders:

data/<dataset>/rewrite_<model>/<sample_id>_rewrite.json   # stage 1 output
data/<dataset>/keyword_<model>/<sample_id>_keyword.json       # stage 3 output
data/<dataset>/embedding/gpt_<model>/<sample_id>_embedding.pkl# stage 2 output
result/<dataset>/<sample_id>_result_<model>_<file>.jsonl      # predictions (the only run output)

The run writes a single output per sample — the _result_*.jsonl predictions file (one JSON line per question: gold answer, prediction, category, evidence labels, retrieved support). A stage is skipped if its output file already exists, so generation runs once and subsequent runs reuse the cached rewrite / keyword / embedding files.

5. Usage

The single entry point is run.py, invoked from the repository root.

5.1 Arguments

Argument	Meaning	Default
`--data`	dataset name (`locomo` / `LM`)	`locomo`
`--model`	chat model short name (`gemini` / `claude` / `gpt4o` / `qwen`)	`gemini`
`--file`	run/experiment tag appended to result filenames	`0`
`--sample`	(LoCoMo) run a single sample id, e.g. `42`; omit to run all	`None`
`--ca`	(LM) category index: `0`=multi-session, `1`=single-session-user, `2`=temporal-reasoning	`1`
`--lm_batch`	(LM) sessions merged per rewrite call (`1` recommended)	`1`

5.2 LoCoMo

# all conversations
python run.py --data locomo --model gemini --file myrun

# a single conversation
python run.py --data locomo --model gemini --file myrun --sample 42

5.3 LongMemEval (LM)

--ca selects the question category (one run per category):

python run.py --data LM --model gemini --file myrun --ca 0 --lm_batch 10   # multi-session
python run.py --data LM --model gemini --file myrun --ca 1 --lm_batch 10   # single-session-user
python run.py --data LM --model gemini --file myrun --ca 2 --lm_batch 10   # temporal-reasoning

--lm_batch controls rewrite granularity. --lm_batch 1 (default) rewrites one session per LLM call and produces per-session records compatible with all downstream readers. Values >1 merge multiple sessions per call (range-keyed records, handled by the robust readers and the origin-prefixed graph store).

Each question is answered concurrently (10 worker threads); predictions stream to the result/<dataset>/ files and runs are resumable (already-answered questions are skipped on restart).

6. Evaluation

# F1 + LLM-as-judge accuracy (writes result_judge_<data>_<model>_<file>.jsonl)
python eval/evaluate_reasoning.py --data locomo --model gemini --file myrun --allfile

The LLM judge (eval/judge.py) grades a prediction CORRECT/WRONG against the gold answer with gpt-4o-mini, using lenient matching (topic overlap; date equivalence for temporal questions).

7. Notes

The pipeline is cache-based: delete the corresponding rewrite / keyword / embedding files to force regeneration of a sample.
Per-sample reasoning traces are logged under log/<dataset>/.
Tool inventory (7 tools): edges_by_tag, query_conversation_time, query_event_keywords, query_event_context, query_personal_information, query_personal_aspect, query_topic_events.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MRAgent

1. Repository Structure

2. Installation

3. Configuration

4. Data Layout

5. Usage

5.1 Arguments

5.2 LoCoMo

5.3 LongMemEval (LM)

6. Evaluation

7. Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
agent		agent
common		common
data		data
eval		eval
llm		llm
memory		memory
prompts		prompts
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
run.py		run.py

Folders and files

Latest commit

History

Repository files navigation

MRAgent

1. Repository Structure

2. Installation

3. Configuration

4. Data Layout

5. Usage

5.1 Arguments

5.2 LoCoMo

5.3 LongMemEval (LM)

6. Evaluation

7. Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages