- 💾 Installation
- 🚀 Quick Start
- 📁 Benchmark Session
- 📊 Metrics
- 🤖 Adding a New Agent
- 📤 Leaderboard Submission
- 📚 Tasks
- 📝 Citation
- 📧 Contact
- 2026-06-21: 🏆 Updated MemGUI-Bench results for recently released frontier models, including Kimi-K2.6, Gemini-3.1-Pro-Preview, and Seed-2.0-Pro. Kimi-K2.6 sets a new SOTA on the leaderboard.
- 2026-06-19: 🚀 MemGUI-Agent is released, bringing memory-augmented mobile GUI agents to long-horizon phone tasks.
- 2026-06-16: 📣 Preview: MemGUI-Agent shows promising results on long-horizon GUI agent tasks. The leaderboard has been updated with evaluation results and trajectory previews. Paper is coming!
- 2026-06-11: 🚀 Refactoring MemGUI-Bench to a MobileWorld-style runtime and trajectory viewer. We will release more frontier model evaluation results on MemGUI-Bench soon!
- 2026-02-15: 🎉 MemGUI-Bench adopted by Mobile-Agent-v3.5! Congrats to the Tongyi Lab team for achieving 27.1% on Easy tasks with GUI-Owl-1.5-32B. We welcome more agents to challenge the full benchmark! 🚀
- 2026-02-09: 🗂️ Benchmark tasks now available on HuggingFace: lgy0404/MemGUI-Bench
- 2026-02-09: 📄 Paper released on arXiv! Check out our paper: arXiv:2602.06075
- 2026-02-03: Initial release of MemGUI-Bench benchmark. Check out our website.
- Linux host with Docker and KVM acceleration
- Permission to run privileged Docker containers
- Python 3.12 and uv on the host
The default Docker runtime image already includes the Android SDK, ADB, emulator binaries, MemGUI-AVD snapshot, and MobileWorld-compatible MemGUI-Bench runtime. Users do not need to install Android Studio, download AVD snapshots, build a local runtime image, or configure emulator paths.
# Install dependencies with uv
uv sync
# Create local .env from the example
uv run mg env init
uv run mg env init creates .env from .env.example. If you prefer to create
the environment file manually:
cp .env.example .env
Edit the .env file and configure the following parameters.
Required for Agent Evaluation:
- BASE_URL: OpenAI-compatible base URL for the agent model
- API_KEY: API key for the agent model
Required for MemGUI-Eval:
- MEMGUI_API_KEY: API key for MemGUI-Eval
- MEMGUI_STEP_DESC_MODEL: Step-description model
- MEMGUI_STEP_DESC_BASE_URL: Optional step-description endpoint; leave empty to use BASE_URL
- MEMGUI_FINAL_DECISION_MODEL: Final-decision model
- MEMGUI_FINAL_DECISION_BASE_URL: Optional final-decision endpoint; leave empty to use BASE_URL
Example .env file:
# Agent model configuration
BASE_URL=https://openrouter.fans/v1
API_KEY=YOUR_API_KEY_HERE
# MemGUI-Eval configuration
MEMGUI_API_KEY=YOUR_API_KEY_HERE
# Step description model
MEMGUI_STEP_DESC_MODEL=google/gemini-2.5-flash
MEMGUI_STEP_DESC_BASE_URL=
# Final decision model
MEMGUI_FINAL_DECISION_MODEL=google/gemini-2.5-pro
MEMGUI_FINAL_DECISION_BASE_URL=
For leaderboard submissions, we use MEMGUI_STEP_DESC_MODEL=google/gemini-2.5-flash
and MEMGUI_FINAL_DECISION_MODEL=google/gemini-2.5-pro to keep evaluation
fair across submissions. During debugging, you may use other compatible models
to reduce cost or latency.
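The configuration rules above (required keys, and optional endpoints that fall back to BASE_URL when left empty) can be sketched in a few lines of Python. This is an illustrative sketch only, not the code mg uses: parse_env and resolve_eval_config are hypothetical helper names, and the parser handles only plain KEY=VALUE lines.

```python
# Hypothetical helpers; they mirror the rules described in this README,
# not the actual MemGUI-Bench implementation.
REQUIRED_KEYS = (
    "BASE_URL",
    "API_KEY",
    "MEMGUI_API_KEY",
    "MEMGUI_STEP_DESC_MODEL",
    "MEMGUI_FINAL_DECISION_MODEL",
)
OPTIONAL_ENDPOINTS = (
    "MEMGUI_STEP_DESC_BASE_URL",
    "MEMGUI_FINAL_DECISION_BASE_URL",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve_eval_config(env: dict[str, str]) -> dict[str, str]:
    """Reject missing required keys, then fill empty endpoints from BASE_URL."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError(f"missing required keys: {', '.join(missing)}")
    resolved = dict(env)
    for key in OPTIONAL_ENDPOINTS:
        resolved[key] = env.get(key) or env["BASE_URL"]
    return resolved
```

Running a check like this before launching containers surfaces an empty API key early instead of during the first model call.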
Note:
- mg env run mounts local .env into each container.
- mg eval runs on the host and writes trajectories directly into local traj_logs/.
sudo uv run mg env check
sudo uv run mg env run --count 2
This launches 2 ready MemGUI backend containers with:
- --count 2: Number of parallel containers
- --launch-interval 30: Default wait time between container launches
- --emulator-timeout 1200: Default timeout for MemGUI AVD cold start
Each backend runs one Android emulator. Backend ports start at
http://localhost:6800, viewer ports start at http://localhost:7860, ADB
ports start at 5556. Trajectory logs are written by the host-side mg eval
process into local traj_logs/.
For a larger run, launch more containers and match mg eval --max-concurrency
to the number of healthy backends, for example --count 4 --max-concurrency 4.
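If you prefer to pass backends explicitly rather than rely on auto-discovery, the value for --aw-host can be built from the container count. The sketch below assumes backend ports are consecutive starting at 6800, which matches the two-backend example elsewhere in this README but is not otherwise guaranteed; backend_urls and aw_host_argument are hypothetical helper names.

```python
def backend_urls(count: int, host: str = "localhost", first_port: int = 6800) -> list[str]:
    """List backend URLs, assuming one consecutive port per container."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return [f"http://{host}:{first_port + index}" for index in range(count)]


def aw_host_argument(count: int) -> str:
    """Join backend URLs into the comma-separated form --aw-host expects."""
    return ",".join(backend_urls(count))


if __name__ == "__main__":
    # Value to paste after --aw-host for a four-container run.
    print(aw_host_argument(4))
```

Check the real port assignments with mg env list before relying on the generated value.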
Optional: if your network requires an outbound proxy, export it before launching
containers. mg env run forwards these variables to both the container runtime
and the Android emulator:
export http_proxy=http://proxy.example.com:8080
export https_proxy=http://proxy.example.com:8080
export no_proxy='localhost,127.0.0.1,localaddress,localdomain.com,internal,.corp.example.com,.staging.example.com,0,1,2,3,4,5,6,7,8,9'
sudo -E uv run mg env run --count 2
You can also pass --http-proxy, --https-proxy, and --no-proxy directly to
mg env run if your sudo configuration does not preserve environment
variables.
sudo uv run mg eval \
  --agent-type qwen3vl \
  --model-name qwen3-vl-8b \
  --task ALL \
  --log-file-root traj_logs/memgui-qwen3vl \
  --max-concurrency 2
mg eval --max-concurrency 2 discovers two MemGUI backend containers and feeds
the selected tasks through MobileWorld's environment queue. Each backend runs
exactly one Android emulator and writes MobileWorld-format trajectories into
the local traj_logs/ directory.
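The scheduling idea described above, where each backend handles one task at a time and idle backends pick up the next queued task, can be illustrated with asyncio. This is a conceptual sketch and not MobileWorld's environment queue; run_tasks and the run_one callback are hypothetical names.

```python
import asyncio


async def run_tasks(tasks, backends, run_one):
    """Run every task while letting each backend serve one task at a time."""
    free = asyncio.Queue()
    for backend in backends:
        free.put_nowait(backend)

    async def worker(task):
        backend = await free.get()  # wait until some backend is idle
        try:
            result = await run_one(task, backend)
            return task, backend, result
        finally:
            free.put_nowait(backend)  # return the backend to the pool

    # gather preserves the order of the input tasks in its result list
    return await asyncio.gather(*(worker(task) for task in tasks))
```

With two backends and five tasks, at most two tasks are in flight at any moment, which is the behavior --max-concurrency 2 is meant to give you.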
uv run mg logs view --log-dir traj_logs/memgui-qwen3vl
The viewer opens a local web UI with task-level status, screenshots, action
traces, model predictions, and result.txt scores in the MobileWorld layout.
For a single-container debug shell:
sudo uv run mg env exec 0
uv run mg eval \
  --agent-type qwen3vl \
  --model-name qwen3-vl-8b \
  --task 001-FindProductAndFilter \
  --aw-host http://localhost:6800 \
  --log-file-root traj_logs/debug

| Command | Description |
|---|---|
| `sudo uv run mg env check` | Check Docker/KVM/.env and pull the default prebuilt runtime image |
| `sudo uv run mg env build` | Optional: build a local MobileWorld-compatible runtime image from the MemGUI base image |
| `sudo uv run mg env run` | Launch backend container(s) with local .env mounted |
| `sudo uv run mg env list` | List MemGUI-Bench containers |
| `sudo uv run mg env exec` | Open a shell or run a command in a container for debugging |
| `sudo uv run mg env rm` | Remove MemGUI-Bench containers |
| `uv run mg env init` | Create .env from .env.example |
| `uv run mg server` | Run the backend service inside a container; normally started by mg env run |
| `sudo uv run mg eval` | Run execution/evaluation across MemGUI containers |
| `uv run mg info task` | List or filter benchmark tasks |
| `uv run mg info agent` | List configured agents |
| `uv run mg info app` | Show app-level task counts |
| `uv run mg logs view` | Launch the interactive trajectory viewer |
| `uv run mg logs results` | Print the same compact MemGUI progress and summary metrics as logs view (Evaluating, P@k, IRR, MTPR, FRR) |
| `uv run mg logs export` | Export a static HTML trajectory site |
| Argument | Default | Description |
|---|---|---|
| `--agent-type` | required | Registered MobileWorld agent name or custom agent path |
| `--model-name` | .env/agent default | Agent model name |
| `--llm-base-url` | .env/agent default | OpenAI-compatible base URL |
| `--api-key` | API_KEY | Agent API key |
| `--task` / `--tasks` | all when omitted | Task id(s), comma-separated, or ALL |
| `--task-file` / `--task-csv` | none | MemGUI CSV subset to run, e.g. data/memgui-tasks-40.csv |
| `--difficulty` / `--task-difficulty` | none | MemGUI difficulty filter: easy/medium/hard, 1/2/3, or 简单/中等/困难; comma-separated values are supported |
| `--pass-at-k` / `--attempts` | 1 | Run each MemGUI task until one attempt succeeds or K attempts are exhausted, then aggregate pass@K |
| `--suite-family` | memgui_bench | Benchmark suite family |
| `--log-file-root` | ./traj_logs | Local root for MobileWorld trajectory logs |
| `--aw-host` | auto | Comma-separated backend URL(s); auto-discovered when omitted |
| `--max-round` / `--max-step` | MemGUI task budget | Maximum agent steps per task; when omitted, uses int(golden_steps * 2.5 + 1); -1 means unlimited |
| `--step-wait-time` | 3.0 | Seconds to wait after each action before the next screenshot for MemGUI-Bench |
| `--timeout` | none | Optional per-task timeout in seconds; timed-out tasks are recorded as failed and the run continues |
| `--max-concurrency` | number of containers | Maximum concurrent tasks |
| `--llm-max-concurrency` | MEMGUI_LLM_MAX_CONCURRENCY or 2 | Maximum concurrent LLM API calls across running tasks |
| `--llm-rate-limit-retries` | MEMGUI_LLM_RATE_LIMIT_RETRIES or 20 | Retries for transient LLM API failures such as 429, 5xx, timeout, or connection errors |
| `--llm-rate-limit-max-wait` | MEMGUI_LLM_RATE_LIMIT_MAX_WAIT or 120 | Maximum backoff wait in seconds for transient LLM API failures |
| `--llm-infra-retries` | MEMGUI_LLM_INFRA_RETRIES or 3 | Infra-only reruns for the same pass@k attempt before marking the task as no-result; these reruns do not consume pass@k attempts |
| `--shuffle-tasks` | false | Shuffle task order before scheduling |
| `--dry-run` | false | Resolve tasks/backends without execution |
Transient API failures and device recovery failures are treated as infrastructure
failures, not model failures. If they exceed the retry budget, MemGUI-Bench writes
an _infra_failures/ record and leaves the task as no-result for resume.
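To make the retry settings more concrete, here is a sketch of how a capped backoff schedule and a transient-status check might look. The README documents only the retry count and the maximum wait; the doubling schedule, the base delay of one second, and both function names are assumptions made for illustration.

```python
def is_transient_status(status: int) -> bool:
    """Treat HTTP 429 and any 5xx response as a retryable failure."""
    return status == 429 or 500 <= status <= 599


def backoff_waits(retries: int = 20, max_wait: float = 120.0, base: float = 1.0) -> list[float]:
    """Waits before each retry: doubling from base, capped at max_wait (assumed schedule)."""
    return [min(max_wait, base * 2**attempt) for attempt in range(retries)]
```

Under these assumptions the wait reaches the 120-second cap on the eighth retry and stays there for the remaining ones.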
# Full benchmark (execution + evaluation)
uv run mg eval --agent-type qwen3vl --model-name qwen3-vl-8b --task ALL --log-file-root traj_logs/qwen3vl-full
# Run specific task
uv run mg eval --agent-type qwen3vl --model-name qwen3-vl-8b --task 001-FindProductAndFilter --log-file-root traj_logs/debug
# Run the 40-task subset
uv run mg eval --agent-type qwen3vl --model-name qwen3-vl-8b --task-file data/memgui-tasks-40.csv --log-file-root traj_logs/qwen3vl-40
# Run only hard MemGUI tasks
uv run mg eval --agent-type qwen3vl --model-name qwen3-vl-8b --difficulty hard --log-file-root traj_logs/qwen3vl-hard
# Run medium + hard tasks from the 40-task subset
uv run mg eval --agent-type qwen3vl --model-name qwen3-vl-8b --task-file data/memgui-tasks-40.csv --difficulty medium,hard --log-file-root traj_logs/qwen3vl-40-medium-hard
# Run pass@3 on the 40-task subset
uv run mg eval --agent-type qwen3vl --model-name qwen3-vl-8b --task-file data/memgui-tasks-40.csv --pass-at-k 3 --log-file-root traj_logs/qwen3vl-40-pass3
# Use explicit backends
uv run mg eval --agent-type qwen3vl --task ALL --aw-host http://localhost:6800,http://localhost:6801
# Limit concurrency
uv run mg eval --agent-type qwen3vl --task ALL --max-concurrency 2
# Dry run
uv run mg eval --agent-type qwen3vl --task 001-FindProductAndFilter --dry-run
# Interactive web viewer
uv run mg logs view --log-dir traj_logs/qwen3vl-full --port 8760
# Terminal summary
uv run mg logs results traj_logs/qwen3vl-full
# Static HTML export for sharing or archiving
uv run mg logs export \
  --log-dir traj_logs/qwen3vl-full \
  --output exported-sites/qwen3vl-full
For pass@K runs, the task detail page includes attempt tabs. Attempt 1 is stored
in the canonical task folder; later attempts are stored under _attempt_trajs/
and can be opened from the same viewer page.
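If you post-process pass@K runs yourself, the storage rule above translates into a simple path lookup. This sketch assumes the attempt_<n> folder naming shown in the output directory structure in this README; attempt_dir is a hypothetical helper, not part of the mg CLI.

```python
from pathlib import Path


def attempt_dir(log_root: str, task_id: str, attempt: int) -> Path:
    """Locate the trajectory folder for a given attempt of a task."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    root = Path(log_root)
    if attempt == 1:
        return root / task_id  # canonical task folder
    return root / "_attempt_trajs" / task_id / f"attempt_{attempt}"
```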
Each run creates an isolated benchmark folder under local traj_logs/. The
host-side mg eval process writes these files directly, so they are not trapped
inside Docker containers.
- Each task has a MobileWorld traj.json, screenshots, marked screenshots, and result.txt
- Re-running the same log root skips tasks that already succeeded; pass@K runs also skip tasks that already have a completed pass@K aggregate result
- MemGUI-Eval receives a generated compatibility workspace under _memgui_eval/
Output directory structure:
traj_logs/qwen3vl-full/
├── metadata.json
├── 001-FindProductAndFilter/
│ ├── traj.json
│ ├── result.txt
│ ├── thread_<id>.log
│ ├── screenshots/
│ │ └── 001-FindProductAndFilter-0-1.png
│ └── marked_screenshots/
│ └── marked-001-FindProductAndFilter-0-1.png
├── _attempt_trajs/
│ └── 001-FindProductAndFilter/
│ └── attempt_2/
│ ├── traj.json
│ ├── result.txt
│ └── screenshots/
└── _memgui_eval/
├── results.csv
└── 001-FindProductAndFilter/
└── qwen3vl/
└── attempt_1/
├── log.json
├── 0.png, 1.png, ...
├── final_decision.json
└── evaluation_summary.json
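Given this layout, a quick way to see which tasks have not produced a score yet is to look for task folders without a result.txt. The sketch below only checks for the file's presence; it does not read the score, and it is not the resume logic mg eval itself uses. tasks_without_result is a hypothetical helper name.

```python
from pathlib import Path


def tasks_without_result(log_root: str) -> list[str]:
    """Return task folder names under log_root that have no result.txt yet."""
    pending = []
    for entry in sorted(Path(log_root).iterdir()):
        # Skip plain files (metadata.json) and internal folders such as
        # _attempt_trajs/ and _memgui_eval/.
        if not entry.is_dir() or entry.name.startswith("_"):
            continue
        if not (entry / "result.txt").is_file():
            pending.append(entry.name)
    return pending
```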
The benchmark automatically computes:
| Metric | Description |
|---|---|
| Pass@K | Success rate within K attempts |
| IRR | Information Retrieval Rate (memory accuracy) |
| FRR | Failure Recovery Rate (learning from errors) |
| MTPR | Memory Task Performance Ratio |
| Step Ratio | Agent steps / Golden steps |
| Time/Step | Average execution time per step |
| Cost/Step | API cost per step (if applicable) |
MemGUI-Eval details are saved under _memgui_eval/; MobileWorld-facing scores are
written to each task's result.txt.
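Two of the metrics in the table follow directly from their descriptions and can be sketched in Python. This is an illustration of the definitions as worded above, not the benchmark's scoring code; IRR, FRR, and MTPR are left out because their formulas are not given in this README. The function names and the input shape (a mapping from task id to per-attempt outcomes) are assumptions.

```python
def pass_at_k(attempt_outcomes: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks with at least one success among their first k attempts."""
    if not attempt_outcomes:
        raise ValueError("no tasks to score")
    solved = sum(any(outcomes[:k]) for outcomes in attempt_outcomes.values())
    return solved / len(attempt_outcomes)


def step_ratio(agent_steps: int, golden_steps: int) -> float:
    """Agent steps divided by golden steps for a single task."""
    if golden_steps <= 0:
        raise ValueError("golden_steps must be positive")
    return agent_steps / golden_steps
```

A step ratio of 1.5 means the agent took 50% more steps than the reference solution.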
MemGUI-Bench now uses MobileWorld's agent interface. Add or reuse an agent under
src/mobile_world/agents/implementations/, then register it in
src/mobile_world/agents/registry.py.
Agents receive MobileWorld observations and return a prediction string plus a
JSONAction. Android action execution, screenshots, trajectory logging, and
parallel scheduling are handled by the shared MobileWorld runtime.
After running the benchmark:
Create or update a metadata JSON under docs/data/agents/:
{
"name": "YourAgent",
"backbone": "GPT-4V",
"type": "Agentic Workflow",
"institution": "Your Institution",
"date": "2026-02-03",
"paperLink": "https://arxiv.org/...",
"codeLink": "https://github.com/...",
"trajFile": "trajs/your-agent-name.json.gz",
"hasUITree": true,
"hasLongTermMemory": false
}
Submit via Pull Request to lgy0404/MemGUI-Bench → docs/data/agents/
Use trajFile only when you also submit the matching trajectory preview pair.
Generate the static trajectory preview bundle and submit the two output files via PR to lgy0404/memgui-bench-trajs:
# Generate the preview bundle from your local run
python3 docs/bundle_trajs.py traj_logs/memgui-run-name \
-o docs/trajs/your-agent-name.json.gz \
--with-screenshots
# This creates:
# docs/trajs/your-agent-name.json.gz
# docs/trajs/your-agent-name.mp4
# Upload via HuggingFace Web UI:
# 1. Go to https://huggingface.co/datasets/lgy0404/memgui-bench-trajs
# 2. Click "Community" → "New Pull Request" → "Upload files"
# 3. Upload both files to site/trajs/ and submit the PR
Use the same lowercase hyphenated your-agent-name as your
docs/data/agents/your-agent-name.json file. Maintainers will review the pair
and update the public trajectory manifest after acceptance.
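Before opening the pull requests, it can help to check that the names line up. The sketch below verifies that the metadata file name is lowercase and hyphenated and that trajFile, when present, points at the matching trajs/<name>.json.gz file, as in the example metadata above. check_submission is a hypothetical helper, and allowing digits in the name is an assumption.

```python
import json
import re

# Lowercase words (digits allowed by assumption) joined by single hyphens.
SLUG = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def check_submission(metadata_filename: str, metadata_json: str) -> list[str]:
    """Return a list of naming problems; an empty list means the names agree."""
    problems = []
    stem = metadata_filename.removesuffix(".json")
    if not SLUG.fullmatch(stem):
        problems.append(f"'{stem}' is not lowercase and hyphenated")
    traj_file = json.loads(metadata_json).get("trajFile")
    expected = f"trajs/{stem}.json.gz"
    if traj_file is not None and traj_file != expected:
        problems.append(f"trajFile should be {expected}, got {traj_file}")
    return problems
```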
See submission guide for details.
| File | Tasks | Description |
|---|---|---|
| `memgui-tasks-all.csv` | 128 | Full benchmark |
| `memgui-tasks-40.csv` | 40 | Subset for quick testing |
Task fields:
- task_identifier
- task_description
- task_app
- num_apps
- requires_ui_memory
- task_difficulty
- golden_steps
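The task CSVs can be inspected with Python's standard csv module. The sketch below filters tasks by difficulty using the field names listed above. The two sample rows are made up for illustration, and the actual encoding of task_difficulty in the shipped files (for example easy/medium/hard versus 1/2/3) is an assumption you should check against the real CSV.

```python
import csv
import io

# Illustrative rows only; they are not taken from the real task files.
SAMPLE_CSV = """task_identifier,task_description,task_app,num_apps,requires_ui_memory,task_difficulty,golden_steps
001-FindProductAndFilter,placeholder description,ExampleApp,1,Y,easy,12
900-HypotheticalTask,placeholder description,ExampleApp,2,Y,hard,30
"""


def select_tasks(csv_text: str, difficulties: list[str]) -> list[str]:
    """Return task_identifier values whose task_difficulty is in difficulties."""
    wanted = {value.strip().lower() for value in difficulties}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["task_identifier"]
        for row in reader
        if row["task_difficulty"].strip().lower() in wanted
    ]
```

To run it against a real file, read the file's text first, for example with pathlib.Path("data/memgui-tasks-40.csv").read_text(encoding="utf-8").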
@article{liu2026memgui,
title={MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments},
author={Liu, Guangyi and Zhao, Pengxiang and Liang, Yaozhen and Luo, Qinyi and Tang, Shunye and Chai, Yuxiang and Lin, Weifeng and Xiao, Han and Wang, WenHao and Chen, Siheng and others},
journal={arXiv preprint arXiv:2602.06075},
year={2026}
}
For questions, issues, or collaborations, please contact: guangyiliu@zju.edu.cn
If you find MemGUI-Bench helpful, please consider giving us a star ⭐!


