KnowU-Bench is an online, interactive benchmark for evaluating personalized and proactive mobile agents in reproducible Android environments.
Mobile GUI agents have made rapid progress on explicit task execution, yet a deeper challenge remains: can an agent act on your behalf as if it truly understands you? KnowU-Bench is designed to measure exactly this. It goes beyond standard GUI benchmarks by evaluating three capabilities that existing work leaves unaddressed: inferring user preferences from behavioral history, eliciting missing preferences through multi-turn interaction, and deciding when to intervene, seek consent, or remain silent in proactive settings.
Key design principles:
- Hidden profiles, exposed logs. The user profile is kept hidden from the agent; only timestamped behavioral logs are provided. This forces genuine preference inference rather than context lookup.
- Online user simulator. An LLM-driven user simulator grounded in structured personas supports multi-turn clarification dialogues and proactive consent handling, enabling realistic agent-user interaction.
- Full proactive decision chain. Tasks require agents to decide whether to act, seek confirmation, or remain silent, and to respect user rejection, under programmatic verification and LLM-as-Judge scoring.
Main findings from our paper: Agents that excel at explicit GUI execution degrade substantially once success depends on knowing the user or deciding whether to act at all. Personalized failures are dominated by weak preference acquisition, and proactive failures by miscalibrated intervention, revealing a fundamental gap between competent interface operation and trustworthy personal assistance.
- [2026-04-07] Code for KnowU-Bench is released.
| Item | Value |
|---|---|
| Benchmark name | KnowU-Bench |
| App coverage | 23 apps at benchmark scope |
| Registered tasks in current checkout | 192 |
| Task families | 42 general, 86 personalized, 64 proactive |
| Agent-user interaction tasks | 94 tasks tagged agent-user-interaction |
| User profiles | developer, grandma, student, user |
| Built-in agents | 9 |
The current Python task registry directly references 17 app identifiers in this checkout. Evaluation combines textual answer verification, backend database checks, local storage inspection, application callbacks, and hybrid evaluation flows for personalized tasks.
General tasks evaluate direct end-to-end execution from natural language instructions.
Examples in the current codebase:
- `BirthdayWishGeneralTask`
- `BuyComputerGeneralTask`
- `CommuteLateWithNoticeGeneralTask`
- `SearchTopInfoGeneralTask`
Source directory: `src/mobile_world/tasks/definitions/general`
Personalized tasks test whether the agent can infer user preferences from profile fields, historical logs, and clarifying interaction. These tasks often require confirmation, comparison, or habit-sensitive decisions.
Examples in the current codebase:
- `OrderLunchTradeoffTask@user`
- `BuyColaPreferenceTask@developer`
- `ShareFavoritePhotosPreferenceAskUserTask@student`
- `CalendarInviteConflictResolutionTask@user`
Source directory: `src/mobile_world/tasks/definitions/preference`
Proactive tasks evaluate behavior grounded in recurring user habits. The agent must decide whether it should act, ask, wait, or stay silent based on the user profile and logs.
Examples in the current codebase:
- `WeekendSleeperTask@student`
- `MorningPaperReadingTask@user`
- `BatterySaverRoutineTask@developer`
- `WeeklyReportRoutineTask@grandma`
Source directory: `src/mobile_world/tasks/definitions/routine`
- Linux host with Docker
- KVM acceleration for the Android emulator
- Python 3.12
- uv
If your Docker setup requires root permissions, prepend `sudo` to the `mw env ...` commands below.
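For example, on a host where Docker is root-only, the environment commands from the sections below would be run as:

```shell
# Same commands as below, with sudo prepended for a root-only Docker setup
sudo uv run mw env check
sudo uv run mw env run --count 4 --launch-interval 15
```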
```shell
git clone https://github.com/ZJU-REAL/KnowU-Bench.git
cd KnowU-Bench
uv sync
cp .env.example .env
```

Update `.env` with the credentials you actually need:
- `API_KEY`: model API key for the mobile agent
- `USER_AGENT_API_KEY`, `USER_AGENT_BASE_URL`, `USER_AGENT_MODEL`: user-agent configuration for interaction tasks
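A minimal `.env` sketch using these variable names; every value below is a placeholder, not a default shipped with the repo:

```shell
# .env — placeholder values; substitute your own credentials and endpoints
API_KEY=your-mobile-agent-api-key
USER_AGENT_API_KEY=your-user-simulator-api-key
USER_AGENT_BASE_URL=https://your-openai-compatible-endpoint/v1
USER_AGENT_MODEL=your-user-simulator-model
```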
The default environment image in code is `ghcr.io/zju-real/knowu_bench:latest`.
```shell
uv run mw env check
```

This verifies Docker, KVM, `.env`, and default image status.
```shell
uv run mw env run --count 4 --launch-interval 15
```

This starts four benchmark containers and exposes backend ports that `mw eval` can auto-discover.
```shell
uv run mw info task --no-pager
uv run mw info agent
uv run mw info app
```

Useful variants:

```shell
uv run mw info task --name WeekendSleeperTask@student
uv run mw info task --filter lunch
uv run mw info task --export-excel artifacts/tasks.xlsx
```

The CLI still uses the code-level tags `general`, `preference`, and `routine`.
```shell
uv run mw eval \
  --agent-type qwen3.5 \
  --task ALL \
  --task-tags routine,preference,general \
  --model-name your-model-name \
  --llm-base-url https://your-openai-compatible-endpoint/v1 \
  --api-key "$API_KEY" \
  --max-round 50 \
  --max-concurrency 4 \
  --step-wait-time 3 \
  --log-file-root traj_logs/my_run \
  --enable-user-interaction
```

Important notes:

- Add `--enable-user-interaction` when you want tasks that may ask or respond to the user.
- Use `--user student` or another profile name to restrict evaluation to one persona.
- Use `--user-log-mode rag` and `--rag-backend embedding` to inject only top-k relevant user-log snippets.
- Use `--user-log-source noise` to evaluate robustness against noisy user histories.
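Combining those flags, a persona-restricted run with RAG-filtered logs might look like the following sketch (the model/endpoint flags shown in the full command above are omitted here for brevity but would still be required):

```shell
# Sketch: evaluate only the student persona with top-k RAG log injection
uv run mw eval \
  --agent-type qwen3.5 \
  --task ALL \
  --user student \
  --user-log-mode rag \
  --rag-backend embedding \
  --enable-user-interaction \
  --log-file-root traj_logs/student_rag_run
```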
```shell
uv run mw logs results traj_logs/my_run
uv run mw logs view --log-dir traj_logs/my_run
uv run mw logs export --log-dir traj_logs/my_run -o exports/my_run
```

The log viewer gives you per-task trajectories, screenshots, actions, scores, and aggregate summaries.
- `mw env check`: check Docker/KVM prerequisites and image status
- `mw env run`: launch one or more benchmark containers
- `mw env list`: list active containers
- `mw eval`: run benchmark evaluation
- `mw test`: run a single task for debugging
- `mw device`: open the live Android device viewer
- `mw logs view`: launch the interactive web log viewer
- `mw info task/agent/app`: explore benchmark inventory
The current registry exposes these agent types:
`gelab_agent`, `general_e2e`, `gui_owl_1_5`, `mai_ui_agent`, `planner_executor`, `qwen3.5`, `qwen3vl`, `seed_agent`, `ui_venus_agent`
You can also pass a custom Python file path to `--agent-type` as long as it defines a class derived from `BaseAgent`.
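As a sketch of what such a file might contain: the stand-in `BaseAgent` below is defined locally purely to keep the example self-contained; the real base class lives under `src/mobile_world/agents/` and its actual interface (constructor arguments, method names) may differ.

```python
# my_agent.py — illustrative sketch only; the real BaseAgent interface may differ.

class BaseAgent:
    """Stand-in for the benchmark's BaseAgent, defined here for illustration."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def step(self, observation: str) -> str:
        """Return the next action given the current observation."""
        raise NotImplementedError


class MyCustomAgent(BaseAgent):
    """The file passed to --agent-type must define a class derived from BaseAgent."""

    def step(self, observation: str) -> str:
        # Trivial placeholder policy: always report a no-op with some context.
        return f"noop (model={self.model_name}, obs_len={len(observation)})"
```

You would then point the evaluator at the file, e.g. `--agent-type path/to/my_agent.py`.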
- `src/mobile_world/tasks/definitions/`: Benchmark task definitions
- `src/mobile_world/user_profile/`: Structured user personas
- `src/mobile_world/user_logs/`: Clean and noisy user histories
- `src/mobile_world/agents/implementations/`: Built-in agent baselines
- `src/mobile_world/runtime/`: Env client, controller, and app helpers
- `src/mobile_world/core/`: CLI, orchestration, server, log viewer
- `scripts/`: Evaluation runners and metric calculators
- `docs/`: Setup and development guides
- `site/`: Website and leaderboard assets
- `assets/`: Project figures used in the repository
For development workflows, container restart behavior, VNC debugging, and source mounting, see the guides under `docs/`.
A common dev workflow is:
```shell
uv run mw env run --dev --vnc
uv run mw env restart knowu_bench_env_0_dev
uv run mw env exec knowu_bench_env_0_dev
```

The `scripts/` directory also contains batch runners and analysis helpers such as `run_eval.sh`, `run_gpt_e2e.sh`, `calc_paper_metrics.py`, and `calc_pref_routine_accuracy.py`.
If you find this project useful, please consider citing us.
```bibtex
@misc{chen2026knowubenchinteractiveproactivepersonalized,
      title={KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation},
      author={Tongbo Chen and Zhengxi Lu and Zhan Xu and Guocheng Shao and Shaohan Zhao and Fei Tang and Yong Du and Kaitao Song and Yizhou Liu and Yuchen Yan and Wenqi Zhang and Xu Tan and Weiming Lu and Jun Xiao and Yueting Zhuang and Yongliang Shen},
      year={2026},
      eprint={2604.08455},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.08455},
}
```
This project is released under the Apache-2.0 License. See LICENSE for details.

