KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

KnowU-Bench is an online, interactive benchmark for evaluating personalized and proactive mobile agents in reproducible Android environments.

Overview

Mobile GUI agents have made rapid progress on explicit task execution, yet a deeper challenge remains: can an agent act on your behalf as if it truly understands you? KnowU-Bench is designed to measure exactly this. It goes beyond standard GUI benchmarks by evaluating three capabilities that existing work leaves unaddressed — inferring user preferences from behavioral history, eliciting missing preferences through multi-turn interaction, and deciding when to intervene, seek consent, or remain silent in proactive settings.

Key design principles:

Hidden profiles, exposed logs. The user profile is kept hidden from the agent; only timestamped behavioral logs are provided. This forces genuine preference inference rather than context lookup.
Online user simulator. An LLM-driven user simulator grounded in structured personas supports multi-turn clarification dialogues and proactive consent handling, enabling realistic agent-user interaction.
Full proactive decision chain. Tasks require agents to decide whether to act, seek confirmation, or remain silent — and to respect user rejection — under programmatic verification and LLM-as-Judge scoring.

Main findings from our paper: Agents that excel at explicit GUI execution degrade substantially once success depends on knowing the user or deciding whether to act at all. Personalized failures are dominated by weak preference acquisition, and proactive failures by miscalibrated intervention, revealing a fundamental gap between competent interface operation and trustworthy personal assistance.

📰 News

[2026-04-07] Code for KnowU-Bench is released.

📊 Benchmark Snapshot

Item	Value
Benchmark name	KnowU-Bench
App coverage	23 apps at benchmark scope
Registered tasks in current checkout	192
Task families	42 general, 86 personalized, 64 proactive
Agent-user interaction tasks	94 tasks tagged `agent-user-interaction`
User profiles	`developer`, `grandma`, `student`, `user`
Built-in agents	9

The current Python task registry directly references 17 app identifiers in this checkout. Evaluation combines textual answer verification, backend database checks, local storage inspection, application callbacks, and hybrid evaluation flows for personalized tasks.

🧩 Benchmark Structure

General tasks

General tasks evaluate direct end-to-end execution from natural language instructions.

Examples in the current codebase:

BirthdayWishGeneralTask
BuyComputerGeneralTask
CommuteLateWithNoticeGeneralTask
SearchTopInfoGeneralTask

Source directory: src/mobile_world/tasks/definitions/general

Personalized tasks

Personalized tasks test whether the agent can infer user preferences from profile fields, historical logs, and clarifying interaction. These tasks often require confirmation, comparison, or habit-sensitive decisions.

Examples in the current codebase:

OrderLunchTradeoffTask@user
BuyColaPreferenceTask@developer
ShareFavoritePhotosPreferenceAskUserTask@student
CalendarInviteConflictResolutionTask@user

Source directory: src/mobile_world/tasks/definitions/preference

Proactive tasks

Proactive tasks evaluate behavior grounded in recurring user habits. The agent must decide whether it should act, ask, wait, or stay silent based on the user profile and logs.

Examples in the current codebase:

WeekendSleeperTask@student
MorningPaperReadingTask@user
BatterySaverRoutineTask@developer
WeeklyReportRoutineTask@grandma

Source directory: src/mobile_world/tasks/definitions/routine

🚀 Installation

Requirements

Linux host with Docker
KVM acceleration for the Android emulator
Python 3.12
uv

If your Docker setup requires root permissions, prepend sudo to the mw env ... commands below.

Setup

git clone https://github.com/ZJU-REAL/KnowU-Bench.git
cd KnowU-Bench
uv sync
cp .env.example .env

Update .env with the credentials you actually need:

API_KEY: model API key for the mobile agent
USER_AGENT_API_KEY, USER_AGENT_BASE_URL, USER_AGENT_MODEL: user-agent configuration for interaction tasks

The default environment image in code is ghcr.io/zju-real/knowu_bench:latest.

⚡ Quick Start

1. Check host prerequisites

uv run mw env check

This verifies Docker, KVM, .env, and default image status.

2. Launch benchmark environments

uv run mw env run --count 4 --launch-interval 15

This starts four benchmark containers and exposes backend ports that mw eval can auto-discover.

3. Inspect tasks, agents, and apps

uv run mw info task --no-pager
uv run mw info agent
uv run mw info app

Useful variants:

uv run mw info task --name WeekendSleeperTask@student
uv run mw info task --filter lunch
uv run mw info task --export-excel artifacts/tasks.xlsx

4. Run an evaluation

The CLI still uses the code-level tags general, preference, and routine.

uv run mw eval \
  --agent-type qwen3.5 \
  --task ALL \
  --task-tags routine,preference,general \
  --model-name your-model-name \
  --llm-base-url https://your-openai-compatible-endpoint/v1 \
  --api-key "$API_KEY" \
  --max-round 50 \
  --max-concurrency 4 \
  --step-wait-time 3 \
  --log-file-root traj_logs/my_run \
  --enable-user-interaction

Important notes:

Add --enable-user-interaction when you want tasks that may ask or respond to the user.
Use --user student or another profile name to restrict evaluation to one persona.
Use --user-log-mode rag and --rag-backend embedding to inject only top-k relevant user-log snippets.
Use --user-log-source noise to evaluate robustness against noisy user histories.

5. View results

uv run mw logs results traj_logs/my_run
uv run mw logs view --log-dir traj_logs/my_run
uv run mw logs export --log-dir traj_logs/my_run -o exports/my_run

The log viewer gives you per-task trajectories, screenshots, actions, scores, and aggregate summaries.

🧰 Useful CLI Commands

mw env check: check Docker/KVM prerequisites and image status
mw env run: launch one or more benchmark containers
mw env list: list active containers
mw eval: run benchmark evaluation
mw test: run a single task for debugging
mw device: open the live Android device viewer
mw logs view: launch the interactive web log viewer
mw info task/agent/app: explore benchmark inventory

🤖 Built-In Agents

The current registry exposes these agent types:

gelab_agent, general_e2e, gui_owl_1_5, mai_ui_agent, planner_executor, qwen3.5, qwen3vl, seed_agent, ui_venus_agent

You can also pass a custom Python file path to --agent-type as long as it defines a class derived from BaseAgent.

📁 Repository Layout

src/mobile_world/tasks/definitions/      Benchmark task definitions
src/mobile_world/user_profile/           Structured user personas
src/mobile_world/user_logs/              Clean and noisy user histories
src/mobile_world/agents/implementations/ Built-in agent baselines
src/mobile_world/runtime/                Env client, controller, and app helpers
src/mobile_world/core/                   CLI, orchestration, server, log viewer
scripts/                                 Evaluation runners and metric calculators
docs/                                    Setup and development guides
site/                                    Website and leaderboard assets
assets/                                  Project figures used in the repository

🛠 Development

For development workflows, container restart behavior, VNC debugging, and source mounting, see:

A common dev workflow is:

uv run mw env run --dev --vnc
uv run mw env restart knowu_bench_env_0_dev
uv run mw env exec knowu_bench_env_0_dev

The scripts/ directory also contains batch runners and analysis helpers such as run_eval.sh, run_gpt_e2e.sh, calc_paper_metrics.py, and calc_pref_routine_accuracy.py.

⭐️ Citation

If you find this project useful, welcome to cite us.

@misc{chen2026knowubenchinteractiveproactivepersonalized,
      title={KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation}, 
      author={Tongbo Chen and Zhengxi Lu and Zhan Xu and Guocheng Shao and Shaohan Zhao and Fei Tang and Yong Du and Kaitao Song and Yizhou Liu and Yuchen Yan and Wenqi Zhang and Xu Tan and Weiming Lu and Jun Xiao and Yueting Zhuang and Yongliang Shen},
      year={2026},
      eprint={2604.08455},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.08455}, 
}

📄 License

This project is released under the Apache-2.0 License. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.github/workflows		.github/workflows
assets		assets
docker		docker
docs		docs
scripts		scripts
site		site
src/mobile_world		src/mobile_world
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Overview

📰 News

📊 Benchmark Snapshot

🧩 Benchmark Structure

General tasks

Personalized tasks

Proactive tasks

🚀 Installation

Requirements

Setup

⚡ Quick Start

1. Check host prerequisites

2. Launch benchmark environments

3. Inspect tasks, agents, and apps

4. Run an evaluation

5. View results

🧰 Useful CLI Commands

🤖 Built-In Agents

📁 Repository Layout

🛠 Development

⭐️ Citation

📄 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Overview

📰 News

📊 Benchmark Snapshot

🧩 Benchmark Structure

General tasks

Personalized tasks

Proactive tasks

🚀 Installation

Requirements

Setup

⚡ Quick Start

1. Check host prerequisites

2. Launch benchmark environments

3. Inspect tasks, agents, and apps

4. Run an evaluation

5. View results

🧰 Useful CLI Commands

🤖 Built-In Agents

📁 Repository Layout

🛠 Development

⭐️ Citation

📄 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages