Android Bench is a framework for benchmarking Large Language Models (LLMs) on Android development tasks. It evaluates an AI model's ability to understand mobile codebases, generate accurate patches, and solve Android-specific engineering problems.
The repository provides the tooling to evaluate a model's ability to act as an Android developer. It takes an issue description, generates code modifications, and verifies those changes against a test suite in a standardized environment using a curated dataset.
- x86_64 host with KVM capabilities
- Python 3.14+
- uv (Fast Python package installer)
- Docker
- API keys for the models to benchmark
Note that using local images (a v1 limitation) is disk- and memory-intensive, with base, repo, and task images sometimes requiring 40+ GB of free space each.
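Before running setup, it can help to verify the prerequisites above. The sketch below is a hypothetical preflight script, not part of Android Bench; the tool names checked are the ones listed in this README.

```shell
#!/bin/sh
# Hypothetical preflight check for the prerequisites listed above.

# True if dotted version $1 is at least $2 (uses GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

py_ver="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null || echo 0.0)"
version_ge "$py_ver" "3.14" && echo "python ok ($py_ver)" || echo "python too old ($py_ver)"

# Check that the required tools are on PATH.
for tool in git uv docker; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done

# The KVM device node is required for the Android emulator on x86_64 hosts.
[ -e /dev/kvm ] || echo "kvm missing"
```

The disk-space requirement is best checked manually with `df -h`, since the required amount depends on how many task images you build.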
git clone https://github.com/android-bench/android-bench.git
cd android-bench
# Create and activate the virtual environment
uv venv
source .venv/bin/activate
# Run the setup script
uv run setup_env
The setup_env script takes care of the following:
- Ensures all dependencies are installed.
- Configures the oracle agent with golden patches for testing.
- Generates the summary.json for the dataset explorer.
- Detects your host architecture (x86/AMD64 or ARM64) and builds the Docker images, or exits gracefully if incompatible.
You must configure API keys for the models you want to benchmark. Our inference agent is based on mini-swe-agent, which by default supports all models via LiteLLM.
Before you use a model, export its corresponding API key as an environment variable:
export GEMINI_API_KEY="your-api-key"
export OPENAI_API_KEY="your-api-key"
You can also run the setup script mini-extra config setup.
If you run into authentication issues, we recommend you check their troubleshooting guide.
Always include the provider in the model name, e.g.,
gemini/gemini-...
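Because the provider prefix determines which key is needed, a fail-fast guard can catch a missing key before inference starts. This is a sketch, not Android Bench code; the provider-to-variable mapping follows common LiteLLM conventions and is an assumption.

```shell
#!/bin/sh
# Sketch: derive the API-key env var a "provider/model" name needs.
# Mapping is an assumption based on LiteLLM conventions.
key_var_for() {
  case "$1" in
    gemini/*)    echo GEMINI_API_KEY ;;
    openai/*)    echo OPENAI_API_KEY ;;
    anthropic/*) echo ANTHROPIC_API_KEY ;;
    *)           echo "" ;;
  esac
}

model="gemini/gemini-2.5-flash"
var="$(key_var_for "$model")"
if [ -z "$var" ]; then
  echo "unknown provider in model name: $model"
else
  # Indirectly read the variable named in $var and warn if it is unset.
  eval "val=\${$var:-}"
  [ -n "$val" ] && echo "$var is set" || echo "warning: $var is not set"
fi
```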
The benchmarking process has two stages:
- Inference (Agent): The agent reads the issue description and generates a code patch.
- Evaluation (Verifier): The verifier applies the patch and runs tests to score the solution.
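The two stages can be pictured as a hand-off through a patch file. The functions below are illustrative stubs that simulate the data flow, not the real Android Bench entry points.

```shell
#!/bin/sh
# Toy simulation of the two-stage pipeline; run_inference and run_evaluation
# are illustrative stubs, not actual Android Bench commands.

run_inference() {   # stage 1: the agent reads the issue and emits a patch
  printf 'diff --git a/App.kt b/App.kt\n' > "$2"
  echo "patch generated for $1"
}

run_evaluation() {  # stage 2: the verifier applies the patch and runs tests
  [ -s "$1" ] && echo "result: pass" || echo "result: fail"
}

patch_file="$(mktemp)"
run_inference android_snippets_1 "$patch_file"
run_evaluation "$patch_file"
rm -f "$patch_file"
```

In the real harness both stages run inside the task's Docker container; the stub only shows the shape of the hand-off.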
To browse available tasks or understand the dataset structure:
# Launch the interactive explorer
dataset
This launches an interactive wizard. You can also run specific subcommands directly:
dataset browse --category compose
dataset inspect <task_id>
For filtering and usage, see the Task Visualizer Guide.
Note on Docker Images: Tasks run in isolated Docker containers. The very first time you run a task, the framework will build task-specific Docker images locally based on the dataset configurations.
This initial cold-start can take 5-10+ minutes. Subsequent runs should be significantly faster.
macOS / ARM64 Users: The Android SDK does not provide an ARM64 Linux emulator package. Executing the benchmark locally on macOS Docker Desktop is severely limited due to the lack of nested virtualization (KVM). See the Troubleshooting guide for workarounds.
To run the complete pipeline (inference and evaluation) for a specific task ID:
run_task --model gemini/gemini-2.5-flash --task android_snippets_1
The Oracle Agent applies the known canonical solutions to verify that the evaluation infrastructure works as expected.
# Setup the oracle agent (setup_env handles this)
# Run the verifier in test mode
verifier --test-run --run-name oracle-agent
This project uses pytest for unit and integration testing. Run the CI test suite with:
pytest --log-cli-level=INFO --verbose
You must have a Gemini API key configured for the test suite to pass.
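Since the suite needs a Gemini key, a small wrapper can skip it gracefully when the key is absent. `run_suite` is an illustrative helper for local convenience, not part of the repository.

```shell
#!/bin/sh
# Illustrative wrapper: only run the suite when the required key is present.
run_suite() {
  if [ -z "${GEMINI_API_KEY:-}" ]; then
    echo "GEMINI_API_KEY not set; skipping test suite"
    return 1
  fi
  pytest --log-cli-level=INFO --verbose
}

# Demonstrate the skip path in a subshell, without touching your real environment:
(unset GEMINI_API_KEY; run_suite) || true
```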
To visualize the results, generate the HTML summary with the following command:
results --input-dir our
Remember to change --input-dir to the directory of your choice if you decide to store the results elsewhere.
For more comprehensive guides and architectural details, refer to the following resources:
- User Guide: Comprehensive instructions on CLI commands, framework setup, benchmark dataset, and harness architecture.
- Viewing and Interpreting Results: Guide to locating output files, understanding the scores.json schema, and deciphering diagnostic status codes.
- Troubleshooting Guide: Solutions for common issues like Docker build failures, compilation errors, and missing patch generations.
- Technical Report: A deep-dive into the methodology, dataset construction, and baseline results.
We are currently exploring ways to engage with the open-source community. As we prepare the repository for broader contributions, we highly encourage you to provide feedback via the issue tracker. Please see our Contributing Guidelines for more details.
Android Bench is licensed under the Apache License, Version 2.0. See the LICENSE file for details.