xlang-ai/OSWorld-V2

Website - Paper - Doc - Data - Trajectory Viewer - Trajectory Download

PRs welcome Last commit Apache 2.0 license

📢 Updates

💾 Setup Evaluation Environment

Migrating from OSWorld 1.0

If you are coming from xlang-ai/OSWorld, see Migrating from OSWorld 1.0 to OSWorld 2.0 for the full checklist covering dependency setup, task conversion, provider reuse, AWS notes, mocked websites, GitLab, agent migration, and result comparability.

Setup with Agent

Clone this repository:

git clone https://github.com/xlang-ai/OSWorld-V2
cd OSWorld-V2

We provide a skill called setup-osworld. You can use your favorite agent to help you set up with a single prompt!

Sample prompt:

Use $setup-osworld to provision this OSWorld 2.0 checkout after clone. Ask me first which supported provider and optional services I want, create or verify the required infrastructure where possible, ask before any cloud spend, DNS, SSH, or secret step, then report what is configured versus blocked and give me the final export commands I need to run OSWorld.

Setup Manually

Package Setup

git clone https://github.com/xlang-ai/OSWorld-V2
cd OSWorld-V2
uv sync

pyproject.toml requires Python >= 3.12. Optional full dependencies for heavier agents or OCR/model stacks (needed only for v1 tasks) can be installed with:

uv sync --extra full

Environment Provider Setup

OSWorld 2.0 can run with different environment providers depending on where you test:

  • Docker for Linux servers, especially hosts with KVM support.
  • AWS for large-scale parallel evaluation or training infrastructure.

Select the matching provider in runner commands with --provider_name. The environment provider layer also includes VMware, Azure, GCP, Aliyun, and Volcengine code paths. Currently, images are provided only for the Docker and AWS providers. To use another provider, see Migrating from OSWorld 1.0 to OSWorld 2.0 for how to update a 1.0 image to 2.0.

For AWS, make sure the client VM security group allows the V2 task service ports 3000 and 8000 in addition to the standard OSWorld backend and control ports documented in Provider Setup.
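As a quick sanity check of the security group rule above, the following minimal sketch tries a TCP connection to each V2 task service port. The function name check_ports and the example address are hypothetical and not part of OSWorld; only ports 3000 and 8000 come from this README.

```python
import socket


def check_ports(host: str, ports: list[int], timeout: float = 3.0) -> dict[int, bool]:
    """Return {port: True/False} for whether a TCP connection to host:port succeeds."""
    results: dict[int, bool] = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or routing failure.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


# Example call (hypothetical address; replace with your client VM's IP):
# check_ports("203.0.113.10", [3000, 8000])
```

A False result does not by itself prove that the security group is at fault: the port also appears closed if the task service has not started yet on the VM.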

Mocked Website and GitLab Setup

Some OSWorld 2.0 tasks use mocked websites and GitLab.

For mocked websites, you can use the services hosted by the OSWorld 2.0 team:

export WEBSITE_HOST_SUFFIX="web.hku.icu"

You can also self-host the mocked websites by following Task-Web/OSWorld-web, then setting WEBSITE_HOST_SUFFIX to your own host suffix.

For GitLab, you must self-host because GitLab-backed tasks require a private token, and exposing a shared hosted token creates a security risk. Follow Task-Web/gitlab, then set the matching environment variables before running tasks:

export GITLAB_URL="<your-gitlab-url>"
export GITLAB_PRIVATE_TOKEN="<your-private-token>"

Download Gated Task Classes

The official OSWorld 2.0 Python task classes are distributed through the gated Hugging Face dataset xlangai/osworld_v2_tasks, not directly in this public GitHub checkout. This reduces benchmark leakage and helps prevent evaluated agents from finding task answers, setup logic, or evaluator details online while executing a task.

To download them:

  1. Visit https://huggingface.co/datasets/xlangai/osworld_v2_tasks while logged in to Hugging Face and accept the gated access request. Requests are approved automatically.
  2. Log in locally:
uvx --from huggingface_hub hf auth login
  3. Download or update the task classes:
uv run scripts/tools/download_osworld_v2_tasks.py

To download the task files for a specific benchmark release, pass the release manifest name or JSON path:

uv run scripts/tools/download_osworld_v2_tasks.py --benchmark-release osworld-v2-2026.06.24

Official benchmark release manifests live in benchmark_releases/. See Benchmark Releases for how releases pin task, website, code, and provider-image versions for comparable runs.

The script writes files to evaluation_examples/task_class, overwrites existing task_*.py files when re-run, and removes stale local task files by default. Use --dry-run to preview changes or --keep-stale to keep local task files that are no longer present in the gated dataset.
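The overwrite and stale-file behavior described above can be pictured with a small, filesystem-free sketch. This is not the code of download_osworld_v2_tasks.py; plan_sync is a hypothetical helper that only models the documented semantics, and it assumes that files not matching task_*.py are left alone.

```python
from fnmatch import fnmatchcase


def plan_sync(local: set[str], remote: set[str], keep_stale: bool = False) -> dict[str, list[str]]:
    """Plan which task files would be written and removed, without touching disk."""

    def is_task(name: str) -> bool:
        return fnmatchcase(name, "task_*.py")

    local_tasks = {name for name in local if is_task(name)}
    remote_tasks = {name for name in remote if is_task(name)}
    return {
        # Every remote task file is written: new files are created, existing ones overwritten.
        "write": sorted(remote_tasks),
        # Local task files missing from the remote set are stale; keep_stale spares them.
        "remove": [] if keep_stale else sorted(local_tasks - remote_tasks),
    }
```

Returning a plan instead of acting on it mirrors what --dry-run is for: you can inspect the "remove" list before any local task file is deleted.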

See task class download notes for details.

Proxy Configuration

Some tasks may require proxy settings depending on website defenses and network location. See Proxy Guideline.

  • Impact of Missing Configuration: If these settings are missing, corresponding tasks may fail and lower the evaluation score.

🧪 Evaluation

Run Sample

For a concrete multi-environment Claude run, start from scripts/bash/run_multienv_claude.sh. Before executing it, fill in every environment variable required by your provider, model endpoint, and selected tasks. At minimum, check the AWS variables, model API key environment variable, OSWorld client password, website host suffix, and any GitLab credentials needed by the task set.

The sample script exports the required variables near the top, then calls scripts/python/run_multienv_claude.py through uv. Adjust options such as --result_dir, --test_all_meta_path, --num_envs, --provider_name, and model settings for your run.

# Edit scripts/bash/run_multienv_claude.sh first:
# - AWS_REGION, AWS_SUBNET_ID, AWS_SECURITY_GROUP_ID
# - AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
# - ANTHROPIC_API_KEY, or the API key named by OSWORLD_EVAL_MODEL_API_KEY_ENV
# - OSWORLD_CLIENT_PASSWORD
# - WEBSITE_HOST_SUFFIX
# - GITLAB_URL and GITLAB_PRIVATE_TOKEN, if the selected tasks use GitLab

bash scripts/bash/run_multienv_claude.sh
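Before launching a long run, it can help to confirm that none of the variables listed above is empty. The sketch below is a hypothetical pre-flight helper, not part of the repository. It assumes that OSWORLD_EVAL_MODEL_API_KEY_ENV, when set, holds the name of the variable that contains the model API key, and that ANTHROPIC_API_KEY is used otherwise.

```python
import os
from collections.abc import Mapping

# Variable names taken from the checklist in this README.
BASE_VARS = [
    "AWS_REGION", "AWS_SUBNET_ID", "AWS_SECURITY_GROUP_ID",
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
    "OSWORLD_CLIENT_PASSWORD", "WEBSITE_HOST_SUFFIX",
]
GITLAB_VARS = ["GITLAB_URL", "GITLAB_PRIVATE_TOKEN"]


def missing_vars(env: Mapping[str, str], uses_gitlab: bool = False) -> list[str]:
    """Return the names of required variables that are unset or empty in env."""
    # Assumption: this variable names the API key variable; fall back to ANTHROPIC_API_KEY.
    api_key_var = env.get("OSWORLD_EVAL_MODEL_API_KEY_ENV") or "ANTHROPIC_API_KEY"
    required = BASE_VARS + [api_key_var] + (GITLAB_VARS if uses_gitlab else [])
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Prints the names that still need a value in the current shell environment.
    print(missing_vars(os.environ, uses_gitlab=False))
```

Pass uses_gitlab=True when the selected task set includes GitLab-backed tasks, so the two GitLab variables are checked as well.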

Manual Task Examination

Use scripts/python/manual_examine.py when you want to inspect a specific benchmark task by hand, verify task behavior, debug evaluator metrics, or record a short manual trajectory with screenshots and videos.

uv run python scripts/python/manual_examine.py \
  --headless \
  --provider_name aws \
  --observation_type screenshot \
  --result_dir ./results_human_examine \
  --test_config_base_dir evaluation_examples \
  --domain tasks \
  --eval_version v2 \
  --example_id 146 \
  --max_steps 3

Local Evaluation

Implement the agent interface and import your agent in run.py (for single-threaded execution) or in scripts/python/run_multienv.py / scripts/python/run_multienv_xxx.py (for parallel execution). For reference implementations, see the current Claude and GPT runners, such as scripts/python/run_multienv_claude.py and scripts/python/run_multienv_gpt_response_api.py.

If you are migrating an agent from OSWorld 1.0, you can use the migrate-osworld-agent skill to help port and test the agent. Sample migration prompt:

Use $migrate-osworld-agent to migrate my OSWorld 1.0 agent into this OSWorld 2.0 checkout. Read the upstream agent <agent file path> and the local Claude/GPT runner patterns, add the adapted agent under mm_agents/, create matching multi-env Python and bash entrypoints, then run the fast checks and one small smoke test without staging unrelated local files.

Afterward, you can run the benchmark on your agent with a command similar to the one in Run Sample.

Public Evaluation

If you want your results to be verified and displayed on the verified leaderboard, schedule a meeting with us (current maintainers: yuanmengqi732@gmail.com and zzl0712@connect.hku.hk) so that we can run your agent code on our side and report the results. You need to upload your agent implementation under the OSWorld framework and allow us to disclose it (you may choose not to expose your model API to the public), along with a report that lets the public understand what is happening behind the scenes. Alternatively, if you are from a trusted institution, you can share your monitoring data and trajectories with us. Please carefully follow the Setup Guideline - Public Evaluation Platform to get results.

❓ FAQ

What is the username and password for the virtual machines?

For all providers in OSWorld 2.0, the Ubuntu account credentials are user / osworld-public-evaluation.

If you modify credentials, pass the correct client_password to DesktopEnv and to agents that require it. Some setup and task actions require sudo privileges inside the VM.

How do I configure a proxy for the VM?

If you are behind a restricted network, or if some tasks are blocked by website defenses, see Proxy Guideline. The public evaluation guide also describes a pre-configured proxy option.

Where are task files stored?

OSWorld 2.0 task classes are downloaded into evaluation_examples/task_class with scripts/tools/download_osworld_v2_tasks.py. Large initial files, ground-truth files, and other task assets are typically stored in Hugging Face dataset cache repositories such as xlangai/osworld_v2_assets.

📚 Citation

If you find this environment useful, please consider citing OSWorld:

@misc{osworld2,
    title = {OSWorld 2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks},
    author = {Mengqi Yuan and Zilong Zhou and Xinzhuang Xiong and Weiming Wu and Jiayang Sun and Jiamin Song and Kaiqian Cui and Bowen Wang and Haoyuan Wu and Yitong Li and Dunjie Lu and Haikong Lu and Qi Zhen and Xinyuan Wang and Jiaqi Deng and Yuhao Yang and Cheng Chen and Boyuan Zheng and Alex Su and Xiao Yu and Hao Zou and Saaket Agashe and Xing Han L{\"u} and Manpreet Kaur and Yi Liang and Junli Wang and Zhengyang Qi and Vincent Sunn Chen and Frederic Sala and Dayiheng Liu and Junyang Lin and Zhou Yu and Yu Su and Siva Reddy and Xin Eric Wang and Peng Qi and Tianbao Xie and Tao Yu},
    year = {2026}
}

Acknowledgement for OSWorld 2.0

We thank Cheng Chang, Dawn Song, Delin Chen, Ke Xu, Qiyue Xu, Ruiling Xu, Shengwei Wang, Yanzhuo Lin, Yimo Cai, Yiyong Sun, and Yutong Yao for their helpful feedback and for contributing materials to this benchmark. We thank Snorkel AI, our research & data partner, for their support of this work. We gratefully acknowledge support from the Google Research gift fund.