Official repository for the paper: OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agents.
- [2026-01-13] 🎉 We have released the initial version of our paper, code, and project page.
- [2026-01-04] 🎉 Congratulations: OS-Symphony has achieved a score of 65.8 on the OSWorld Official Evaluation (using GPT-5 + UI-TARS-1.5-7B with 50 steps). As of now, this ranks 5th overall, 3rd among methods without multiple rollout, and 1st under the 50-steps constraint!
Note: The evaluation results reported in our paper are lower due to limitations within the virtual machine environment. You may compare against the metrics in our paper, but we strongly encourage comparing against the official evaluation results.
OS-Symphony is a holistic framework designed to address the robustness and generalization challenges faced by current Computer-Using Agents (CUAs). It introduces an Orchestrator that coordinates two key innovations:
- Reflection-Memory Agent (RMA): Utilizes milestone-driven long-term memory and a structured message protocol to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks.
- Versatile Tool Agents: Features a Multimodal Searcher that adopts a "SeeAct" paradigm to navigate the web and synthesize live, visually aligned tutorials, resolving fidelity issues in out-of-distribution scenarios.
By synergizing these components, OS-Symphony achieves robust automation across diverse operating systems and complex workflows.
OS-Symphony establishes new SOTA performance across three major benchmarks.
| Backbone | Steps | Success Rate |
|---|---|---|
| GPT-5 | 100 | 65.8% |
| GPT-5 | 50 | 63.6% |
| GPT-5-Mini | 50 | 58.1% |
| Qwen3-VL-32B-Thinking | 50 | 50.2% |
| Qwen3-VL-32B-Instruct | 50 | 46.9% |
| Backbone | Steps | Success Rate |
|---|---|---|
| GPT-5 | 50 | 63.5% |
| GPT-5-Mini | 50 | 62.2% |
| Qwen3-VL-32B-Thinking | 50 | 46.0% |
| Qwen3-VL-32B-Instruct | 50 | 45.3% |
| Backbone | Steps | Success Rate |
|---|---|---|
| GPT-5-Mini | 50 | 46.0% |
| Qwen3-VL-32B-Instruct | 50 | 19.1% |
Note: Our framework empowers open-source models (e.g., Qwen3-VL series) to achieve competitive performance, significantly narrowing the gap with proprietary SOTA models.
Set up the runtime virtual environment and install the necessary browser engines:
```bash
# Install Python dependencies
pip install -r requirements.txt
# Install Playwright browser binaries
playwright install
```

Configuring the Virtual Machine environments is a critical step. Please strictly follow the instructions in SETUP.md/SETUP_zh.md to download resources and configure the Golden Images for Linux, Windows, and MacOS.
Launch the evaluation using the provided shell script. You will need to modify the parameters in crucial_scripts/run_os_symphony.sh to match your experiments:
```bash
bash crucial_scripts/run_os_symphony_docker.sh
```

We now also support OSWorld Evaluation via AWS Cloud. You can skip Step 2. Instead, please configure your cloud services by referring to the official AWS documentation, and then run:

```bash
bash crucial_scripts/run_os_symphony_aws_official.sh
```

Key Configuration Parameters:
| Parameter | Description |
|---|---|
| path_to_vm | Path to the VM Golden Image. "/path/to/mac_hdd_ng.img /path/to/BaseSystem.img" |
| searcher_path_to_vm | Path to the Linux Search Environment image (/path/to/Ubuntu.qcow2). |
| num_envs | Number of concurrent processes for parallel evaluation. This primarily depends on your machine's resources and the throughput of the backend model. |
| proxy | Network proxy URL (Format: `http://<ip>:<port>`). Required for OSWorld and WindowsAgentArena. |
| client_password | VM login password. Use "password" for OSWorld (Docker), "osworld-public-evaluation" for OSWorld (AWS Cloud), and "1234" for MacOSArena. WindowsAgentArena does not need a password. |

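
The proxy format and per-benchmark passwords listed above can be checked before launching a run. The following is a minimal sketch, not part of the repository: the helper names and the dictionary keys (`osworld-docker`, `osworld-aws`, `macosarena`, `waa`) are hypothetical labels chosen for illustration, and only the password values and the `http://<ip>:<port>` format come from the table.

```python
import re

# Password values are taken from the parameter table above.
# The dictionary keys are hypothetical labels, not identifiers from the codebase.
CLIENT_PASSWORDS = {
    "osworld-docker": "password",
    "osworld-aws": "osworld-public-evaluation",
    "macosarena": "1234",
    "waa": None,  # WindowsAgentArena does not need a password
}

# Matches http://<ip>:<port> with a dotted IPv4 address.
PROXY_PATTERN = re.compile(r"http://(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})")


def is_valid_proxy(url: str) -> bool:
    """Return True if `url` has the form http://<ip>:<port> with sane ranges."""
    match = PROXY_PATTERN.fullmatch(url)
    if match is None:
        return False
    octets = [int(part) for part in match.group(1).split(".")]
    port = int(match.group(2))
    return all(0 <= octet <= 255 for octet in octets) and 1 <= port <= 65535


def client_password_for(target: str):
    """Look up the VM login password for a (hypothetical) target label."""
    return CLIENT_PASSWORDS[target]
```

A check such as `is_valid_proxy("http://127.0.0.1:7890")` could run before the shell script is invoked, so that a malformed proxy fails early rather than inside a VM session.
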
| Parameter | Description |
|---|---|
| xx_provider, xx_model, xx_url, xx_api_key, xx_temperature | Configuration for VLM inference (OpenAI-compatible API). We recommend using vLLM for open-source models. |
| coder_budget, searcher_budget | Maximum inner-loop iterations for the Coder and Searcher Agents (default: 20). |
| searcher_engine | Search engine provider. We recommend duckduckgo over Google to avoid CAPTCHA blocks. |
| memoryer_max_images | Maximum number of images retained in the Reflection-Memory Agent. |
| grounding_smart_resize | Enable for models requiring smart resizing (e.g., GTA1-32B, ScaleCUA series, UI-TARS-1.5). |
| orchestrator_keep_first_image | Whether to keep the initial screenshot in the context (default: True). |
| tool_config | Configuration for the action space, allowing dynamic assembly of tools. |

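
To make the image-related options more concrete, here is a minimal sketch of how a bounded screenshot history with an optional pinned first image might behave. It is an illustration under assumptions, not the repository's implementation: `trim_images` is a hypothetical helper, and whether the Reflection-Memory Agent and the Orchestrator actually share this kind of logic is not stated in this README.

```python
from typing import List, TypeVar

T = TypeVar("T")


def trim_images(images: List[T], max_images: int, keep_first: bool = True) -> List[T]:
    """Return at most `max_images` items from `images`.

    The most recent items are preferred. When `keep_first` is True, the
    first item (e.g. the initial screenshot) is always retained and the
    remaining slots are filled with the most recent items.
    """
    if max_images <= 0:
        return []
    if len(images) <= max_images:
        return list(images)
    if keep_first:
        # One slot is reserved for the first image; the rest go to the tail.
        tail_start = len(images) - (max_images - 1)
        return [images[0]] + list(images[tail_start:])
    return list(images[len(images) - max_images:])
```

For example, with five screenshots `s0` to `s4` and a limit of three, this sketch keeps `s0`, `s3`, and `s4` when the first image is pinned, and `s2`, `s3`, and `s4` otherwise.
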
| Parameter | Description |
|---|---|
| exp_name | Name of the experiment (defines the results directory). |
| enable_reflection | Whether to enable the Reflection-Memory Agent (RMA) module. |
| max_steps | Maximum number of steps allowed per task. |
| benchmark | Target benchmark. Supported values: osworld, waa, macosarena. |

Results are saved in results/{exp_name} and logs in logs/{exp_name}.log.
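
For a quick success-rate check without the web interface, the results directory can be scanned from Python. The sketch below rests on an assumption that this README does not confirm: that each task directory under `results/{exp_name}` contains a `result.txt` file holding a single numeric score. Both the file name and the `summarize_results` helper are hypothetical; adjust them to the layout your run actually produces.

```python
from pathlib import Path


def summarize_results(root_dir: str, result_filename: str = "result.txt") -> dict:
    """Average the numeric scores found in per-task result files.

    Assumption: each task directory holds a file named `result_filename`
    whose content is a single number. Files that cannot be parsed as a
    number are skipped.
    """
    scores = []
    for path in sorted(Path(root_dir).rglob(result_filename)):
        try:
            scores.append(float(path.read_text().strip()))
        except ValueError:
            continue  # not a numeric score, ignore this file
    total = len(scores)
    mean_score = sum(scores) / total if total else 0.0
    return {"tasks": total, "mean_score": mean_score}
```

Calling `summarize_results("results/my_experiment")`, where `my_experiment` stands in for your own `exp_name`, returns the number of scored tasks and their mean score.
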
To visualize the execution process and generate statistical reports, run the Gradio interface:
```bash
python gradio/gradio_show_result.py --root_dir results/{exp_name} --port 10000
```

Then open the web page (http://0.0.0.0:10000) to inspect the trajectory of each task.
- Unified Cross-Platform Evaluation: We decouple the agent logic from the OS environment, providing a unified interface to evaluate agents across Linux, Windows, and MacOS seamlessly.
- Enhanced Robustness: We have addressed numerous environment instability issues and bugs found in the original codebases of the supported benchmarks.

  Important: This repository includes modifications to the OSWorld environment. If you wish to use a codebase identical to the official version for a fair comparison, please refer to our implementation submitted to the official OSWorld repository; alternatively, migrating it to our framework is straightforward. Please note that our official results were obtained using the official repository, while all results reported in the paper are based on the current repository.
- Extensibility: Supports defining additional custom environments and tasks.
- Custom Workflows: A flexible architecture that lets you customize agent workflows and tool configurations.
We welcome the community to use our codebase to evaluate their own agents and tasks.
The core implementation of OS-Symphony is based on the Agent S series codebase; we extend our special thanks to them for their exceptional design. We also express our sincere gratitude to other pioneering projects for their contributions to GUI automation, including OSWorld, WindowsAgentArena, MacOSArena, UI-TARS series, GTA1, ScaleCUA, etc.
If you find this project useful in your research, please cite our paper:
```bibtex
@misc{yang2026ossymphony,
  title={OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent},
  author={Bowen Yang and Kaiming Jin and Zhenyu Wu and Zhaoyang Liu and Qiushi Sun and Zehao Li and JingJing Xie and Zhoumianze Liu and Fangzhi Xu and Kanzhi Cheng and Qingyun Li and Yian Wang and Yu Qiao and Zun Wang and Zichen Ding},
  year={2026},
  eprint={2601.07779},
  archivePrefix={arXiv},
  primaryClass={cs.MA},
  url={https://arxiv.org/abs/2601.07779},
}
```

