OS-Symphony

A Holistic Framework for Robust and Generalist Computer-Using Agents

Official repository for the paper: OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agents.

📑 Table of Contents

🗞️ Updates
💡 Overview
📊 Results
🛠️ Environment & Setup
✨ Features
😊 Acknowledgement
📃 Citation

🗞️ Updates

[2026-01-13] 🎉 We have released the initial version of our paper, code, and project page.
[2026-01-04] 🎉 OS-Symphony has achieved a score of 65.8 on the OSWorld Official Evaluation (using GPT-5 + UI-TARS-1.5-7B with 50 steps). As of this date, it ranks 5th overall, 3rd among methods without multiple rollouts, and 1st under the 50-step constraint!

Note: The evaluation results reported in our paper are lower because of limitations in our virtual machine environment. You may compare against the metrics in our paper, but we strongly encourage comparing against the official evaluation results.

💡 Overview

OS-Symphony is a holistic framework designed to address the robustness and generalization challenges faced by current Computer-Using Agents (CUAs). It introduces an Orchestrator that coordinates two key innovations:

Reflection-Memory Agent (RMA): Utilizes milestone-driven long-term memory and a structured message protocol to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks.
Versatile Tool Agents: Features a Multimodal Searcher that adopts a "SeeAct" paradigm to navigate the web and synthesize live, visually aligned tutorials, resolving fidelity issues in out-of-distribution scenarios.

By synergizing these components, OS-Symphony achieves robust automation across diverse operating systems and complex workflows.
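The coordination described above can be pictured with a small toy loop. This is only an illustrative sketch, not the repository's implementation: `ReflectionMemory`, `run_episode`, `toy_policy`, and the `"searcher"` tool name are all hypothetical, and the milestone-driven memory is reduced to simple loop detection. It shows the control flow only: the orchestrator acts, the reflection component reports whether the trajectory looks stuck, and a tool agent is called when it does.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ReflectionMemory:
    """Toy stand-in for the RMA (hypothetical): records actions and flags loops."""
    actions: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)

    def reflect(self, action: str) -> str:
        # Three identical actions in a row are treated as a sign of being stuck.
        stuck = self.actions[-2:] == [action, action]
        self.actions.append(action)
        return "off_track" if stuck else "on_track"


def toy_policy(history: List[str], feedback: str) -> str:
    """Hypothetical policy: repeats one click until told it is off track."""
    if feedback == "off_track":
        return "searcher"  # ask a tool agent for help
    if history and history[-1] == "searcher":
        return "done"
    return "click_menu"


def run_episode(
    policy: Callable[[List[str], str], str],
    tools: Dict[str, Callable[[], str]],
    memory: ReflectionMemory,
    max_steps: int,
) -> List[str]:
    """Run the act -> reflect loop until the policy says 'done' or steps run out."""
    feedback = "on_track"
    for _ in range(max_steps):
        action = policy(memory.actions, feedback)
        if action == "done":
            break
        if action in tools:
            # Tool agents return text (e.g. a tutorial) that is kept as a note.
            memory.notes.append(tools[action]())
        feedback = memory.reflect(action)
    return memory.actions


if __name__ == "__main__":
    mem = ReflectionMemory()
    trace = run_episode(toy_policy, {"searcher": lambda: "tutorial text"}, mem, max_steps=10)
    print(trace, mem.notes)
```

In this toy run the policy repeats the same click three times, the reflection step reports `off_track`, and the orchestrator falls back to the searcher tool before finishing.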

📊 Results

OS-Symphony establishes new SOTA performance across three major benchmarks.

🐧 OSWorld-Verified (Ubuntu)

Backbone	Steps	Success Rate
GPT-5	100	65.8%
GPT-5	50	63.6%
GPT-5-Mini	50	58.1%
Qwen3-VL-32B-Thinking	50	50.2%
Qwen3-VL-32B-Instruct	50	46.9%

🪟 WindowsAgentArena (Windows)

Backbone	Steps	Success Rate
GPT-5	50	63.5%
GPT-5-Mini	50	62.2%
Qwen3-VL-32B-Thinking	50	46.0%
Qwen3-VL-32B-Instruct	50	45.3%

🍎 MacOSArena (macOS)

Backbone	Steps	Success Rate
GPT-5-Mini	50	46.0%
Qwen3-VL-32B-Instruct	50	19.1%

Note: Our framework empowers open-source models (e.g., Qwen3-VL series) to achieve competitive performance, significantly narrowing the gap with proprietary SOTA models.

🛠️ Environment & Setup

1. Installation

Set up the Python runtime environment and install the required browser engines:

# Install Python dependencies
pip install -r requirements.txt

# Install Playwright browser binaries
playwright install

2. VM Configuration

Configuring the virtual machine environments is a critical step. Follow the instructions in SETUP.md (or SETUP_zh.md) closely to download the resources and configure the Golden Images for Linux, Windows, and macOS.

3. Running Evaluation

Launch the evaluation using the provided shell script. You will need to modify the parameters in crucial_scripts/run_os_symphony.sh to match your experiment setup:

bash crucial_scripts/run_os_symphony_docker.sh

OSWorld evaluation via AWS Cloud is also supported. In that case you can skip Step 2: configure your cloud services by referring to the official AWS documentation, and then run:

bash crucial_scripts/run_os_symphony_aws_official.sh

Key Configuration Parameters:

🖥️ Environment Settings

Parameter	Description
`path_to_vm`	Path to the VM Golden Image. ⚠️ For MacOSArena: Must be two paths separated by a space: `"/path/to/mac_hdd_ng.img /path/to/BaseSystem.img"`
`searcher_path_to_vm`	Path to the Linux Search Environment image (`/path/to/Ubuntu.qcow2`).
`num_envs`	Number of concurrent processes for parallel evaluation. This primarily depends on your machine's resources and the throughput of the backend model.
`proxy`	Network proxy URL (Format: `http://<ip>:<port>`). Required for OSWorld and WindowsAgentArena.
`client_password`	VM login password. Use `"password"` for OSWorld (Docker), `"osworld-public-evaluation"` for OSWorld (AWS Cloud), and `"1234"` for MacOSArena. WindowsAgentArena does not need a password.
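Several of these settings have strict formats, so it can help to check them before launching a long run. The following is a minimal sketch under stated assumptions: the helper names (`split_vm_paths`, `expected_client_password`, `is_valid_proxy`) and the `on_aws` flag are hypothetical and not part of the repository; the benchmark names and password values are the ones listed in this README.

```python
from typing import List, Optional
from urllib.parse import urlparse


def split_vm_paths(benchmark: str, path_to_vm: str) -> List[str]:
    """Split path_to_vm on whitespace and check the expected number of paths.

    MacOSArena expects two space-separated image paths; the other benchmarks
    expect one. Paths that themselves contain spaces are not handled here.
    """
    paths = path_to_vm.split()
    expected = 2 if benchmark == "macosarena" else 1
    if len(paths) != expected:
        raise ValueError(f"{benchmark}: expected {expected} path(s), got {len(paths)}")
    return paths


def expected_client_password(benchmark: str, on_aws: bool = False) -> Optional[str]:
    """Return the VM password from the table above (None means no password)."""
    if benchmark == "osworld":
        return "osworld-public-evaluation" if on_aws else "password"
    if benchmark == "macosarena":
        return "1234"
    if benchmark == "waa":
        return None
    raise ValueError(f"unknown benchmark: {benchmark}")


def is_valid_proxy(proxy: str) -> bool:
    """Check that the proxy looks like http://<ip>:<port>."""
    parsed = urlparse(proxy)
    try:
        port = parsed.port  # raises ValueError for a malformed port
    except ValueError:
        return False
    return parsed.scheme == "http" and bool(parsed.hostname) and port is not None


if __name__ == "__main__":
    print(split_vm_paths("macosarena", "/path/to/mac_hdd_ng.img /path/to/BaseSystem.img"))
    print(expected_client_password("osworld", on_aws=True))
    print(is_valid_proxy("http://127.0.0.1:7890"))
```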

🤖 Agent Settings

Parameter	Description
`xx_provider`, `xx_model`, `xx_url`, `xx_api_key`, `xx_temperature`	Configuration for VLM inference (OpenAI-compatible API). We recommend using vLLM for open-source models.
`coder_budget`, `searcher_budget`	Maximum inner-loop iterations for the Coder and Searcher Agents (default: 20).
`searcher_engine`	Search engine provider. We recommend `duckduckgo` over Google to avoid CAPTCHA blocks.
`memoryer_max_images`	Maximum number of images retained in the Reflection-Memory Agent.
`grounding_smart_resize`	Enable for models requiring smart resizing (e.g., GTA1-32B, ScaleCUA series, UI-TARS-1.5).
`orchestrator_keep_first_image`	Whether to keep the initial screenshot in the context (default: True).
`tool_config`	Configuration for the action space, allowing dynamic assembly of tools.
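To give a feel for what image-retention settings such as `memoryer_max_images` and `orchestrator_keep_first_image` control, here is a generic sketch of bounding a screenshot history. It is an assumption about the general idea, not the repository's code: `trim_images` is a hypothetical helper, and in the actual framework the two settings belong to different agents.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def trim_images(images: Sequence[T], max_images: int, keep_first: bool = True) -> List[T]:
    """Return at most max_images items, preferring the most recent ones.

    With keep_first=True the initial screenshot is always retained and the
    remaining slots are filled with the latest screenshots.
    """
    if max_images <= 0:
        return []
    if len(images) <= max_images:
        return list(images)
    if keep_first:
        # Reserve one slot for the first image, fill the rest from the tail.
        tail_start = len(images) - (max_images - 1)
        return [images[0]] + list(images[tail_start:])
    return list(images[-max_images:])


if __name__ == "__main__":
    shots = ["s0", "s1", "s2", "s3", "s4"]
    print(trim_images(shots, 3))                    # first image plus the two latest
    print(trim_images(shots, 3, keep_first=False))  # the three latest only
```

Keeping the first screenshot preserves the initial state of the task, while the sliding window keeps the context size bounded on long trajectories.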

🧪 Experiment Settings

Parameter	Description
`exp_name`	Name of the experiment (defines the results directory).
`enable_reflection`	Whether to enable the Reflection-Memory Agent (RMA) module.
`max_steps`	Maximum number of steps allowed per task.
`benchmark`	Target benchmark. Supported values: `osworld`, `waa`, or `macosarena`.

4. Visualization

Results are saved in results/{exp_name} and logs in logs/{exp_name}.log.

To visualize the execution process and generate statistical reports, run the Gradio interface:

python gradio/gradio_show_result.py --root_dir results/{exp_name} --port 10000

Then open http://0.0.0.0:10000 in a browser to inspect the trajectory of each task.
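For a quick command-line summary without the Gradio interface, a short script can aggregate per-task scores. This sketch rests on an assumption that is not confirmed by this README: that each task directory under results/{exp_name} contains a `result.txt` file holding a single numeric score (the convention used by OSWorld-style runners). The function name `summarize_results` is hypothetical; adjust the file name if your layout differs.

```python
from pathlib import Path
from typing import Tuple


def summarize_results(root_dir: str, filename: str = "result.txt") -> Tuple[int, float]:
    """Return (number of scored tasks, mean score as a percentage).

    Assumes each task directory holds a file with one numeric score;
    files that cannot be parsed as a number are skipped.
    """
    scores = []
    for path in sorted(Path(root_dir).rglob(filename)):
        try:
            scores.append(float(path.read_text().strip()))
        except ValueError:
            continue  # skip malformed or empty result files
    if not scores:
        return 0, 0.0
    return len(scores), 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py results/{exp_name}
    count, rate = summarize_results(sys.argv[1] if len(sys.argv) > 1 else "results")
    print(f"{count} tasks scored, success rate {rate:.1f}%")
```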

✨ Features

Unified Cross-Platform Evaluation: We decouple the agent logic from the OS environment, providing a unified interface to evaluate agents across Linux, Windows, and macOS.
Enhanced Robustness: We have addressed numerous environment instability issues and bugs found in the original codebases of the supported benchmarks.

Important: This repository includes modifications to the OSWorld environment. If you want a codebase identical to the official version for a fair comparison, please refer to our implementation submitted to the official OSWorld repository; alternatively, migrating it to our framework is straightforward. Please note that our official results were obtained using the official repository, while all results reported in the paper are based on the current repository.
Extensibility: Support for defining additional custom environments and tasks.
Custom Workflows: A flexible architecture that lets you customize agent workflows and tool configurations.

We welcome the community to use our codebase for evaluating your own agents and tasks.

😊 Acknowledgement

The core implementation of OS-Symphony is based on the Agent S series codebase; we extend our special thanks to them for their exceptional design. We also express our sincere gratitude to other pioneering projects for their contributions to GUI automation, including OSWorld, WindowsAgentArena, MacOSArena, UI-TARS series, GTA1, ScaleCUA, etc.

📃 Citation

If you find this project useful in your research, please cite our paper:

@misc{yang2026ossymphony,
      title={OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent}, 
      author={Bowen Yang and Kaiming Jin and Zhenyu Wu and Zhaoyang Liu and Qiushi Sun and Zehao Li and JingJing Xie and Zhoumianze Liu and Fangzhi Xu and Kanzhi Cheng and Qingyun Li and Yian Wang and Yu Qiao and Zun Wang and Zichen Ding},
      year={2026},
      eprint={2601.07779},
      archivePrefix={arXiv},
      primaryClass={cs.MA},
      url={https://arxiv.org/abs/2601.07779}, 
}
