OS-Symphony

A Holistic Framework for Robust and Generalist Computer-Using Agents

Official repository for the paper: OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agents.

🗞️ Updates

  • [2026-01-13] 🎉 We have released the initial version of our paper, code, and project page.

  • [2026-01-04] 🎉 Congratulations: OS-Symphony has achieved a score of 65.8 on the OSWorld Official Evaluation (using GPT-5 + UI-TARS-1.5-7B with 50 steps). As of now, this ranks 5th overall, 3rd among methods without multiple rollouts, and 1st under the 50-step constraint!

    Note: The evaluation results reported in our paper are lower due to limitations within the virtual machine environment. You may compare against the metrics in our paper, but we strongly encourage comparing against the official evaluation results.

💡 Overview

OS-Symphony is a holistic framework designed to address the robustness and generalization challenges faced by current Computer-Using Agents (CUAs). It introduces an Orchestrator that coordinates two key innovations:

  1. Reflection-Memory Agent (RMA): Utilizes milestone-driven long-term memory and a structured message protocol to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks.
  2. Versatile Tool Agents: Features a Multimodal Searcher that adopts a "SeeAct" paradigm to navigate the web and synthesize live, visually aligned tutorials, resolving fidelity issues in out-of-distribution scenarios.

By synergizing these components, OS-Symphony achieves robust automation across diverse operating systems and complex workflows.

📊 Results

OS-Symphony establishes new SOTA performance across three major benchmarks.

🐧 OSWorld-Verified (Ubuntu)

| Backbone | Steps | Success Rate |
| --- | --- | --- |
| GPT-5 | 100 | 65.8% |
| GPT-5 | 50 | 63.6% |
| GPT-5-Mini | 50 | 58.1% |
| Qwen3-VL-32B-Thinking | 50 | 50.2% |
| Qwen3-VL-32B-Instruct | 50 | 46.9% |

🪟 WindowsAgentArena (Windows)

| Backbone | Steps | Success Rate |
| --- | --- | --- |
| GPT-5 | 50 | 63.5% |
| GPT-5-Mini | 50 | 62.2% |
| Qwen3-VL-32B-Thinking | 50 | 46.0% |
| Qwen3-VL-32B-Instruct | 50 | 45.3% |

🍎 MacOSArena (MacOS)

| Backbone | Steps | Success Rate |
| --- | --- | --- |
| GPT-5-Mini | 50 | 46.0% |
| Qwen3-VL-32B-Instruct | 50 | 19.1% |

Note: Our framework empowers open-source models (e.g., Qwen3-VL series) to achieve competitive performance, significantly narrowing the gap with proprietary SOTA models.

🛠️ Environment & Setup

1. Installation

Set up the runtime virtual environment and install the necessary browser engines:

# Install Python dependencies
pip install -r requirements.txt

# Install Playwright browser binaries
playwright install

2. VM Configuration

Configuring the Virtual Machine environments is a critical step. Please strictly follow the instructions in SETUP.md/SETUP_zh.md to download resources and configure the Golden Images for Linux, Windows, and MacOS.

3. Running Evaluation

Launch the evaluation using the provided shell script. You will need to modify the parameters in crucial_scripts/run_os_symphony.sh to match your experiments:

bash crucial_scripts/run_os_symphony_docker.sh

We now also support OSWorld Evaluation via AWS Cloud. You can skip Step 2. Instead, please configure your cloud services by referring to the official AWS documentation, and then run:

bash crucial_scripts/run_os_symphony_aws_official.sh

Key Configuration Parameters:

🖥️ Environment Settings

| Parameter | Description |
| --- | --- |
| path_to_vm | Path to the VM Golden Image. ⚠️ For MacOSArena: must be two paths separated by a space: "/path/to/mac_hdd_ng.img /path/to/BaseSystem.img" |
| searcher_path_to_vm | Path to the Linux Search Environment image (/path/to/Ubuntu.qcow2). |
| num_envs | Number of concurrent processes for parallel evaluation. This primarily depends on your machine's resources and the throughput of the backend model. |
| proxy | Network proxy URL (format: http://<ip>:<port>). Required for OSWorld and WindowsAgentArena. |
| client_password | VM login password. Use "password" for OSWorld (Docker), "osworld-public-evaluation" for OSWorld (AWS Cloud), and "1234" for MacOSArena. WindowsAgentArena does not need a password. |
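
The constraints in the environment settings above are easy to get wrong, so it can help to check them before launching a run. The following is a minimal sketch, not part of the repository: the helper names (`split_vm_paths`, `check_proxy`) and the dictionary keys are hypothetical, and only the password values, the two-path rule for MacOSArena, and the proxy format come from the settings above.

```python
from urllib.parse import urlparse

# Passwords as listed in the environment settings; WindowsAgentArena needs none.
# The dictionary keys are hypothetical labels, not identifiers from the repository.
CLIENT_PASSWORDS = {
    "osworld_docker": "password",
    "osworld_aws": "osworld-public-evaluation",
    "macosarena": "1234",
    "waa": None,
}


def split_vm_paths(path_to_vm: str, benchmark: str) -> list[str]:
    """Split path_to_vm on whitespace and check the image count for the benchmark.

    Assumption: MacOSArena expects exactly two images, other benchmarks one.
    Paths that themselves contain spaces are not handled by this sketch.
    """
    paths = path_to_vm.split()
    expected = 2 if benchmark == "macosarena" else 1
    if len(paths) != expected:
        raise ValueError(
            f"{benchmark} expects {expected} image path(s), got {len(paths)}"
        )
    return paths


def check_proxy(proxy: str) -> tuple[str, int]:
    """Check that proxy looks like http://<ip>:<port> and return (host, port)."""
    parsed = urlparse(proxy)
    if parsed.scheme != "http" or not parsed.hostname or parsed.port is None:
        raise ValueError(f"proxy must look like http://<ip>:<port>, got {proxy!r}")
    return parsed.hostname, parsed.port
```

For example, `split_vm_paths("/path/to/mac_hdd_ng.img /path/to/BaseSystem.img", "macosarena")` returns the two image paths as a list, while passing a single path for MacOSArena raises a `ValueError`.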

🤖 Agent Settings

| Parameter | Description |
| --- | --- |
| xx_provider, xx_model, xx_url, xx_api_key, xx_temperature | Configuration for VLM inference (OpenAI-compatible API). We recommend using vLLM for open-source models. |
| coder_budget, searcher_budget | Maximum inner-loop iterations for the Coder and Searcher Agents, default is 20. |
| searcher_engine | Search engine provider. We recommend duckduckgo over Google to avoid CAPTCHA blocks. |
| memoryer_max_images | Maximum number of images retained in the Reflection-Memory Agent. |
| grounding_smart_resize | Enable for models requiring smart resizing (e.g., GTA1-32B, ScaleCUA series, UI-TARS-1.5). |
| orchestrator_keep_first_image | Whether to keep the initial screenshot in the context, default is True. |
| tool_config | Configuration for the action space, allowing dynamic assembly of tools. |
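
To illustrate what an inner-loop budget such as `coder_budget` or `searcher_budget` means in practice, here is a minimal sketch of a budget-capped loop. It is an assumption about the general pattern, not the repository's implementation: `run_with_budget` and the `step` callback are hypothetical names, and only the default of 20 iterations comes from the settings above.

```python
from typing import Callable, Optional


def run_with_budget(
    step: Callable[[int], Optional[str]], budget: int = 20
) -> tuple[Optional[str], int]:
    """Call step(i) until it returns a result or the budget is exhausted.

    Returns (result, iterations_used). A result of None means the budget
    ran out before the hypothetical tool agent finished.
    """
    for i in range(budget):
        result = step(i)
        if result is not None:
            return result, i + 1
    return None, budget


# Usage: a stand-in tool agent that finishes on its third iteration.
answer, used = run_with_budget(lambda i: "done" if i == 2 else None)
```

A larger budget gives a tool agent more room to recover from dead ends, at the cost of more model calls per outer step.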

🧪 Experiment Settings

| Parameter | Description |
| --- | --- |
| exp_name | Name of the experiment (defines the results directory). |
| enable_reflection | Whether to enable the Reflection-Memory Agent (RMA) module. |
| max_steps | Maximum number of steps allowed per task. |
| benchmark | Target benchmark. Supported values: osworld, waa, or macosarena. |

4. Visualization

Results are saved in results/{exp_name} and logs in logs/{exp_name}.log.
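
For a quick check without the Gradio interface, the output locations can be derived from `exp_name` and a success rate computed by walking the results directory. This is a rough sketch under a loud assumption: it supposes each task directory contains a `result.txt` file holding a single numeric score, which this README does not confirm. The helper names are hypothetical; adjust the filename and parsing to the files your run actually produces.

```python
from pathlib import Path


def experiment_paths(exp_name: str) -> tuple[Path, Path]:
    """Return (results directory, log file) following the layout stated above."""
    return Path("results") / exp_name, Path("logs") / f"{exp_name}.log"


def success_rate(results_dir: Path, filename: str = "result.txt") -> float:
    """Average the per-task scores found under results_dir.

    Assumption: every file named `filename` holds one float score.
    Files that cannot be parsed as a float are skipped.
    """
    scores = []
    for score_file in sorted(results_dir.rglob(filename)):
        try:
            scores.append(float(score_file.read_text().strip()))
        except ValueError:
            continue  # not a numeric score; ignore this file
    return sum(scores) / len(scores) if scores else 0.0
```

Calling `success_rate(experiment_paths("my_exp")[0])` would then average whatever scores exist under `results/my_exp`, returning 0.0 when none are found.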

To visualize the execution process and generate statistical reports, run the Gradio interface:

python gradio/gradio_show_result.py --root_dir results/{exp_name} --port 10000

Then open http://0.0.0.0:10000 in a browser to inspect the trajectory of each task.

✨ Features

  1. Unified Cross-Platform Evaluation: We decouple the agent logic from the OS environment, providing a unified interface to evaluate agents across Linux, Windows, and MacOS seamlessly.

  2. Enhanced Robustness: We have addressed numerous environment instability issues and bugs found in the original codebases of the supported benchmarks.

    Important: This repository includes modifications to the OSWorld environment. If you want a codebase identical to the official version for a fair comparison, please refer to our implementation submitted to the official OSWorld repository; alternatively, migrating it to our framework is straightforward. Please note that our official results were obtained using the official repository, while all results reported in the paper are based on the current repository.

  3. Extensibility: Support for defining more custom environments and tasks.

  4. Custom Workflows: A flexible architecture that lets you customize agent workflows and tool configurations.

We welcome the community to use our codebase for evaluating your own agents and tasks.

😊 Acknowledgement

The core implementation of OS-Symphony is based on the Agent S series codebase; we extend our special thanks to them for their exceptional design. We also express our sincere gratitude to other pioneering projects for their contributions to GUI automation, including OSWorld, WindowsAgentArena, MacOSArena, UI-TARS series, GTA1, ScaleCUA, etc.

📃 Citation

If you find this project useful in your research, please cite our paper:

@misc{yang2026ossymphony,
      title={OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agents},
      author={Bowen Yang and Kaiming Jin and Zhenyu Wu and Zhaoyang Liu and Qiushi Sun and Zehao Li and JingJing Xie and Zhoumianze Liu and Fangzhi Xu and Kanzhi Cheng and Qingyun Li and Yian Wang and Yu Qiao and Zun Wang and Zichen Ding},
      year={2026},
      eprint={2601.07779},
      archivePrefix={arXiv},
      primaryClass={cs.MA},
      url={https://arxiv.org/abs/2601.07779}, 
}
