Orchestra-o1: Omnimodal Agent Orchestration

License: MIT | Python 3.10+ | Framework: verl

Orchestra-o1 is a multi-agent orchestration framework that decomposes complex omnimodal tasks into parallel subtasks, delegating them to specialized SubAgents for execution. It features a MainAgent + SubAgent architecture where the MainAgent (orchestrator) plans and coordinates, while SubAgents execute specific subtasks using various tools.

📰 News

  • [2026.06] We uploaded the arXiv paper [📄 arXiv] and released the Orchestra-o1-8B model weights [🤗 Model].
  • [2026.05] Orchestra-o1 achieves 72.8% accuracy on the OmniGAIA benchmark, ranking 1st and outperforming the second-best method by 10.3 percentage points.

🏗️ Architecture

Orchestra-o1 Framework

Figure 1. Overview of the Orchestra-o1 framework. The MainAgent orchestrates multi-turn interactions by decomposing omnimodal tasks into independent/dependent subtasks, creating specialized SubAgents with perception tools (image, audio, video analysis) and action tools (web search, page visit, code execution), and executing them in parallel. An online sub-agent specialization module handles sub-task preparation, model selection, tool integration, and memory allocation.

📄 Paper

The paper is available at: [📄 arXiv]

📦 Model Weights

The trained Orchestra-o1-8B model weights are available at: [🤗 Model]

✨ Key Features

  • 🎯 Hierarchical Multi-Agent Architecture: MainAgent orchestrates task decomposition; SubAgents execute subtasks with specialized tools
  • ⚑ Parallel Subtask Execution: Independent subtasks run simultaneously, maximizing throughput
  • πŸ”§ Rich Tool Ecosystem: Web search, code execution, video/audio/image analysis, URL extraction
  • 🧠 GRPO Training: Train open-source models (Qwen3-8B) as MainAgent using Group Relative Policy Optimization with LLM-as-judge reward
  • πŸ“Š OmniGAIA Benchmark: Comprehensive evaluation on omnimodal question-answering tasks

📁 Project Structure

Orchestra-o1/
├── bench_orchestra_o1_omnigaia.py    # 🚀 Inference with commercial models (e.g., GPT-5)
├── bench_qwen/                       # 🚀 Inference with trained open-source models
│   ├── run_qwen3_8b_grpo.sh          #    One-click: vLLM + benchmark (Qwen3-8B GRPO)
│   ├── bench_qwen_omnigaia.py        #    Benchmark runner for Qwen3-8B
│   ├── eval_qwen.py                  #    Evaluation report generator
│   ├── model_config_qwen.yaml        #    Model config (local vLLM + commercial APIs)
│   └── orchestra_o1_omnigaia_qwen_grpo.yaml  # Benchmark config
├── benchmark/                        #    Benchmark framework
│   ├── common/                       #    Runner, environment abstractions
│   └── omnigaia/                     #    OmniGAIA benchmark implementation & tools
├── base/                             #    Base framework
│   ├── agent/                        #    Agent abstractions (BaseAgent, Memory, ReAct)
│   └── engine/                       #    LLM engine, cost monitoring, logging
├── orchestra_o1/                     # 🎵 Core orchestration framework
│   ├── main_agent.py                 #    MainAgent (orchestrator)
│   ├── config.py                     #    Configuration loader
│   ├── prompts/                      #    Prompt templates (OmniGAIA)
│   ├── runners/                      #    Benchmark runners
│   ├── subagents/                    #    SubAgent implementations (ReAct)
│   └── tools/                        #    Orchestration tools (delegate, complete, trace)
├── train_qwen3_8b/                   # 🏋️ GRPO Training
│   ├── grpo/                         #    GRPO training pipeline
│   │   ├── train_grpo_qwen3_8b.sh    #    Training script (8×H20 GPUs)
│   │   └── reward_fn.py              #    LLM-as-judge multi-dimensional reward
│   └── ds_config.json                #    DeepSpeed config
├── config/                           #    Configuration files
│   ├── model_config.yaml             #    LLM API configuration
│   └── benchmarks/                   #    Benchmark configurations
├── eval/                             #    Evaluation scripts
├── requirements.txt                  #    Python dependencies
└── README.md

🚀 Quick Start

1. Installation

git clone https://github.com/zfkarl/Orchestra-o1.git
cd Orchestra-o1

# Install dependencies
pip install -r requirements.txt

# Create .env file and fill in your API keys (see "Environment Variables" section below)
touch .env

2. Configure Models

Edit config/model_config.yaml with your LLM API credentials:

models:
  "gpt-5":
    api_type: "openai"
    base_url: "https://api.openai.com/v1/"
    api_key: "your_api_key_here"

3. Prepare Dataset

Download the OmniGAIA dataset and place it under data/OmniGAIA/.

4. Run Inference

Mode A: Commercial Model (GPT-5) as MainAgent

python bench_orchestra_o1_omnigaia.py --config config/benchmarks/orchestra_o1_omnigaia.yaml

Mode B: Trained Open-Source Model (Qwen3-8B GRPO) as MainAgent

# One-click: starts vLLM server → waits for ready → runs benchmark → generates report
VLLM_MODEL_PATH=/path/to/your/grpo/checkpoint/huggingface \
CUDA_VISIBLE_DEVICES=0 \
bash bench_qwen/run_qwen3_8b_grpo.sh

🏋️ GRPO Training

Train Qwen3-8B as the MainAgent using GRPO (Group Relative Policy Optimization) with an LLM-as-judge reward function.

Orchestra-o1-8B Training Pipeline

Figure 2. Orchestra-o1-8B training pipeline. (a) Training Data Curation: Starting from seed data, we run Orchestra-o1 (GPT-5) to collect trajectories, extract anchor facts across modalities, apply QA rewrites (pivot swapping, temporal shifting, etc.), and filter & verify to produce 1.2K high-quality training samples. (b) DA-GRPO Training: We reconstruct decision examples from expert trajectories, sample G candidate decisions from the base model (Qwen3-8B), score each on 4 dimensions via a rubric reward (format, action, tool, decision quality), compute relative advantages, and optimize with DA-GRPO to produce Orchestra-o1-8B.
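
The "compute relative advantages" step in the caption can be illustrated with the standard GRPO normalization, where each candidate's reward is compared against the mean and standard deviation of its own group. This is only a sketch of plain GRPO: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` term are assumptions, and the DA-GRPO variant used for Orchestra-o1-8B may differ in details not described here.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own group.

    Hypothetical helper, not part of the repository. `rewards` holds the
    rubric scores of the G candidate decisions sampled for one prompt.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two candidates per group")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    # eps avoids division by zero when all candidates score the same.
    return [(r - mean) / (std + eps) for r in rewards]
```

Candidates that score above their group mean receive a positive advantage and those below receive a negative one; a group in which every candidate gets the same reward yields all-zero advantages and so contributes no learning signal.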

Prerequisites

  • Hardware: 8× H20 (96GB) GPUs (single node)
  • Software: verl framework
  • Data: GPT-5 expert trajectories from OmniGAIA benchmark

Training Pipeline

cd train_qwen3_8b/grpo

# Launch GRPO training
VERL_DIR=/path/to/verl MODEL_PATH=/path/to/Qwen3-8B \
bash train_grpo_qwen3_8b.sh

Reward Function

The reward function (reward_fn.py) uses an LLM-as-judge (claude-haiku-4-5) to score four dimensions:

| Dimension | Weight | Range | Description |
|---|---|---|---|
| format_correct | 0.10 | {0, 1} | JSON format correctness |
| action_valid | 0.10 | {0, 1} | Action validity (delegate_task / complete) |
| tool_reasonable | 0.20 | [0, 1] | Tool selection & subtask assignment quality |
| decision_quality | 0.60 ★ | [0, 1] | Overall decision quality (references GPT-5 expert) |

score = 0.10 × format + 0.10 × action + 0.20 × tool + 0.60 × decision ∈ [0, 1]
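
As a minimal sketch of how the four judge scores combine into one scalar, the weighted sum can be written as below. Only the weights and value ranges come from the table; the name `combine_reward`, the dictionary layout, and the validation behavior are illustrative assumptions and not the actual interface of reward_fn.py.

```python
# Hypothetical sketch: names are illustrative, not taken from reward_fn.py.
WEIGHTS = {
    "format_correct": 0.10,
    "action_valid": 0.10,
    "tool_reasonable": 0.20,
    "decision_quality": 0.60,
}
# These two dimensions are binary ({0, 1}); the other two are in [0, 1].
BINARY_DIMS = {"format_correct", "action_valid"}


def combine_reward(scores: dict[str, float]) -> float:
    """Return the weighted sum of the four judge scores, a value in [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = float(scores[name])
        if name in BINARY_DIMS and value not in (0.0, 1.0):
            raise ValueError(f"{name} must be 0 or 1, got {value}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
        total += weight * value
    return total
```

For example, a well-formed, valid decision with `tool_reasonable = 0.5` and `decision_quality = 0.75` scores approximately 0.10 + 0.10 + 0.10 + 0.45 = 0.75. Because decision quality carries 60% of the weight, a correctly formatted but poor decision can earn at most 0.40.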

MainAgent Decision Flow

  1. Analyze the question and omnimodal inputs
  2. Decompose into independent subtasks (Phase 1)
  3. Delegate all independent subtasks in parallel
  4. Evaluate results: if sufficient, complete; if more is needed, plan Phase 2
  5. Iterate until answer is found or budget exhausted
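
The steps above can be sketched as a plan/delegate loop. This is an illustration under stated assumptions, not the code in main_agent.py: `orchestrate`, `plan`, `run_subagent`, and `max_phases` are hypothetical names, the planner is modeled as a plain callable that returns the next batch of independent subtasks (or an empty list once the results suffice), and a thread pool stands in for whatever parallel execution mechanism the framework actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def orchestrate(
    question: str,
    plan: Callable[[str, dict[str, str]], list[str]],
    run_subagent: Callable[[str], str],
    max_phases: int = 3,
) -> dict[str, str]:
    """Run plan/delegate phases until the planner returns no new subtasks.

    Hypothetical sketch: `plan` sees the question plus all results so far
    and returns the next independent subtasks; `run_subagent` executes one.
    """
    results: dict[str, str] = {}
    for _ in range(max_phases):  # budget on the number of phases
        subtasks = plan(question, results)
        if not subtasks:  # planner judges the results sufficient
            break
        # Subtasks within one phase are independent, so run them in parallel.
        with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
            outputs = list(pool.map(run_subagent, subtasks))
        results.update(zip(subtasks, outputs))
    return results
```

Dependent subtasks fit this shape naturally: the planner withholds them until the results they depend on appear in `results`, which places them in a later phase.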

SubAgent (ReAct)

Each SubAgent follows the ReAct (Reasoning + Acting) paradigm:

  • Think: Reason about the current state
  • Act: Use tools (search, code execution, media analysis)
  • Observe: Process tool outputs
  • Repeat until task is complete
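
A generic Think/Act/Observe loop might look like the following sketch. It is not the SubAgent implementation in orchestra_o1/subagents/: `react_loop`, the `think` callable, the `"final"` action name, and `max_steps` are all assumptions made for illustration, with the LLM call abstracted behind `think`.

```python
from typing import Callable, Optional

Step = tuple[str, str, str]  # (action, argument, observation)


def react_loop(
    task: str,
    think: Callable[[str, list[Step]], tuple[str, str]],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Optional[str]:
    """Hypothetical ReAct loop.

    `think` returns ("final", answer) to stop, or (tool_name, tool_input)
    to request a tool call. Returns None if the step budget runs out.
    """
    history: list[Step] = []
    for _ in range(max_steps):
        action, arg = think(task, history)  # Think
        if action == "final":
            return arg
        if action in tools:
            observation = tools[action](arg)  # Act
        else:
            observation = f"unknown tool: {action}"
        history.append((action, arg, observation))  # Observe
    return None
```

Feeding an "unknown tool" message back as an observation, instead of raising, gives the model a chance to correct its tool choice on the next step.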

📊 Evaluation

After running the benchmark, generate evaluation reports:

# For commercial model results
python eval/eval.py --main_agent gpt-5

# For Qwen3-8B GRPO results
python bench_qwen/eval_qwen.py --csv_path logs/omnigaia_qwen_8b_grpo/omnigaia_qwen_xxx.csv --main_agent qwen3-8b-grpo

🔧 Configuration

Environment Variables (.env)

Create a .env file in the project root directory with the following content:

# ======== OmniGAIA Benchmark Tools ========
# Jina API - used for web content extraction (ExtractUrlContentAction)
JINA_API_KEY=your_jina_api_key_here

# Serper API - used for Google Search (GoogleSearchAction)
SERPER_API_KEY=your_serper_api_key_here
SERPER_BASE_URL=https://google.serper.dev/search

# ======== LLM API Config ========
# Can also be configured in config/model_config.yaml
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1/

| Variable | Description |
|---|---|
| OPENAI_API_KEY | OpenAI API key (for GPT-5, GPT-4o, etc.) |
| OPENAI_BASE_URL | OpenAI API base URL |
| JINA_API_KEY | Jina API key (for web content extraction) |
| SERPER_API_KEY | Serper API key (for Google search) |
| SERPER_BASE_URL | Serper API base URL |
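
Before a long benchmark run, it can help to confirm that the .env file is filled in. The sketch below is a standalone check using only the standard library; it is not part of the repository, and `parse_env`, `missing_keys`, and `EXPECTED_KEYS` are hypothetical names. It assumes simple `KEY=value` lines and treats values still starting with `your_` (the placeholder style used above) as unset. The OpenAI variables are listed even though they can be configured in config/model_config.yaml instead, so adjust the list to your setup.

```python
# Hypothetical helper, not shipped with Orchestra-o1.
EXPECTED_KEYS = [
    "OPENAI_API_KEY",
    "OPENAI_BASE_URL",
    "JINA_API_KEY",
    "SERPER_API_KEY",
    "SERPER_BASE_URL",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return expected keys that are absent, empty, or still placeholders."""
    return [
        key
        for key in EXPECTED_KEYS
        if not values.get(key) or values[key].startswith("your_")
    ]
```

A typical use is to read the file with `pathlib.Path(".env").read_text()`, pass the text to `parse_env`, and stop early if `missing_keys` returns a non-empty list.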

Key Training Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| TRAIN_BSZ | 24 | Training batch size |
| ROLLOUT_N | 8 | GRPO group size |
| MAX_PROMPT_LEN | 24576 | Max prompt length (tokens) |
| MAX_RESP_LEN | 4096 | Max response length (tokens) |
| LR | 5e-6 | Learning rate |
| EPOCHS | 5 | Number of training epochs |
| TP_SIZE | 8 | vLLM tensor parallel size |

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • verl: Reinforcement learning framework for LLM training
  • vLLM: High-throughput LLM serving engine
  • OmniGAIA: Omnimodal benchmark dataset

📚 Citation

If you find Orchestra-o1 useful, please cite our paper:

@misc{zhang2026orchestrao1omnimodalagentorchestration,
      title={Orchestra-o1: Omnimodal Agent Orchestration}, 
      author={Fan Zhang and Vireo Zhang and Shengju Qian and Haoxuan Li and Hao Wu and Jinyang Wu and Donghao Zhou and Zhihong Zhu and Zheng Lian and Xin Wang and Pheng-Ann Heng},
      year={2026},
      eprint={2606.13707},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.13707}, 
}
