Skip to content

QwenLM/Qwen-AgentWorld

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Qwen-AgentWorld

Qwen-AgentWorld

Welcome to the GitHub repository of Qwen-AgentWorld. Here, you can find official information about Qwen-AgentWorld, post your questions (Issues), and share your ideas with the community (Discussions).

News

  • 2026-06-24: We release Qwen-AgentWorld-35B-A3B and AgentWorldBench. Read more on our blog and the technical report.

Open-Source Release

We open-source Qwen-AgentWorld-35B-A3B (model weights) and AgentWorldBench (evaluation benchmark):

Release Description
Qwen-AgentWorld-35B-A3B Language world model (MoE, 35B total / 3B active, 256K context)
AgentWorldBench Evaluation benchmark across 7 domains

The official weights and data are released on:

  • 🤗 HuggingFace: Download automatically via model ID, e.g., Qwen/Qwen-AgentWorld-35B-A3B. You can also download model files manually using huggingface download or git clone. Please follow the instructions on the model page.
  • 🤖 ModelScope: For users unable to access Hugging Face Hub. For supported frameworks, you can download from ModelScope by setting environment variables, such as SGLANG_USE_MODELSCOPE=true or VLLM_USE_MODELSCOPE=true.

Introduction

Qwen-AgentWorld Overview

Qwen-AgentWorld is a native language world model that simulates agentic environments via long chain-of-thought reasoning across seven unified domains: MCP, Search, Terminal, SWE, Android, Web, and OS. It is trained through a three-stage pipeline -- CPT injects environment knowledge, SFT activates next-state-prediction reasoning, RL sharpens simulation fidelity -- on more than 10M real-world interaction trajectories. Unlike prior approaches that treat world modeling as a post-hoc add-on, Qwen-AgentWorld is a native world model: environment modeling is the training objective from the CPT stage onward.

Key features:

  • Seven Unified Domains. The first language world model to cover seven agent interaction domains within a single model.
  • Native World Model. Environment modeling from CPT onward, not post-hoc adaptation.
  • Generalizable, Scalable & Controllable Simulator. Zero-shot generalization to OOD environments (e.g., Claw Agent); controllable perturbations and fictional-world construction surpass real-environment training.
  • Agent Foundation Model. LWM RL warm-up on single-turn, non-agentic trajectories transfers to multi-turn, tool-calling agentic tasks across seven benchmarks, including three entirely out-of-domain.

Performance

AgentWorldBench Results

Five-dimensional rubric mean (↑) per domain, normalized to 0--100 scale.

Model MCP Search Term. SWE Android Web OS Overall
GPT-5.4 70.10 37.26 53.69 66.29 60.00 51.80 68.58 58.25
Claude Opus 4.8 54.93 35.14 59.18 64.10 61.50 54.66 66.62 56.59
Claude Opus 4.6 69.90 29.30 57.51 64.55 61.74 51.42 70.20 57.80
Gemini 3.1 Pro 59.07 30.21 52.47 59.07 61.40 52.83 66.92 54.57
Claude Sonnet 4.6 70.00 28.79 56.98 64.52 58.03 50.78 63.17 56.04
DeepSeek-V4-Pro 63.27 27.61 51.26 59.44 55.17 50.32 63.70 52.97
GLM-5.1 67.60 22.46 47.32 52.07 59.10 51.50 59.13 51.31
Kimi K2.6 65.23 27.48 52.54 58.77 58.93 50.20 60.80 53.42
MiniMax-M2.7 55.82 27.30 41.62 37.44 52.40 50.52 57.73 46.12
Qwen3.5-35B-A3B 57.87 25.98 46.13 47.58 53.18 47.10 56.27 47.73
Qwen3.5-397B-A17B 68.31 30.81 55.30 64.44 54.90 48.55 60.85 54.74
Qwen3.6-Plus 55.28 21.94 50.58 59.08 57.65 50.78 60.33 50.81
Qwen-AgentWorld-35B-A3B 64.79 36.69 53.96 65.63 58.17 49.55 65.92 56.39
Qwen-AgentWorld-397B-A17B 68.24 37.82 57.73 68.49 60.20 50.98 67.89 58.71

Qwen-AgentWorld-397B-A17B achieves the highest overall score (58.71), outperforming all frontier proprietary models including GPT-5.4 (58.25). Qwen-AgentWorld-35B-A3B shows +8.66 improvement over Qwen3.5-35B-A3B without LWM training.

Applications

Generalizable Environment Scaling. Sim RL with Qwen-AgentWorld-397B-A17B on 4k out-of-distribution OpenClaw environments:

Model Claw-Eval QwenClawBench
Qwen3.5-35B-A3B 65.4 47.9
+ Sim RL (w/ Qwen3.6-Plus) 66.7 47.8
+ Sim RL (w/ Qwen-AgentWorld-397B-A17B) 69.7 55.0
Δ +4.3 +7.1

Controllable Simulation: MCP. Environment Adaptation --- Control instructions inject targeted perturbations to expose agent weaknesses:

Model Tool Decathlon MCPMark
Qwen3.5-35B-A3B-SFT 32.4 21.5
+ Sim RL (uncontrolled) 31.5 24.6
+ Sim RL (controlled) 36.1 33.8
Δ +3.7 +12.3

Controllable Simulation: Search. Fictional-World Construction -- agents trained in fully invented, self-consistent worlds generalize to real search tasks:

Model WideSearch F1 Item WideSearch F1 Row
Qwen3.5-35B-A3B-SFT 34.02 13.72
+ Sim RL (controlled) 50.31 24.21
Δ +16.29 +10.49
Qwen3.5-397B-A17B-SFT 70.11 45.69
+ Sim RL (controlled) 73.98 51.74
Δ +3.87 +6.05

Agent Foundation Model. LWM RL warm-up on single-turn, non-agentic trajectories transfers to multi-turn, tool-calling agentic tasks:

Terminal-Bench 2.0 SWE-Bench Verified SWE-Bench Pro WideSearch F1 Item Claw-Eval QwenClawBench BFCL v4
In Domain Out of Domain
Qwen3.5-35B-A3B-SFT 33.25 64.47 42.18 33.38 53.60 39.76 62.29
w/ LWM RL 39.55 67.86 47.42 46.17 64.88 49.43 71.25
Δ +6.30 +3.39 +5.24 +12.79 +11.28 +9.67 +8.96

For detailed results, please check out the blog and the technical report.

Quickstart

Deployment

Qwen-AgentWorld-35B-A3B is supported by multiple inference frameworks. Here we demonstrate the usage of SGLang and vLLM.

SGLang

SGLang is a fast serving framework for large language models.

python -m sglang.launch_server \
    --model-path Qwen/Qwen-AgentWorld-35B-A3B \
    --port 8000 \
    --tensor-parallel-size 4 \
    --context-length 262144 \
    --reasoning-parser qwen3

An OpenAI-compatible API will be available at http://localhost:8000/v1.

vLLM

vLLM is a high-throughput and memory-efficient inference engine for LLMs.

vllm serve Qwen/Qwen-AgentWorld-35B-A3B \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --reasoning-parser qwen3 \
    --language-model-only \
    --trust-remote-code

Tip

The --language-model-only flag is required because the model architecture includes visual component definitions but the checkpoint only contains language model weights. Without this flag, vLLM will attempt to initialize visual modules and fail.

An OpenAI-compatible API will be available at http://localhost:8000/v1.

Inference with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen-AgentWorld-35B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": "You are a language world model simulating a Linux terminal environment. "
                   "Given the user's command, predict the terminal output."
    },
    {
        "role": "user",
        "content": "Action: execute_bash\nCommand: ls -la /home/user/project/"
    }
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.6)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Note

We provide domain-specific world model system prompt templates in prompts/ for all seven domains. These can serve as general-purpose system prompts when using Qwen-AgentWorld as an environment simulator. Each domain folder contains a system_prompt.txt (world model system prompt) and a judge_system_prompt.txt (evaluation prompt).

Evaluate on AgentWorldBench

AgentWorldBench evaluates language world models by scoring each predicted environment observation on five dimensions: Format, Factuality, Consistency, Realism, and Quality.

AgentWorldBench Overview

Overview of AgentWorldBench: domain distribution, source benchmarks, evaluation dimensions, and per-domain trajectory statistics.

Setup

# Download the benchmark
huggingface-cli download Qwen/AgentWorldBench --repo-type dataset --local-dir ./AgentWorldBench

# Install dependencies
pip install openai

Data Format

AgentWorldBench consists of per-domain JSONL files (mcp_test.jsonl, search_test.jsonl, terminal_test.jsonl, swe_test.jsonl, android_test.jsonl, web_test.jsonl, os_test.jsonl).

{
    "task": "mcp",
    "id": 145256090131919,
    "prompt": ["### Turn 1\n**Action:**\n```json\n{...}\n```\n..."],
    "response": ["**Environment Observation:**\n{...}"],
    "current_prompt": "### Turn 1\n**Action:**\n...",
    "system_str": "# Role and Objective\n\nYou are a **Tool World Model** ...",
    "turn_idx": 1,
    "total_turns": 5
}

Key fields:

  • system_str: The world model system prompt for this specific sample. Each sample carries its own system prompt, so the prompts provided in the prompts/ directory of this repo are templates for reference only.
  • prompt / response: Lists of all turns in the trajectory (action prompts and ground-truth environment observations).
  • current_prompt: The action prompt for the turn being evaluated.
  • turn_idx: 1-indexed position of the current turn.

Run Evaluation

We provide a standalone evaluation script (eval/eval.py) that uses the OpenAI-compatible API for both world model inference and LLM judge scoring. The evaluation follows a three-step pipeline:

cd eval

# Step 1: Run world model inference
python eval.py infer \
    --data-dir ../AgentWorldBench \
    --model-base-url http://localhost:8000/v1 \
    --model-name Qwen/Qwen-AgentWorld-35B-A3B \
    --output-dir ./results

# Step 2: Run LLM judge scoring
export OPENAI_API_KEY="your-api-key"
python eval.py judge \
    --predictions ./results/predictions.jsonl \
    --judge-base-url https://api.openai.com/v1 \
    --judge-model gpt-5.2-2025-12-11 \
    --output-dir ./results

# Step 3: Aggregate and display scores
python eval.py score --predictions ./results/judged.jsonl

The judge prompts used for scoring are located in prompts/{domain}/judge_system_prompt.txt. The world model system prompts in prompts/{domain}/system_prompt.txt are provided as reference templates; during evaluation, the system prompt from each sample's system_str field is used instead.

Evaluation Output

The score command outputs per-domain and overall results:

======================================================================
AgentWorldBench Evaluation Results (example output)
======================================================================

--- MCP (286/286 valid, 0 failed) ---
         format: 81.46
     factuality: 68.75
    consistency: 72.92
       realism: 71.88
        quality: 67.08
    total_score: 72.42
...

======================================================================
Overall: 56.39
======================================================================

Finetuning

We advise you to use training frameworks, including Swift, Llama-Factory, UnSloth, etc., to finetune the model on domain-specific environment data.

License Agreement

All open-weight models and AgentWorldBench are licensed under Apache 2.0. You can find the license files in the respective Hugging Face repositories.

Citation

If you find our work helpful, feel free to give us a cite.

@article{zuo2026qwen,
  title={Qwen-agentworld: language world models for general agents},
  author={Zuo, Yuxin and Xiao, Zikai and Sheng, Li and Huang, Fei and Tu, Jianhong and Liu, Yuxuan and Tang, Tianyi and Hu, Xiaomeng and Su, Yang and Lan, Qingfeng and others},
  journal={arXiv preprint arXiv:2606.24597},
  year={2026}
}

Contact Us

If you are interested to leave a message to either our research team or product team, join our Discord or WeChat groups!

About

Qwen-AgentWorld: Language World Models for General Agents

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages