Qwen-AgentWorld

📑 Technical Report | 📖 Blog | 🤗 Hugging Face | 🤖 ModelScope | 🖥️ Demo

Welcome to the GitHub repository of Qwen-AgentWorld. Here, you can find official information about Qwen-AgentWorld, post your questions (Issues), and share your ideas with the community (Discussions).

News

2026-06-24: We release Qwen-AgentWorld-35B-A3B and AgentWorldBench. Read more on our blog and the technical report.

Open-Source Release

We open-source Qwen-AgentWorld-35B-A3B (model weights) and AgentWorldBench (evaluation benchmark):

Release	Description
Qwen-AgentWorld-35B-A3B	Language world model (MoE, 35B total / 3B active, 256K context)
AgentWorldBench	Evaluation benchmark across 7 domains

The official weights and data are released on:

🤗 HuggingFace: Download automatically via model ID, e.g., Qwen/Qwen-AgentWorld-35B-A3B. You can also download model files manually using huggingface download or git clone. Please follow the instructions on the model page.
🤖 ModelScope: For users unable to access Hugging Face Hub. For supported frameworks, you can download from ModelScope by setting environment variables, such as SGLANG_USE_MODELSCOPE=true or VLLM_USE_MODELSCOPE=true.

Introduction

Qwen-AgentWorld is a native language world model that simulates agentic environments via long chain-of-thought reasoning across seven unified domains: MCP, Search, Terminal, SWE, Android, Web, and OS. It is trained through a three-stage pipeline -- CPT injects environment knowledge, SFT activates next-state-prediction reasoning, RL sharpens simulation fidelity -- on more than 10M real-world interaction trajectories. Unlike prior approaches that treat world modeling as a post-hoc add-on, Qwen-AgentWorld is a native world model: environment modeling is the training objective from the CPT stage onward.

Key features:

Seven Unified Domains. The first language world model to cover seven agent interaction domains within a single model.
Native World Model. Environment modeling from CPT onward, not post-hoc adaptation.
Generalizable, Scalable & Controllable Simulator. Zero-shot generalization to OOD environments (e.g., Claw Agent); controllable perturbations and fictional-world construction surpass real-environment training.
Agent Foundation Model. LWM RL warm-up on single-turn, non-agentic trajectories transfers to multi-turn, tool-calling agentic tasks across seven benchmarks, including three entirely out-of-domain.

Performance

Five-dimensional rubric mean (↑) per domain, normalized to 0--100 scale.

Model	MCP	Search	Term.	SWE	Android	Web	OS	Overall
GPT-5.4	70.10	37.26	53.69	66.29	60.00	51.80	68.58	58.25
Claude Opus 4.8	54.93	35.14	59.18	64.10	61.50	54.66	66.62	56.59
Claude Opus 4.6	69.90	29.30	57.51	64.55	61.74	51.42	70.20	57.80
Gemini 3.1 Pro	59.07	30.21	52.47	59.07	61.40	52.83	66.92	54.57
Claude Sonnet 4.6	70.00	28.79	56.98	64.52	58.03	50.78	63.17	56.04
DeepSeek-V4-Pro	63.27	27.61	51.26	59.44	55.17	50.32	63.70	52.97
GLM-5.1	67.60	22.46	47.32	52.07	59.10	51.50	59.13	51.31
Kimi K2.6	65.23	27.48	52.54	58.77	58.93	50.20	60.80	53.42
MiniMax-M2.7	55.82	27.30	41.62	37.44	52.40	50.52	57.73	46.12
Qwen3.5-35B-A3B	57.87	25.98	46.13	47.58	53.18	47.10	56.27	47.73
Qwen3.5-397B-A17B	68.31	30.81	55.30	64.44	54.90	48.55	60.85	54.74
Qwen3.6-Plus	55.28	21.94	50.58	59.08	57.65	50.78	60.33	50.81
Qwen-AgentWorld-35B-A3B	64.79	36.69	53.96	65.63	58.17	49.55	65.92	56.39
Qwen-AgentWorld-397B-A17B	68.24	37.82	57.73	68.49	60.20	50.98	67.89	58.71

Qwen-AgentWorld-397B-A17B achieves the highest overall score (58.71), outperforming all frontier proprietary models including GPT-5.4 (58.25). Qwen-AgentWorld-35B-A3B shows +8.66 improvement over Qwen3.5-35B-A3B without LWM training.

Applications

Generalizable Environment Scaling. Sim RL with Qwen-AgentWorld-397B-A17B on 4k out-of-distribution OpenClaw environments:

Model	Claw-Eval	QwenClawBench
Qwen3.5-35B-A3B	65.4	47.9
+ Sim RL (w/ Qwen3.6-Plus)	66.7	47.8
+ Sim RL (w/ Qwen-AgentWorld-397B-A17B)	69.7	55.0
Δ	+4.3	+7.1

Controllable Simulation: MCP. Environment Adaptation --- Control instructions inject targeted perturbations to expose agent weaknesses:

Model	Tool Decathlon	MCPMark
Qwen3.5-35B-A3B-SFT	32.4	21.5
+ Sim RL (uncontrolled)	31.5	24.6
+ Sim RL (controlled)	36.1	33.8
Δ	+3.7	+12.3

Controllable Simulation: Search. Fictional-World Construction -- agents trained in fully invented, self-consistent worlds generalize to real search tasks:

Model	WideSearch F1 Item	WideSearch F1 Row
Qwen3.5-35B-A3B-SFT	34.02	13.72
+ Sim RL (controlled)	50.31	24.21
Δ	+16.29	+10.49

Qwen3.5-397B-A17B-SFT	70.11	45.69
+ Sim RL (controlled)	73.98	51.74
Δ	+3.87	+6.05

Agent Foundation Model. LWM RL warm-up on single-turn, non-agentic trajectories transfers to multi-turn, tool-calling agentic tasks:

	Terminal-Bench 2.0	SWE-Bench Verified	SWE-Bench Pro	WideSearch F1 Item	Claw-Eval	QwenClawBench	BFCL v4
	In Domain				Out of Domain
Qwen3.5-35B-A3B-SFT	33.25	64.47	42.18	33.38	53.60	39.76	62.29
w/ LWM RL	39.55	67.86	47.42	46.17	64.88	49.43	71.25
Δ	+6.30	+3.39	+5.24	+12.79	+11.28	+9.67	+8.96

For detailed results, please check out the blog and the technical report.

Quickstart

Deployment

Qwen-AgentWorld-35B-A3B is supported by multiple inference frameworks. Here we demonstrate the usage of SGLang and vLLM.

SGLang

SGLang is a fast serving framework for large language models.

python -m sglang.launch_server \
    --model-path Qwen/Qwen-AgentWorld-35B-A3B \
    --port 8000 \
    --tensor-parallel-size 4 \
    --context-length 262144 \
    --reasoning-parser qwen3

An OpenAI-compatible API will be available at http://localhost:8000/v1.

vLLM

vLLM is a high-throughput and memory-efficient inference engine for LLMs.

vllm serve Qwen/Qwen-AgentWorld-35B-A3B \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --reasoning-parser qwen3 \
    --language-model-only \
    --trust-remote-code

Tip

The --language-model-only flag is required because the model architecture includes visual component definitions but the checkpoint only contains language model weights. Without this flag, vLLM will attempt to initialize visual modules and fail.

An OpenAI-compatible API will be available at http://localhost:8000/v1.

Inference with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen-AgentWorld-35B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": "You are a language world model simulating a Linux terminal environment. "
                   "Given the user's command, predict the terminal output."
    },
    {
        "role": "user",
        "content": "Action: execute_bash\nCommand: ls -la /home/user/project/"
    }
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.6)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

Note

We provide domain-specific world model system prompt templates in prompts/ for all seven domains. These can serve as general-purpose system prompts when using Qwen-AgentWorld as an environment simulator. Each domain folder contains a system_prompt.txt (world model system prompt) and a judge_system_prompt.txt (evaluation prompt).

Evaluate on AgentWorldBench

AgentWorldBench evaluates language world models by scoring each predicted environment observation on five dimensions: Format, Factuality, Consistency, Realism, and Quality.

Overview of AgentWorldBench: domain distribution, source benchmarks, evaluation dimensions, and per-domain trajectory statistics.

Setup

# Download the benchmark
huggingface-cli download Qwen/AgentWorldBench --repo-type dataset --local-dir ./AgentWorldBench

# Install dependencies
pip install openai

Data Format

AgentWorldBench consists of per-domain JSONL files (mcp_test.jsonl, search_test.jsonl, terminal_test.jsonl, swe_test.jsonl, android_test.jsonl, web_test.jsonl, os_test.jsonl).

{
    "task": "mcp",
    "id": 145256090131919,
    "prompt": ["### Turn 1\n**Action:**\n```json\n{...}\n```\n..."],
    "response": ["**Environment Observation:**\n{...}"],
    "current_prompt": "### Turn 1\n**Action:**\n...",
    "system_str": "# Role and Objective\n\nYou are a **Tool World Model** ...",
    "turn_idx": 1,
    "total_turns": 5
}

Key fields:

system_str: The world model system prompt for this specific sample. Each sample carries its own system prompt, so the prompts provided in the prompts/ directory of this repo are templates for reference only.
prompt / response: Lists of all turns in the trajectory (action prompts and ground-truth environment observations).
current_prompt: The action prompt for the turn being evaluated.
turn_idx: 1-indexed position of the current turn.

Run Evaluation

We provide a standalone evaluation script (eval/eval.py) that uses the OpenAI-compatible API for both world model inference and LLM judge scoring. The evaluation follows a three-step pipeline:

cd eval

# Step 1: Run world model inference
python eval.py infer \
    --data-dir ../AgentWorldBench \
    --model-base-url http://localhost:8000/v1 \
    --model-name Qwen/Qwen-AgentWorld-35B-A3B \
    --output-dir ./results

# Step 2: Run LLM judge scoring
export OPENAI_API_KEY="your-api-key"
python eval.py judge \
    --predictions ./results/predictions.jsonl \
    --judge-base-url https://api.openai.com/v1 \
    --judge-model gpt-5.2-2025-12-11 \
    --output-dir ./results

# Step 3: Aggregate and display scores
python eval.py score --predictions ./results/judged.jsonl

The judge prompts used for scoring are located in prompts/{domain}/judge_system_prompt.txt. The world model system prompts in prompts/{domain}/system_prompt.txt are provided as reference templates; during evaluation, the system prompt from each sample's system_str field is used instead.

Evaluation Output

The score command outputs per-domain and overall results:

======================================================================
AgentWorldBench Evaluation Results (example output)
======================================================================

--- MCP (286/286 valid, 0 failed) ---
         format: 81.46
     factuality: 68.75
    consistency: 72.92
       realism: 71.88
        quality: 67.08
    total_score: 72.42
...

======================================================================
Overall: 56.39
======================================================================

Finetuning

We advise you to use training frameworks, including Swift, Llama-Factory, UnSloth, etc., to finetune the model on domain-specific environment data.

License Agreement

All open-weight models and AgentWorldBench are licensed under Apache 2.0. You can find the license files in the respective Hugging Face repositories.

Citation

If you find our work helpful, feel free to give us a cite.

@article{zuo2026qwen,
  title={Qwen-agentworld: language world models for general agents},
  author={Zuo, Yuxin and Xiao, Zikai and Sheng, Li and Huang, Fei and Tu, Jianhong and Liu, Yuxuan and Tang, Tianyi and Hu, Xiaomeng and Su, Yang and Lan, Qingfeng and others},
  journal={arXiv preprint arXiv:2606.24597},
  year={2026}
}

Contact Us

If you are interested to leave a message to either our research team or product team, join our Discord or WeChat groups!

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github		.github
assets		assets
eval		eval
prompts		prompts
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Qwen-AgentWorld

News

Open-Source Release

Introduction

Performance

Applications

Quickstart

Deployment

SGLang

vLLM

Inference with Transformers

Evaluate on AgentWorldBench

Setup

Data Format

Run Evaluation

Evaluation Output

Finetuning

License Agreement

Citation

Contact Us

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Qwen-AgentWorld

News

Open-Source Release

Introduction

Performance

Applications

Quickstart

Deployment

SGLang

vLLM

Inference with Transformers

Evaluate on AgentWorldBench

Setup

Data Format

Run Evaluation

Evaluation Output

Finetuning

License Agreement

Citation

Contact Us

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages