Yuxia Fu*, Zhizhen Zhang*, Yuqi Zhang, Zijian Wang, Zi Huang, Yadan Luo
This repository provides the official implementation of MergeVLA.
Paper | Project Page | HuggingFace

2026/02/21: Our paper has been accepted to CVPR 2026.

## Abstract
Recent Vision-Language-Action (VLA) models reformulate vision-language models by tuning them with millions of robotic demonstrations. While they perform well when fine-tuned for a single embodiment or task family, extending them to multi-skill settings remains challenging: directly merging VLA experts trained on different tasks results in near-zero success rates. This raises a fundamental question: what prevents VLAs from mastering multiple skills within one model? In this work, we identify two key sources of non-mergeability: (1) LoRA adapters in the VLM drift toward divergent, task-specific directions during fine-tuning, and (2) self-attention in action experts creates inter-block dependencies that prevent modular recomposition.
MergeVLA addresses these issues with a merging-oriented architecture that preserves mergeability across tasks. It employs sparsely activated LoRA adapters via task masks to reduce irreconcilable conflicts in the VLM, and applies cross-attention-only action experts to keep specialization localized. A task router selects the appropriate mask and expert head from the initial observation, enabling unsupervised task inference.
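To make the first idea concrete, the sketch below shows one way a task mask can gate a LoRA adapter so that different tasks occupy disjoint rank directions. This is a minimal PyTorch mock-up under our own assumptions (class name, dimensions, masking scheme), not the repository's implementation.

```python
# Minimal sketch of sparsely activated LoRA via a task mask
# (names and dimensions are illustrative assumptions).
import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)  # stands in for a frozen VLM weight
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_a = nn.Linear(in_dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_dim, bias=False)

    def forward(self, x, task_mask):
        # The mask zeroes rank directions not assigned to this task, so
        # different tasks update (and later merge) disjoint LoRA subspaces.
        return self.base(x) + self.lora_b(self.lora_a(x) * task_mask)

layer = MaskedLoRALinear(512, 512)
mask = (torch.arange(16) < 8).float()  # this task owns the first 8 of 16 ranks
y = layer(torch.randn(2, 512), mask)
```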
- Abstract
- Quick Start
- Data Preparation
- Training
- Model Merging
- Evaluation
- Citation
- Acknowledgment
## Quick Start

```bash
# Create and activate conda environment
conda create -n mergevla python=3.10.16 -y
conda activate mergevla
# Install PyTorch
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0
# Install necessary packages
pip install packaging ninja
ninja --version; echo $? # Should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation
pip install git+https://github.com/moojink/dlimp_openvla
# Install LIBERO
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .
pip install -r experiments/robot/libero/libero_requirements.txt
# Install FusionBench
cd fusion_bench
pip install -e .
```

## Data Preparation

The LIBERO datasets can be downloaded directly from here or obtained by following the official LIBERO documentation. To train on LIBERO, the raw demonstrations must be converted into the RLDS format. You may either download the RLDS-converted version from here or convert the data yourself using this code.
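After conversion, a quick way to sanity-check the RLDS data is to load one episode with `tensorflow_datasets`; the snippet below is a generic sketch (the dataset directory is a placeholder), not part of this repository.

```python
# Hedged sketch: inspect one converted RLDS episode with tensorflow_datasets.
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory("data/libero_spatial_rlds")  # placeholder path
ds = builder.as_dataset(split="train")
for episode in ds.take(1):
    # In RLDS, each episode carries a nested dataset of steps holding
    # observation, action, and language fields.
    for step in episode["steps"].take(2):
        print(list(step.keys()))
```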
| Method | Spatial | Object | Goal | Long-10 | Avg |
|---|---|---|---|---|---|
|  | 98.0 | 98.6 | 95.0 | 95.0 | 96.7 |
|  | 96.0 | 63.2 | 62.0 | 40.6 | 65.5 |
|  | 99.4 | 97.8 | 74.4 | 54.8 | 81.6 |
|  | 96.8 | 98.8 | 84.8 | 71.4 | 88.0 |
|  | 98.0 | 98.8 | 85.4 | 76.6 | 89.7 |
|  | 97.6 | 98.2 | 85.6 | 78.2 | 89.9 |
|  | 94.8 | 94.6 | 91.8 | 79.4 | 90.2 |
For real-world experiments, we follow the same training pipeline as for LIBERO; the only difference is that the dataset is first converted into the LeRobot format before training. You can convert your dataset with any4lerobot.
## Training

Our model is built on the Qwen2.5-0.5B VLM, so you must download the pretrained VLM and place it under `/pretrained_models` before starting training. Training is then launched with the script `bash_scripts/finetune_libero.sh`. All training runs on a single NVIDIA A6000 Ada 48GB GPU (approximately 26 GB of memory). Most task suites finish within a few hours, while the Long-10 suite takes around 24 hours. The training length is controlled by the `--max_steps` argument:
|  | Spatial | Object | Goal | Long-10 |
|---|---|---|---|---|
| Steps | 30,000 | 20,000 | 30,000 | 50,000 |
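If you pull the VLM from the Hugging Face Hub, a download along the following lines should work; the repo id `Qwen/Qwen2.5-0.5B` and the target directory are our assumptions, so adjust them to the exact checkpoint you use.

```python
# Hedged sketch: fetch the pretrained VLM into the directory the training
# script expects (repo id and local path are assumptions).
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Qwen/Qwen2.5-0.5B",
                  local_dir="pretrained_models/Qwen2.5-0.5B")
```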
Run the following command to train MergeVLA:
```bash
bash bash_scripts/finetune_libero.sh
```

The MergeVLA expert models trained on LIBERO are available here.
## Model Merging

Model merging is implemented in model_merging/mergy.py. The merge algorithm is selected with `algo_name = ["TATallMask", "weighted_average"]`: the first entry merges the VLM (which relies on a pretrained backbone), and the second merges the components without a pretrained counterpart, namely the action query, action head, and proprio projector, using weighted averaging by default. All available algorithms are implemented in `get_algo()`. Because merging requires access to the pretrained VLM and loading it directly is slow, we use `save_vlm()` to store the pretrained VLM inside the MergeVLA structure with zero-initialized action queries, and then use `load_vlm_from_vla()` for fast reloading during subsequent merges.
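For intuition, weighted averaging over these components amounts to a per-tensor convex combination of the per-task state dicts. The sketch below is a generic illustration with uniform weights, not the FusionBench implementation.

```python
# Generic weighted averaging of matching tensors across per-task checkpoints
# (illustrative sketch; the repository delegates this to its merge algorithms).
import torch

def average_state_dicts(state_dicts, weights=None):
    if weights is None:  # default to a uniform average
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

sd_a = {"action_head.weight": torch.ones(2, 2)}   # dummy per-task tensors
sd_b = {"action_head.weight": torch.zeros(2, 2)}
print(average_state_dicts([sd_a, sd_b])["action_head.weight"])  # 0.5 everywhere
```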
The `merge()` function supports two modes:

- Without `eval_task`, it produces a generalist model by merging all tasks and constructing a MoE action head with the stored task masks.
- With `eval_task`, it produces a task-specific model by merging the shared components while keeping the non-mergeable parts from the target task and applying the corresponding mask.
```python
if __name__ == "__main__":
    merged_tasks = ["spatial", "object", "goal", "10"]
    algo_name = ["TATallMask", "weighted_average"]

    # Merge all tasks into a single unified model with a MoE action head
    # (stores all task-specific components and masks)
    action_head_layer_num = 1
    k_gate = 8
    merge(merged_tasks=merged_tasks, algo_name=algo_name, k_gate=k_gate,
          action_head_layer_num=action_head_layer_num,
          note=f'AHnum_{action_head_layer_num}_k_{k_gate}')

    # Merge models for a specific evaluation task
    eval_task = "spatial"
    merge(merged_tasks=merged_tasks, algo_name=algo_name,
          eval_task=TaskSuite(tasks[eval_task]),
          note=f'eval_{eval_task}')
```

## Evaluation

The main evaluation script is `experiments/robot/libero/run_libero_eval.py`. In standard (fine-tuned model) evaluation, each task suite requires its own checkpoint. In merged-model evaluation, a single merged checkpoint is evaluated across all task suites: we first call `task_router()` to infer the most appropriate task mask and action expert at test time, and these are then applied for the rest of the evaluation. The `--task_suite_name` argument is only used to load the task data; after routing, `expert_name` and `expert_idx` determine which mask and action expert are used.
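As a rough mental model of the router (not the repository's implementation), one can think of test-time task inference as matching an embedding of the initial observation against one stored prototype per merged task; everything in the toy sketch below, from the 128-d embeddings to the prototype dictionary, is an assumption for illustration.

```python
# Toy mock of a test-time task router (illustrative only; the real
# task_router() operates on VLM features inside the evaluation script).
import torch
import torch.nn.functional as F

prototypes = {  # one stored embedding per merged task (random here)
    "spatial": torch.randn(128),
    "object": torch.randn(128),
    "goal": torch.randn(128),
    "10": torch.randn(128),
}

def route(obs_embedding):
    names = list(prototypes)
    sims = torch.stack([
        F.cosine_similarity(obs_embedding, prototypes[n], dim=0) for n in names
    ])
    idx = int(sims.argmax())
    return names[idx], idx  # analogous to expert_name, expert_idx

expert_name, expert_idx = route(torch.randn(128))
```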
```bash
# Evaluate the fine-tuned model
bash bash_scripts/eval.sh

# Evaluate merged models
bash bash_scripts/eval_merged.sh
```

For LIBERO-plus evaluation, we use the same checkpoints trained on LIBERO (available via the links above); only the environment needs to be switched to LIBERO-plus.
We also provide an implementation compatible with the LeRobot framework. To use it, first install LeRobot following the official instructions, then copy `lerobot_code/mergevla` from this repository into `src/lerobot/policies` in your local LeRobot codebase. The policy can be loaded as follows:
```python
from lerobot.policies.mergevla import MergeVLAPolicy, MergeVLAConfig

cfg = MergeVLAConfig(
    pretrained_checkpoint="path/to/your/checkpoint",
    device="cuda",
)
policy = MergeVLAPolicy(cfg)
```

Note that the current LeRobot-compatible code does not support the router or the MoE-based action head, so it can only load a single set of task-specific weights. As described in Model Merging, use the `eval_task` mode to merge the model first.
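Once a merged task-specific checkpoint is in place, a minimal inference sketch looks like the following, assuming LeRobot's standard `select_action()` interface; the observation keys and tensor shapes are placeholders to replace with the camera names and state dimension from your config.

```python
# Hedged usage sketch: one inference step with the loaded policy.
# Keys and shapes are placeholders; tensors should live on the policy's device.
import torch

observation = {
    "observation.images.top": torch.zeros(1, 3, 224, 224),  # dummy camera frame
    "observation.state": torch.zeros(1, 8),                 # dummy proprio state
}
with torch.no_grad():
    action = policy.select_action(observation)
print(action.shape)
```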
## Citation

```bibtex
@misc{fu2025mergevla,
      title={MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent},
      author={Yuxia Fu and Zhizhen Zhang and Yuqi Zhang and Zijian Wang and Zi Huang and Yadan Luo},
      year={2025},
      eprint={2511.18810},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2511.18810},
}
```

## Acknowledgment

Our code is built upon the following open-source projects: