Shijie Lian1,2,* Bin Yu2,4,* Xiaopeng Lin2,5,* Laurence T. Yang6,1,† Zhaolong Shen2,7
Changti Wu2,8 Yuzhuo Miao2,4 Cong Huang2,3 Kai Chen2,3,9,†
1HUST, 2ZGCA, 3ZGCI, 4HIT, 5HKUST(GZ), 6ZZU, 7BUAA, 8ECNU, 9DeepCybo
*Equal contribution, †Corresponding author
Zhongguancun Academy &
Zhongguancun Institute of Artificial Intelligence
- [Feb 10, 2026] ⚡ LangForce has been integrated into starVLA. You can now directly train LangForce through starVLA and perform end-to-end training and evaluation on benchmarks such as LIBERO, SimplerEnv, and RoboCasa.
Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a|v)$ and a language-conditioned posterior $\pi(a|v, \ell)$.
LangForce is a novel framework designed to solve the Vision Shortcut problem in Vision-Language-Action (VLA) models.
In current VLA training, goal-driven datasets often make language instructions highly predictable from visual observations alone. This leads to Information Collapse, where the model ignores language and degenerates into a vision-only policy that fails in out-of-distribution (OOD) scenarios. LangForce addresses this by:
- Bayesian Decomposition: Explicitly modeling a vision-only prior $p(a|v)$ and a language-conditioned posterior $\pi(a|v, \ell)$.
- LLR Optimization: Maximizing the Log-Likelihood Ratio (LLR) to penalize actions that rely solely on visual cues and reward actions that are truly grounded in language instructions.
- Dual-Branch Architecture: Uses learnable Latent Action Queries to decouple vision-only and language-conditioned action distributions.
- Zero Extra Data: Achieves significant performance gains (e.g., +11.3% on SimplerEnv) using the exact same datasets as baselines.
- Preserves VLM Intelligence: Effectively regularizes the model to prevent the "catastrophic forgetting" of general multimodal reasoning capabilities common in standard VLA fine-tuning.
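The LLR objective above can be illustrated with a toy calculation. This is a minimal sketch, not the actual LangForce implementation: it assumes scalar Gaussian action densities purely for illustration, and all function names here are hypothetical.

```python
import math

def gaussian_logpdf(a, mu, sigma=1.0):
    # log N(a; mu, sigma^2) for a single scalar action dimension
    return -0.5 * math.log(2 * math.pi * sigma**2) - (a - mu) ** 2 / (2 * sigma**2)

def llr(a, mu_posterior, mu_prior, sigma=1.0):
    # Log-Likelihood Ratio: log pi(a | v, l) - log p(a | v).
    # Positive when the language-conditioned posterior explains the action
    # better than the vision-only prior, i.e. the action is genuinely
    # grounded in the instruction; negative for vision-shortcut actions.
    return gaussian_logpdf(a, mu_posterior, sigma) - gaussian_logpdf(a, mu_prior, sigma)

# Toy example: the instruction shifts the target action to 1.0,
# while vision alone would predict 0.0.
print(llr(1.0, mu_posterior=1.0, mu_prior=0.0))  # positive: language-grounded action
print(llr(0.0, mu_posterior=1.0, mu_prior=0.0))  # negative: vision-shortcut action
```

Maximizing this quantity pushes the policy toward actions the vision-only prior cannot explain on its own, which is the intuition behind the LLR term.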
| Method | SimplerEnv (Avg) | RoboCasa (Avg) | LIBERO (Avg) |
|---|---|---|---|
| QwenGR00T (Baseline) | 55.2% | 47.8% | 96.5% |
| LangForce (Ours) | 66.5% (+11.3%) | 52.6% (+4.8%) | 98.4% (+1.9%) |
- Install starVLA: Our training pipeline is built on the starVLA framework. To get started, follow the instructions below to set up the base environment.
🛠 starVLA Environment Setup
# Clone the repo
git clone https://github.com/starVLA/starVLA
# Create conda environment
conda create -n starVLA python=3.10 -y
conda activate starVLA
# Install requirements
pip install -r requirements.txt
# Install FlashAttention2
pip install flash-attn --no-build-isolation
# Install starVLA
pip install -e .

In particular, we list the versions of the relevant packages we used below:
torch==2.6.0+cu12.4
flash-attention==2.7.4.post1
## If using Qwen3.5 as the VLM
flash-linear-attention==0.3.2
causal_conv1d==1.5.0.post8
- Vocabulary Expansion: LangForce uses Qwen3-VL and extends the vocabulary with specialized tokens that serve as Latent Action Queries. Run the provided example script `add_token.py` to update the tokenizer with these additional tokens.
- Training Script: You can learn how to train LangForce with starVLA from here. Below, we provide a training script for LangForce on 8 × H100 GPUs:
conda activate starVLA
cd /xxx/workplace/starVLA-v2.0
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
export NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_TIMEOUT=1000 # NCCL timeout (unit: seconds)
framework_name=LangForceV5
base_vlm=/xxx/starVLA-v2.0/playground/Pretrained_models/Qwen3-VL-4B-with-Action-Query
run_id=GR00T_Simpler_LangForce
freeze_module_list=''
# The variables below are referenced by the launch command but were not
# defined in the original script; the values here are placeholders only.
# Adjust them to match your configuration.
vlm_template=qwen3vl
detach_prior_cond=True
num_latent_action_query=64
dit_num_layers=16
per_device_batch_size=16
config_yaml=./examples/SimplerEnv/train_files/starvla_cotrain_oxe.yaml
oxe_data_root=/xxx/starVLA/playground/Datasets/OXE_LEROBOT_DATASET/
data_mix=bridge
run_root_dir=./results/LangForce/SimplerEnv
output_dir=${run_root_dir}/${run_id}
mkdir -p ${output_dir}
accelerate launch \
--config_file starVLA/config/deepseeds/deepspeed_zero2.yaml \
--num_processes 8 \
starVLA/training/train_starvla.py \
--config_yaml ${config_yaml} \
--framework.name ${framework_name} \
--framework.qwenvl.base_vlm ${base_vlm} \
--framework.qwenvl.template ${vlm_template} \
--framework.detach_prior_cond ${detach_prior_cond} \
--framework.qwenvl.num_latent_action_query ${num_latent_action_query} \
--framework.action_model.diffusion_model_cfg.num_layers ${dit_num_layers} \
--datasets.vla_data.data_root_dir ${oxe_data_root}\
--datasets.vla_data.data_mix ${data_mix} \
--datasets.vla_data.per_device_batch_size ${per_device_batch_size} \
--trainer.freeze_modules ${freeze_module_list} \
--trainer.max_train_steps 100000 \
--trainer.save_interval 10000 \
--trainer.logging_frequency 100 \
--trainer.eval_interval 1000 \
--run_root_dir ${run_root_dir} \
--run_id ${run_id} \
--wandb_project starVLA \
--wandb_entity xxx

LangForce is currently under active development. Feel free to check back frequently for updates and new features!
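The vocabulary-expansion step (`add_token.py` above) amounts to appending new special tokens and growing the embedding table to match. The following is a conceptual sketch in plain Python; the `<ACT_i>` token names and the helper function are illustrative assumptions, not the actual starVLA API.

```python
import random

def add_latent_action_queries(vocab, embeddings, num_queries, dim=4):
    """Append <ACT_i> placeholder tokens to a toy vocabulary and grow the
    embedding table in lockstep, mimicking what a tokenizer-expansion
    script does before training. Existing token ids are left unchanged."""
    for i in range(num_queries):
        token = f"<ACT_{i}>"
        if token not in vocab:
            vocab[token] = len(vocab)  # new id appended at the end
            # new rows start from a small random init and are learned later
            embeddings.append([random.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embeddings

# Toy vocabulary and matching embedding table.
vocab = {"<pad>": 0, "pick": 1, "place": 2}
embeddings = [[0.0] * 4 for _ in vocab]
vocab, embeddings = add_latent_action_queries(vocab, embeddings, num_queries=8)
print(len(vocab), len(embeddings))  # vocabulary and embedding table stay in sync
```

In practice the same effect is achieved with a real tokenizer's add-tokens utility followed by resizing the model's input embeddings, which is what the provided script handles for Qwen3-VL.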
We would like to thank the starVLA project for its inspiring work and open-source contributions. At the same time, we also express our gratitude to the following projects:
If you find this project or the dataset helpful, please cite:
@misc{LangForce_2026_arXiv,
title={LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries},
author={Shijie Lian and Bin Yu and Xiaopeng Lin and Laurence T. Yang and Zhaolong Shen and Changti Wu and Yuzhuo Miao and Cong Huang and Kai Chen},
year={2026},
eprint={2601.15197},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2601.15197},
}