LangForce : Bayesian Decomposition of Vision Language Action Models via Latent Action Queries


Shijie Lian1,2,* Bin Yu2,4,* Xiaopeng Lin2,5,* Laurence T. Yang6,1,† Zhaolong Shen2,7
Changti Wu2,8 Yuzhuo Miao2,4 Cong Huang2,3 Kai Chen2,3,9,†

1HUST, 2ZGCA, 3ZGCI, 4HIT, 5HKUST(GZ), 6ZZU, 7BUAA, 8ECNU, 9DeepCybo

*Equal contribution, †Corresponding author

ZGCA: Zhongguancun Academy · ZGCI: Zhongguancun Institute of Artificial Intelligence


📢 News

  • [Feb 10, 2026] ⚡ LangForce has been integrated into starVLA. You can now directly train LangForce through starVLA and perform end-to-end training and evaluation on benchmarks such as LIBERO, SimplerEnv, and RoboCasa.

📖 Abstract

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.

🏗️ Architecture

LangForce is a novel framework designed to solve the Vision Shortcut problem in Vision-Language-Action (VLA) models.

LangForce Framework
In current VLA training, goal-driven datasets often make language instructions highly predictable from visual observations alone. This leads to Information Collapse, where the model ignores language and degenerates into a vision-only policy, failing miserably in out-of-distribution (OOD) scenarios.
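Information Collapse can be made concrete with a toy joint distribution: when the instruction is fully determined by the scene, the conditional mutual information $I(a; \ell \mid v)$ is exactly zero. The following is a minimal sketch; the distributions and helper function are illustrative and not from the paper's code:

```python
import math
from collections import defaultdict

def conditional_mi(joint):
    """I(A; L | V) = sum_{v,l,a} p(v,l,a) * log[ p(v) p(v,l,a) / (p(v,l) p(v,a)) ],
    with the joint given as a {(v, l, a): prob} dict."""
    pv, pvl, pva = defaultdict(float), defaultdict(float), defaultdict(float)
    for (v, l, a), p in joint.items():
        pv[v] += p
        pvl[(v, l)] += p
        pva[(v, a)] += p
    return sum(p * math.log(pv[v] * p / (pvl[(v, l)] * pva[(v, a)]))
               for (v, l, a), p in joint.items() if p > 0)

# Goal-driven data: the instruction is predictable from the scene alone,
# so language carries no extra information about the action.
collapsed = {("scene0", "pick", "a0"): 0.5, ("scene1", "push", "a1"): 0.5}

# Balanced data: each scene is paired with both instructions, and the
# action depends on which instruction was given.
grounded = {("scene0", "pick", "a0"): 0.25, ("scene0", "push", "a1"): 0.25,
            ("scene1", "pick", "a0"): 0.25, ("scene1", "push", "a1"): 0.25}

print(round(conditional_mi(collapsed), 6))  # 0.0
print(round(conditional_mi(grounded), 6))   # log(2) ≈ 0.693147
```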

LangForce addresses this by:

  1. Bayesian Decomposition: Explicitly modeling a vision-only prior $p(a|v)$ and a language-conditioned posterior $\pi(a|v, \ell)$.
  2. LLR Optimization: Maximizing the Log-Likelihood Ratio (LLR) to penalize actions that rely solely on visual cues and reward actions that are truly grounded in language instructions.
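In log-probability terms, the LLR idea can be sketched as a per-sample loss. This is a hypothetical simplification; the weighting and exact form used in the paper may differ:

```python
import math

def langforce_loss(log_posterior, log_prior, pmi_weight=0.1):
    """Sketch of an LLR-regularized objective (hypothetical form): standard
    NLL on the language-conditioned posterior pi(a|v,l), minus a weighted
    PMI bonus log pi(a|v,l) - log p(a|v). Actions that the vision-only
    prior already predicts earn no bonus."""
    nll = -log_posterior
    pmi = log_posterior - log_prior  # conditional PMI estimate (LLR)
    return nll - pmi_weight * pmi

# A language-grounded action: likely under the posterior, unlikely under
# the vision-only prior, so the PMI bonus lowers the loss.
grounded = langforce_loss(math.log(0.8), math.log(0.2))
# A vision-shortcut action: equally likely with or without the instruction.
shortcut = langforce_loss(math.log(0.8), math.log(0.8))
print(grounded < shortcut)  # True
```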

✨ Key Features

  • Dual-Branch Architecture: Uses learnable Latent Action Queries to decouple vision-only and language-conditioned action distributions.
  • Zero Extra Data: Achieves significant performance gains (e.g., +11.3% on SimplerEnv) using the exact same datasets as baselines.
  • Preserves VLM Intelligence: Effectively regularizes the model to prevent the "catastrophic forgetting" of general multimodal reasoning capabilities common in standard VLA fine-tuning.
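A minimal sketch of the dual-branch idea follows; the function name and the toy "backbone" are hypothetical (the real model runs a shared VLM transformer over the token sequences):

```python
def dual_branch_forward(vision_tokens, lang_tokens, latent_queries, backbone):
    """Run the shared backbone twice over token lists: once without the
    instruction (vision-only prior branch) and once with it
    (language-conditioned posterior branch). The latent action queries are
    appended in both passes and read out as action features."""
    prior_feats = backbone(vision_tokens + latent_queries)
    posterior_feats = backbone(vision_tokens + lang_tokens + latent_queries)
    return prior_feats, posterior_feats

# Toy "backbone" that just counts tokens, showing that the two branches
# share weights (the same function) but see different context.
prior, post = dual_branch_forward(["v1", "v2"], ["pick", "the", "cup"],
                                  ["<q0>", "<q1>"], backbone=len)
print(prior, post)  # 4 7
```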

📊 Performance

| Method | SimplerEnv (Avg) | RoboCasa (Avg) | LIBERO (Avg) |
| --- | --- | --- | --- |
| QwenGR00T (Baseline) | 55.2% | 47.8% | 96.5% |
| LangForce (Ours) | 66.5% (+11.3%) | 52.6% (+4.8%) | 98.4% (+1.9%) |

🚀 Training

  1. Install starVLA: Our training pipeline is built on the starVLA framework. To get started, follow the instructions below to set up the base environment.
🛠 starVLA Environment Setup
# Clone the repo
git clone https://github.com/starVLA/starVLA

# Create conda environment
conda create -n starVLA python=3.10 -y
conda activate starVLA

# Install requirements
pip install -r requirements.txt

# Install FlashAttention2
pip install flash-attn --no-build-isolation

# Install starVLA
pip install -e .

In particular, we list the versions of the relevant packages we used below:

torch==2.6.0+cu124
flash-attn==2.7.4.post1
## If using Qwen3.5 as the VLM
flash-linear-attention==0.3.2
causal_conv1d==1.5.0.post8
  2. Vocabulary Expansion: LangForce utilizes Qwen3-VL and extends the vocabulary with specialized tokens that serve as Latent Action Queries. Run the provided example script add_token.py to update the tokenizer with these additional tokens.

  3. Training Script: You can learn how to train LangForce using starVLA from here. Below, we provide a training script for LangForce on 8 × H100 GPUs:

conda activate starVLA
cd /xxx/workplace/starVLA-v2.0

export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
export NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_TIMEOUT=1000  # timeout in seconds

framework_name=LangForceV5
base_vlm=/xxx/starVLA-v2.0/playground/Pretrained_models/Qwen3-VL-4B-with-Action-Query
vlm_template=qwen3vl              # placeholder; set to the chat template of your VLM
detach_prior_cond=true            # placeholder; whether to detach the vision-only prior branch
num_latent_action_query=64        # placeholder; number of latent action query tokens
dit_num_layers=16                 # placeholder; depth of the diffusion action head
per_device_batch_size=16          # placeholder; adjust to fit GPU memory
run_id=GR00T_Simpler_LangForce
freeze_module_list=''
config_yaml=./examples/SimplerEnv/train_files/starvla_cotrain_oxe.yaml
oxe_data_root=/xxx/starVLA/playground/Datasets/OXE_LEROBOT_DATASET/
data_mix=bridge
run_root_dir=./results/LangForce/SimplerEnv

output_dir=${run_root_dir}/${run_id}
mkdir -p ${output_dir}

accelerate launch \
  --config_file starVLA/config/deepseeds/deepspeed_zero2.yaml \
  --num_processes 8 \
  starVLA/training/train_starvla.py \
  --config_yaml ${config_yaml} \
  --framework.name ${framework_name} \
  --framework.qwenvl.base_vlm ${base_vlm} \
  --framework.qwenvl.template ${vlm_template} \
  --framework.detach_prior_cond ${detach_prior_cond} \
  --framework.qwenvl.num_latent_action_query ${num_latent_action_query} \
  --framework.action_model.diffusion_model_cfg.num_layers ${dit_num_layers} \
  --datasets.vla_data.data_root_dir ${oxe_data_root} \
  --datasets.vla_data.data_mix ${data_mix} \
  --datasets.vla_data.per_device_batch_size ${per_device_batch_size} \
  --trainer.freeze_modules ${freeze_module_list} \
  --trainer.max_train_steps 100000 \
  --trainer.save_interval 10000 \
  --trainer.logging_frequency 100 \
  --trainer.eval_interval 1000 \
  --run_root_dir ${run_root_dir} \
  --run_id ${run_id} \
  --wandb_project starVLA \
  --wandb_entity xxx
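The vocabulary-expansion step above amounts to registering extra tokens and growing the embedding table. Below is a dict-based sketch of the idea behind add_token.py; the `<ACT_i>` token names are illustrative, and with HuggingFace transformers the equivalent would be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`:

```python
def add_action_query_tokens(vocab, num_queries=8):
    """Append latent-action-query tokens (illustrative names <ACT_i>) to a
    tokenizer vocabulary, modeled here as a plain token -> id dict."""
    new_tokens = [f"<ACT_{i}>" for i in range(num_queries)]
    for tok in new_tokens:
        if tok not in vocab:  # idempotent: skip tokens already registered
            vocab[tok] = len(vocab)
    return new_tokens

vocab = {"<pad>": 0, "pick": 1, "cup": 2}
added = add_action_query_tokens(vocab, num_queries=4)
print(added)             # ['<ACT_0>', '<ACT_1>', '<ACT_2>', '<ACT_3>']
print(vocab["<ACT_0>"])  # 3
```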

LangForce is currently under active development. Feel free to check back frequently for updates and new features!

🙏 Acknowledgements

We would like to thank the starVLA project for its inspiring work and open-source contributions, as well as the other open-source projects that made this work possible.

Citation

If you find this project or the dataset helpful, please cite:

@misc{LangForce_2026_arXiv,
      title={LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries}, 
      author={Shijie Lian and Bin Yu and Xiaopeng Lin and Laurence T. Yang and Zhaolong Shen and Changti Wu and Yuzhuo Miao and Cong Huang and Kai Chen},
      year={2026},
      eprint={2601.15197},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.15197}, 
}
