Can machines reason about the physical world using both sight and sound, just as humans do, even when one sense is missing? Physical audiovisual commonsense reasoning demands not only perceiving objects and their interactions, but also imagining counterfactual outcomes under varying conditions. Yet most models struggle to disentangle time-invariant object properties from time-varying dynamics, and falter when modalities like audio or video are absent. To bridge this gap, we propose Robust Disentangled Counterfactual Learning (RDCL): a plug-and-play framework that (1) disentangles static and dynamic factors via a variational sequential encoder, (2) enhances causal reasoning through counterfactual intervention over physical relationships, and (3) robustly handles missing modalities by separating shared and modality-specific features. RDCL seamlessly integrates with any vision-language model, enabling more human-like, robust, and explainable physical reasoning in the real world.
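For intuition, below is a minimal, hypothetical PyTorch sketch of the static/dynamic disentanglement idea: a sequential variational encoder produces one time-invariant latent per video and one time-varying latent per frame via the usual reparameterization trick. Module and dimension names are our own illustration, not the repo's actual implementation.

```python
# Illustrative sketch only -- module and dimension names are hypothetical.
import torch
import torch.nn as nn

class DisentangledSeqEncoder(nn.Module):
    def __init__(self, feat_dim=512, static_dim=128, dynamic_dim=128):
        super().__init__()
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Time-invariant (static) factor: pooled over the whole sequence.
        self.static_mu = nn.Linear(feat_dim, static_dim)
        self.static_logvar = nn.Linear(feat_dim, static_dim)
        # Time-varying (dynamic) factor: one latent per frame.
        self.dynamic_mu = nn.Linear(feat_dim, dynamic_dim)
        self.dynamic_logvar = nn.Linear(feat_dim, dynamic_dim)

    @staticmethod
    def reparameterize(mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):            # x: (B, T, feat_dim) frame features
        h, _ = self.gru(x)           # (B, T, feat_dim)
        pooled = h.mean(dim=1)       # sequence-level summary for the static code
        z_s = self.reparameterize(self.static_mu(pooled), self.static_logvar(pooled))
        z_d = self.reparameterize(self.dynamic_mu(h), self.dynamic_logvar(h))
        return z_s, z_d              # (B, static_dim), (B, T, dynamic_dim)
```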
## Updates

- **2025-10-25** 🎉 Paper accepted to TPAMI 2025!
- **2025-07-24** 🚀 Added inference and fine-tuning code for Qwen2.5-VL on PACS, along with new distance metrics for physical knowledge correlation.
- **2025-02-17** ❤️ Released the extended version, RDCL (Robust Disentangled Counterfactual Learning), for audiovisual physical reasoning with missing modalities.
- **2023-12-10** 🎉 Paper accepted to NeurIPS 2023!
## Contents

- Updates
- Paper
- Dataset
- Download Model Weights
- Requirements
- Training
- Prediction
- Qwen2.5-VL Integration
- Baseline with Qwen Visual Encoder
- Acknowledgements
- Citation
## Paper

- **NeurIPS 2023**: Disentangled Counterfactual Learning for Physical Commonsense Reasoning
- **arXiv 2025 (Extended)**: Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning (RDCL)
## Dataset

We use the PACS dataset for physical reasoning and a Material Classification dataset for attribute disentanglement.

The extended RDCL dataset (with VLM-generated object descriptions and audio) is available via Baidu Netdisk:

- 🔗 Baidu Netdisk (extraction code: `v458`)

Place all downloaded assets into the `assets/` folder.
## Download Model Weights

Download the CLIP ViT-B/16 checkpoint:

```bash
wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
```

Download the following from the AudioCLIP releases:

- `AudioCLIP-Partial-Training.pt`
- `bpe_simple_vocab_16e6.txt.gz`
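As an optional sanity check that the downloads deserialize, you can try the snippet below. The file formats are our assumption (the CLIP checkpoint is typically a TorchScript archive, the AudioCLIP checkpoint a torch-pickled file); adjust if loading fails.

```python
# Optional sanity check for the downloaded weights in assets/.
# Assumes ViT-B-16.pt is a TorchScript archive and
# AudioCLIP-Partial-Training.pt is a torch-pickled checkpoint.
import torch

clip_jit = torch.jit.load("assets/ViT-B-16.pt", map_location="cpu")
print(type(clip_jit).__name__)

audioclip_ckpt = torch.load("assets/AudioCLIP-Partial-Training.pt", map_location="cpu")
print(f"{len(audioclip_ckpt)} entries in AudioCLIP checkpoint")
```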
## Requirements

- Python 3.8.10
- PyTorch 1.11.0
- CUDA 11.3

```bash
conda create --name dcl python=3.8
conda activate dcl
pip install -r requirements.txt
```

## Training

```bash
# PACS
python3 train_1.py
# Material Classification
python3 train_classify.py
```

To train with a missing modality (e.g., missing audio):

```bash
python3 train_1.py --miss_modal audio
python3 train_classify.py --miss_modal audio
```

## Prediction

```bash
python3 predict.py -model_path PATH_TO_MODEL_WEIGHTS -split test
```

## Qwen2.5-VL Integration

Prepare the inference and fine-tuning data:

```bash
# Inference data
python3 PACS_data/scripts/processing_qwen_inference_multi_video_data.py
# Fine-tuning data
python3 PACS_data/scripts/processing_qwen_finetune_data.py
```

Run inference:

```bash
python PACS_inference.py \
--model_dir Qwen/Qwen2.5-VL-3B-Instruct \
--tokenizer_dir Qwen/Qwen2.5-VL-3B-Instruct \
--split test --data_type data
```

Fine-tune the model:

```bash
cd qwen-vl-finetune
sh scripts/sft_PACS.sh
```

✅ Trained on 4×V100 GPUs. See Qwen2.5-VL Finetune for details.
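For reference, the core of a Qwen2.5-VL video inference call with Hugging Face `transformers` looks roughly like the sketch below. This is a generic example, not the repo's `PACS_inference.py`; the video path and question are placeholders.

```python
# Generic Qwen2.5-VL video QA sketch with transformers + qwen_vl_utils;
# the video path and question are placeholders.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/object_video.mp4"},
        {"type": "text", "text": "Is the object in this video deformable?"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```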
## Baseline with Qwen Visual Encoder

To accelerate training, we extract visual features offline:

```bash
# Video frames
python3 extract_feature.py --model_size 3B
python3 extract_feature.py --model_size 7B
python3 extract_feature.py --model_size 32B
# Single images
python3 extract_feature_single_image.py --model_size 3B
```

Train the baseline:

```bash
python3 train_qwen_baseline.py --Qwen2_5_Size 3B
```

Select a distance metric for physical knowledge correlation:

```bash
python3 train_qwen_baseline.py --sim_type euclidean
python3 train_qwen_baseline.py --sim_type manhattan
```
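For intuition, the two `--sim_type` options correspond to similarity scores derived from negated L2 and L1 distances between embeddings. The sketch below uses our own hypothetical function and variable names, not the repo's exact code.

```python
# Hedged sketch of what --sim_type might compute: similarity between two
# physical-knowledge embeddings as a negated Lp distance.
import torch

def knowledge_similarity(a: torch.Tensor, b: torch.Tensor,
                         sim_type: str = "euclidean") -> torch.Tensor:
    """a, b: (B, D) embeddings; returns (B,) similarity scores."""
    if sim_type == "euclidean":
        return -torch.norm(a - b, p=2, dim=-1)       # negated L2 distance
    if sim_type == "manhattan":
        return -torch.sum(torch.abs(a - b), dim=-1)  # negated L1 distance
    raise ValueError(f"unknown sim_type: {sim_type}")
```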
## Acknowledgements

This project builds upon prior open-source efforts. We sincerely thank Andrey Guzhov, Samuel Yu, and the broader research community for their foundational contributions.
## Citation

If you find our work useful, please cite:

```bibtex
@inproceedings{lv2023disentangled,
title={Disentangled Counterfactual Learning for Physical Commonsense Reasoning},
author={Lv, Changsheng and Qi, Mengshi and Li, Xia and Yang, Zhengyuan and Ma, Huadong},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2023}
}
@article{qi2025robust,
author={Qi, Mengshi and Lv, Changsheng and Ma, Huadong},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning},
year={2025},
pages={1--14},
doi={10.1109/TPAMI.2025.3627224}
}
```
⭐ Don’t forget to star this repo if you find it helpful!
