
Code for the NeurIPS 2023 paper "Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning"

RDCL: Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning


Beijing University of Posts and Telecommunications

Can machines reason about the physical world using both sight and sound, just as humans do, even when one sense is missing? Physical audiovisual commonsense reasoning demands not only perceiving objects and their interactions, but also imagining counterfactual outcomes under varying conditions. Yet most models struggle to disentangle time-invariant object properties from time-varying dynamics, and falter when modalities like audio or video are absent. To bridge this gap, we propose Robust Disentangled Counterfactual Learning (RDCL): a plug-and-play framework that (1) disentangles static and dynamic factors via a variational sequential encoder, (2) enhances causal reasoning through counterfactual intervention over physical relationships, and (3) robustly handles missing modalities by separating shared and modality-specific features. RDCL seamlessly integrates with any vision-language model, enabling more human-like, robust, and explainable physical reasoning in the real world.
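
To make the disentanglement idea concrete, here is a deliberately simplified sketch (not the repository's actual model; all names and dimensions are illustrative): a recurrent encoder emits one time-invariant (static) latent per clip and one time-varying (dynamic) latent per frame, each with its own variational posterior.

# Illustrative sketch only: a toy variational sequential encoder that splits
# per-frame features into a static (time-invariant) latent and dynamic
# (time-varying) latents. Names and sizes are assumptions, not the repo's API.
import torch
import torch.nn as nn

class ToyDisentangledEncoder(nn.Module):
    def __init__(self, feat_dim=512, static_dim=64, dynamic_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 256, batch_first=True)
        self.static_head = nn.Linear(256, 2 * static_dim)    # mu, logvar (one per clip)
        self.dynamic_head = nn.Linear(256, 2 * dynamic_dim)  # mu, logvar (one per frame)

    @staticmethod
    def reparameterize(mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, frame_feats):                 # (B, T, feat_dim)
        h, _ = self.rnn(frame_feats)                # (B, T, 256)
        s_mu, s_logvar = self.static_head(h.mean(dim=1)).chunk(2, dim=-1)
        d_mu, d_logvar = self.dynamic_head(h).chunk(2, dim=-1)
        z_static = self.reparameterize(s_mu, s_logvar)    # time-invariant factor
        z_dynamic = self.reparameterize(d_mu, d_logvar)   # time-varying factors
        return z_static, z_dynamic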

Updates

  • 2025-10-25 🎉 Paper accepted to TPAMI 2025!

  • 2025-07-24 🚀 Added inference and fine-tuning code for Qwen2.5-VL on PACS, along with new distance metrics for physical knowledge correlation.

  • 2025-02-17 ❤️ Released extended version RDCL (Robust Disentangled Counterfactual Learning) for audiovisual physical reasoning with missing modalities.

  • 2023-12-10 🎉 Paper accepted to NeurIPS 2023!


Dataset

We use the PACS dataset for physical reasoning and a Material Classification dataset for attribute disentanglement.
The extended RDCL dataset (with VLM-generated object descriptions and audio) is available via Baidu Netdisk:


Download Model Weights

Place all downloaded assets into the assets/ folder.

CLIP

wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt

AudioCLIP

Download from AudioCLIP Releases:

  • AudioCLIP-Partial-Training.pt
  • bpe_simple_vocab_16e6.txt.gz
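
An optional sanity check (a small helper of our own, not part of the repo) that the files listed above ended up in the assets/ folder:

# Optional check: verify the downloaded weights are in assets/.
from pathlib import Path

expected = [
    "assets/ViT-B-16.pt",
    "assets/AudioCLIP-Partial-Training.pt",
    "assets/bpe_simple_vocab_16e6.txt.gz",
]
missing = [p for p in expected if not Path(p).exists()]
print("All assets found." if not missing else f"Missing: {missing}")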

Requirements

  • Python 3.8.10
  • PyTorch 1.11.0
  • CUDA 11.3
conda create --name dcl python=3.8
conda activate dcl
pip install -r requirements.txt

Training

DCL (NeurIPS 2023 Version)

# PACS
python3 train_1.py

# Material Classification
python3 train_classify.py

RDCL (TPAMI 2025) – with missing modalities

# e.g., missing audio
python3 train_1.py --miss_modal audio
python3 train_classify.py --miss_modal audio

Prediction

python3 predict.py -model_path PATH_TO_MODEL_WEIGHTS -split test

Qwen2.5-VL Integration

Data Processing

# Inference data
python3 PACS_data/scripts/processing_qwen_inference_multi_video_data.py

# Fine-tuning data
python3 PACS_data/scripts/processing_qwen_finetune_data.py

Inference

python PACS_inference.py \
  --model_dir Qwen/Qwen2.5-VL-3B-Instruct \
  --tokenizer_dir Qwen/Qwen2.5-VL-3B-Instruct \
  --split test --data_type data

Fine-tuning

cd qwen-vl-finetune
sh scripts/sft_PACS.sh

✅ Trained on 4×V100 GPUs. See Qwen2.5-VL Finetune for details.

Baseline with Qwen Visual Encoder

To accelerate training, we extract visual features offline.
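
The idea, sketched below in simplified form (the encode_frames callable and file layout are assumptions, not the actual scripts' interface), is to run the visual encoder once per video, save the resulting tensors, and load them directly during baseline training:

# Simplified sketch of offline feature caching; extract_feature.py's real
# interface and file layout may differ.
from pathlib import Path
import torch

def cache_features(video_ids, encode_frames, out_dir="features"):
    # encode_frames: any callable mapping a video id to a feature tensor (hypothetical).
    Path(out_dir).mkdir(exist_ok=True)
    for vid in video_ids:
        feats = encode_frames(vid)                  # expensive encoder forward pass, done once
        torch.save(feats, f"{out_dir}/{vid}.pt")

def load_cached_features(vid, out_dir="features"):
    return torch.load(f"{out_dir}/{vid}.pt")        # cheap load at training time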

Feature Extraction

# Video frames
python3 extract_feature.py --model_size 3B
python3 extract_feature.py --model_size 7B
python3 extract_feature.py --model_size 32B

# Single images
python3 extract_feature_single_image.py --model_size 3B

Training Baseline

python3 train_qwen_baseline.py --Qwen2_5_Size 3B

Alternative Distance Metrics

python3 train_qwen_baseline.py --sim_type euclidean
python3 train_qwen_baseline.py --sim_type manhattan
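
For reference, a small sketch of the distance computations these flags name; how the training script applies them to the extracted features is not shown here, and the tensor shapes are assumptions:

# Illustration of the two alternative distance metrics between feature vectors.
import torch

def euclidean_distance(a, b):
    # L2 distance: sqrt(sum((a_i - b_i)^2))
    return torch.norm(a - b, p=2, dim=-1)

def manhattan_distance(a, b):
    # L1 distance: sum(|a_i - b_i|)
    return torch.norm(a - b, p=1, dim=-1)

x, y = torch.randn(4, 256), torch.randn(4, 256)
print(euclidean_distance(x, y), manhattan_distance(x, y))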

Acknowledgements

This project builds upon open-source efforts from AudioCLIP and PACS.

We sincerely thank Andrey Guzhov, Samuel Yu, and the broader research community for their foundational contributions.

Citation

If you find our work useful, please cite:

@inproceedings{lv2023disentangled,
  title={Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning},
  author={Lv, Changsheng and Qi, Mengshi and Li, Xia and Yang, Zhengyuan and Ma, Huadong},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2023}
}

@ARTICLE{11222969,
  author={Qi, Mengshi and Lv, Changsheng and Ma, Huadong},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning}, 
  year={2025},
  volume={},
  number={},
  pages={1-14},
  doi={10.1109/TPAMI.2025.3627224}}

⭐ Don’t forget to star this repo if you find it helpful!
