What’s Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning

Project Overview

This project addresses a critical gap in the evaluation of vision-language models: their ability to understand and reason about causal relationships. Although VLMs have demonstrated impressive performance on many downstream tasks, it remains unclear whether they truly grasp causal relations rather than relying on object recognition or activity identification shortcuts.

To bridge this gap, we introduce two new benchmarks:

  • VQA-Causal and VCR-Causal: benchmarks designed to isolate and rigorously evaluate VLMs’ causal reasoning abilities.

Our key findings:

  1. VLMs excel at object and activity recognition but perform poorly on causal reasoning, often only marginally better than random guessing.
  2. This shortcoming is primarily due to a severe lack of explicit causal expressions in widely used training datasets.
  3. Fine-tuning with hard negative cases can significantly improve a model’s causal reasoning ability while preserving downstream performance and generalization.

Our study highlights a major limitation of current VLMs and lays the groundwork for future research on causal understanding.
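As a rough illustration of the hard-negative fine-tuning in finding 3, the sketch below nudges a CLIP model so that each image’s original caption out-scores a negative caption with the causal order swapped. It uses the Hugging Face transformers CLIP interface; the checkpoint, learning rate, and data format are illustrative assumptions, not the repository’s actual recipe.

import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

# Illustrative only: checkpoint and hyperparameters are assumptions, not the paper's setup.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def hard_negative_step(image, caption, swapped_caption):
    # Score the image against the true caption and its order-swapped hard negative.
    inputs = processor(text=[caption, swapped_caption], images=image,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image          # shape (1, 2)
    # The true caption sits at index 0 and should receive the higher score.
    loss = F.cross_entropy(logits, torch.tensor([0]))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()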

Code & Data

1. Causal Order Reasoning Tests

The code for the causal order reasoning experiments can be found in the causaltest/ directory.
For example, to run the VQA-Causal tests:

python causaltest_clipfamily.py      # CLIP-family models (e.g. ViT-L/14, ViT-B/32, NegCLIP, RobustCLIP)
python causaltest_flava.py           # FLAVA
python llava_test.py                 # LLaVA
python vicuna_vqa.py                 # Vicuna

Running the scripts above yields results on CLIP-family models (CLIP ViT-L/14, CLIP ViT-B/32, NegCLIP, RobustCLIP), FLAVA, LLaVA, and Vicuna.
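For CLIP-family models, the causal order test amounts to checking whether the model prefers a caption with the correct cause-effect order over the same caption with the order reversed. A minimal sketch of that comparison is below; the checkpoint and caption pair are placeholders, and causaltest_clipfamily.py contains the actual evaluation.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                      # placeholder image
captions = ["The glass fell, so it shattered.",        # correct causal order
            "The glass shattered, so it fell."]        # reversed (incorrect) order

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print("prefers correct order:", probs[0, 0].item() > probs[0, 1].item())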

2. Object & Activity Understanding Tests

The code for the object‐and‐activity (O&A) tests can be found in the multichoice/ directory. For example:

python multichoice_clipfamily.py      # CLIP-family models (ViT-L/14, ViT-B/32, NegCLIP, RobustCLIP)
python multichoice_flava.py           # FLAVA

Running the scripts above yields results on CLIP-family models (CLIP ViT-L/14, CLIP ViT-B/32, NegCLIP, RobustCLIP) and FLAVA.

3. Data Analysis

The code for data analysis can be found in the data_analysis/ directory. For example:

python analysis_coco.py               # COCO dataset analysis
python analysis_laion400m.py          # LAION-400M dataset analysis
python analysis_vcr.py                # VCR dataset analysis
python analysis_vqaval.py             # VQA dataset analysis

Running the scripts above yields the data analysis results on the MSCOCO, LAION-400M, VCR, and VQA datasets.
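These analyses estimate how often captions contain explicit causal language. A simplified sketch of the idea is below; the marker list, file path, and COCO-style JSON layout are assumptions, so consult analysis_coco.py for the exact procedure.

import json
import re

# Explicit causal connectives to search for (illustrative, not the paper's exact list).
CAUSAL_MARKERS = re.compile(
    r"\b(because|since|therefore|thus|so that|causes?|caused|leads? to|results? in)\b",
    re.IGNORECASE)

def causal_caption_ratio(caption_file):
    # Assumes a COCO-style file with an "annotations" list of {"caption": ...} entries.
    with open(caption_file) as f:
        captions = [ann["caption"] for ann in json.load(f)["annotations"]]
    hits = sum(1 for c in captions if CAUSAL_MARKERS.search(c))
    return hits / len(captions)

print(causal_caption_ratio("captions_val2017.json"))   # placeholder file name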

4. Datasets

The datasets can be found in the datasets/ directory. For example, the VQA-Causal and VCR-Causal benchmarks are located in the datasets/benchmarks/ directory.
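A quick way to inspect a benchmark is to load the file and print a few entries. The file name and field names below are placeholders only; check the files in datasets/benchmarks/ for the actual schema.

import json

with open("datasets/benchmarks/vqa_causal.json") as f:   # placeholder file name
    examples = json.load(f)

for ex in examples[:3]:
    # Field names are illustrative; adjust to the real keys in the benchmark file.
    print(ex["image_id"], ex["correct_caption"], ex["reversed_caption"])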
