
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks

CVPR 2025

Nina Shvetsova  Arsha Nagrani  Bernt Schiele  Hilde Kuehne  Christian Rupprecht

Project Page    


UTD Dataset

We release the UTD dataset. Check the project webpage to download it. Our UTD dataset includes:

  1. UTD-descriptions
    Frame-level annotations for ~2 million videos, covering four conceptual categories visible in video frames:
    objects, activities, verbs, and objects+composition+activities (i.e., detailed free-form frame descriptions).
    Descriptions are provided for 8 uniformly sampled frames per video from the training and test/val splits of 12 widely used activity recognition and text-to-video retrieval datasets.

  2. UTD-splits
    Object-debiased test/val splits for all 12 datasets: subsets in which object-biased samples have been removed. For 6 activity recognition datasets, we also provide debiased-balanced splits that preserve the class distribution while removing the most object-biased samples.

Included datasets: 12 activity recognition and text-to-video retrieval benchmarks (see the full list on our webpage).

👉 Download and read more on our webpage.


Benchmarking Video Models

We benchmarked 30 video models for action classification and text-to-video retrieval on 12 datasets, using both original and UTD-debiased splits. We include models from VideoMAE, VideoMAE2, All-in-one, Unmasked Teacher (UMT), VideoMamba, and InternVid.

For action classification, we follow the VideoGLUE setup: a frozen backbone with a lightweight pooling head trained on top.
For text-to-video retrieval, we report zero-shot performance and follow model-specific reranking procedures.

⚡ Quick Start

We provide all model predictions, so accuracy metrics (e.g., Top-1, Recall@1) can be computed on all splits without re-running the models.

📥 Download predictions into ./videoglue_predictions/: GDrive link.

📥 Download UTD splits into ./UTD_splits/ following instructions on our project webpage.

Example:

Accuracy of videomae-B-UH on the full ucf test set:

python utd/videoglue/get_split_results.py \
  --pred_csv videoglue_predictions/ucf_videomae-B-UH/valid_49ep.csv  \
  --splits_path UTD_splits/splits_ucf_testlist01.json \
  --split_name full

on our UTD debiased-balanced split:

python utd/videoglue/get_split_results.py \
  --pred_csv videoglue_predictions/ucf_videomae-B-UH/valid_49ep.csv  \
  --splits_path UTD_splits/splits_ucf_testlist01.json \
  --split_name debiased-balanced

💡 You can also evaluate on your custom splits by providing a JSON file with the split IDs.
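
Before writing a custom split file, it may help to inspect one of the provided split JSONs and mirror its structure. Below is a minimal sketch; it assumes only that the file is standard JSON and prints whatever it contains:

import json

with open("UTD_splits/splits_ucf_testlist01.json") as f:
    splits = json.load(f)

# Inspect the top-level structure (e.g., which split names exist and how many
# sample IDs each holds) before mirroring it in a custom split file.
if isinstance(splits, dict):
    for name, ids in splits.items():
        size = len(ids) if hasattr(ids, "__len__") else ids
        print(name, type(ids).__name__, size)
else:
    print(type(splits).__name__, len(splits))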

⚙️ Benchmarking Setup

🛠️ Environment Setup

We tested the environment with CUDA 11.6 only. Set the CUDA path according to your system (needed for VideoMamba):

module load cuda/11.6

or

cuda_version=11.6
export PATH=/usr/lib/cuda-${cuda_version}/bin/:${PATH}
export LD_LIBRARY_PATH=/usr/lib/cuda-${cuda_version}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_PATH=/usr/lib/cuda-${cuda_version}/
export CUDA_HOME=/usr/lib/cuda-${cuda_version}/

Create python environment:

conda create -n videoglue python=3.9 -y
conda activate videoglue
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.6 -c pytorch -c conda-forge -y 
pip install -r requirements_videoglue.txt
cd third_party/VideoMamba
pip install -e causal-conv1d/ # only for videomamba
pip install -e mamba/ # only for videomamba
cd - 
pip install neptune==1.10.1 neptune-tensorboard==1.0.3
pip install -e . 
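
As an optional sanity check that the installed PyTorch build matches the CUDA toolkit set above (a minimal sketch):

import torch

print("torch:", torch.__version__)              # expected: 1.12.0
print("built with CUDA:", torch.version.cuda)   # expected: 11.6
print("CUDA available:", torch.cuda.is_available())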

🎬 Download Datasets

Follow DATASETS.md for instructions on preparing all 12 datasets.

💾 Download Model Weights

Download model weights from the official repositories (VideoMAE, VideoMAE2, All-in-one, UMT, VideoMamba, InternVid) and store them in pretrained/{videomae, videomaev2, umt, videomamba, internvid, allinone}.

The official checkpoints of All-in-one and InternVid require some internal dependencies to load.
We removed those dependencies and provide the cleaned models here:
🔗 Cleaned Checkpoints

🎯 Benchmarking Classification

We follow VideoGLUE and train a lightweight classifier head on frozen video features.
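
To make this protocol concrete, here is a minimal PyTorch sketch of such a head: features from a frozen backbone are mean-pooled and fed to a single linear layer, and only the head is optimized. It is an illustration only, not the exact head used in this repository; the feature dimension, pooling choice, and backbone call are placeholders.

import torch
import torch.nn as nn

class PoolingClassifierHead(nn.Module):
    """Mean-pool token features from a frozen backbone and classify with one linear layer."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_tokens, feat_dim) from the frozen video encoder
        return self.fc(self.norm(features.mean(dim=1)))

# Only the head receives gradients; the backbone stays frozen and in eval mode.
head = PoolingClassifierHead(feat_dim=768, num_classes=101)  # e.g., ViT-B features, UCF101
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3, weight_decay=0.05)

def head_logits(backbone: nn.Module, video: torch.Tensor) -> torch.Tensor:
    backbone.eval()
    with torch.no_grad():        # no gradients through the backbone
        feats = backbone(video)  # placeholder call; real backbones differ in their APIs
    return head(feats)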

⚡ Quick Start: Evaluation example (UCF101, VideoMAE-B-K400):

We release a set of trained lightweight classifiers:
📥 Download models into ./videoglue_models/: GDrive link.

Note: --debiased_splits_path is optional

dataset_name='ucf'
prefix='data/UCF101/original/videos/'
data_path='metadata/videoglue/ucf/'
dataset_type='Kinetics_sparse'
debiased_splits_path='UTD_splits/splits_ucf_testlist01.json'
nb_classes=101

backbone_name='videomae-B-K400'
backbone_parameters='--backbone videomae --model vit_base_patch16_224  --num_frames 16 --backbone_checkpoint pretrained/videomae/videomae_checkpoint_pretrain_kin400.pth'

export MASTER_PORT=$((12000 + $RANDOM % 20000))
export OMP_NUM_THREADS=1

export PYTHONPATH="./third_party/:$PYTHONPATH"

output_dir='videoglue_models/' # important: the trained classifier checkpoint is loaded from this directory

srun python utd/videoglue/train_classifier.py \
        ${backbone_parameters} \
        --exp_name ${dataset_name}_${backbone_name}  \
        \
        --data_path ${data_path} \
        --prefix ${prefix} \
        --data_set ${dataset_type} \
        --debiased_splits_path ${debiased_splits_path} \
        --split ',' \
        --nb_classes ${nb_classes} \
        --log_dir 'output/videoglue/logs' \
        --output_dir ${output_dir} \
        \
        --batch_size 64 \
        --val_batch_size_mul 0.5 \
        --num_sample 1 \
        --input_size 224 \
        --short_side_size 224 \
        --num_workers 16 \
        \
        --no_test \
        --dist_eval \
        --enable_deepspeed \
        --layer_decay 1.0 \
        \
        --eval

🧪 See scripts/eval_classifier.sh for more examples.

Training example (UCF101, VideoMAE-B-K400):

Note: --debiased_splits_path is optional

# Set neptune.ai keys (optional)
neptune_api_token=''
neptune_project=''

dataset_name='ucf'
prefix='data/UCF101/original/videos/'
data_path='metadata/videoglue/ucf/'
dataset_type='Kinetics_sparse'
nb_classes=101

backbone_name='videomae-B-K400'
backbone_parameters='--backbone videomae --model vit_base_patch16_224  --num_frames 16 --backbone_checkpoint pretrained/videomae/videomae_checkpoint_pretrain_kin400.pth'

export MASTER_PORT=12345
export OMP_NUM_THREADS=1

export PYTHONPATH="./third_party/:$PYTHONPATH"

srun python utd/videoglue/train_classifier.py \
        ${backbone_parameters} \
        --exp_name ${dataset_name}_${backbone_name}  \
        \
        --data_path ${data_path} \
        --prefix ${prefix} \
        --data_set ${dataset_type} \
        --debiased_splits_path UTD_splits/splits_ucf_testlist01.json \
        --split ',' \
        --nb_classes ${nb_classes} \
        --log_dir 'output/videoglue/logs' \
        --output_dir 'output/videoglue/models' \
        \
        --ep 50 \
        --lr 0.001 \
        --weight_decay 0.05 \
        --batch_size 64 \
        --val_batch_size_mul 0.5 \
        --num_sample 1 \
        --input_size 224 \
        --short_side_size 224 \
        --num_workers 16 \
        \
        --aspect_ratio 0.5 2.0 \
        --area_ratio 0.3 1.0 \
        --aa rand-m9-n2-mstd0.5 \
        --reprob 0 \
        --mixup 0.8 \
        --cutmix 1.0 \
        \
        --warmup_epochs 5 \
        --opt adamw \
        --opt_betas 0.9 0.999 \
        \
        --no_test \
        --dist_eval \
        --enable_deepspeed \
        --layer_decay 1.0 \
        \
        --enable_neptune \
        --neptune_api_token ${neptune_api_token} \
        --neptune_project ${neptune_project}
         # omit --enable_neptune, --neptune_api_token, and --neptune_project to disable neptune logging

🧪 We use the same hyperparameters across all models/datasets. See scripts/train_classifier.sh for more examples.

🎯 Benchmarking Retrieval

For text-to-video retrieval, we report zero-shot performance and follow model-specific reranking procedures.

Example (msrvtt, umt-b-5M):

Note: --debiased_splits_path is optional

export PYTHONPATH="./third_party/:$PYTHONPATH"
python utd/videoglue/eval_retrieval.py \
      --backbone_name umt-b-5M \
      --backbone umt \
      --model vit_b16 \
      --backbone_checkpoint pretrained/umt/b16_5m.pth \
      --dataset msrvtt \
      --split test \
      --debiased_splits_path UTD_splits/splits_msrvtt_test.json \
      --dataset_root data/msrvtt/ \
      --dataset_metadata_root  metadata/msrvtt/ \
      --encoder_batch_size 32 --fusion_batch_size 32 \
      --output_root 'output/retrieval'

🧪 See scripts/evaluate_retrieval.sh for the commands for all models and datasets.
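
For reference, zero-shot retrieval metrics such as Recall@K are derived from the text-video similarity matrix. The snippet below is a generic NumPy illustration of that computation, not the reranking-aware evaluation implemented in eval_retrieval.py:

import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """sim[i, j]: similarity of text query i to video j; video i is the correct match for query i."""
    top_k = (-sim).argsort(axis=1)[:, :k]  # indices of the k best-matching videos per query
    hits = (top_k == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

sim = np.random.default_rng(0).normal(size=(1000, 1000))  # placeholder similarity matrix
for k in (1, 5, 10):
    print(f"R@{k}: {recall_at_k(sim, k):.3f}")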

🧠 Unbiasing Through Textual Descriptions

🛠️ Set Up Environment

conda create -n utd python=3.10 -y
conda activate utd
pip install -r requirements_utd.txt
cd third_party/LLaVA
pip install -e . 
cd -
pip install protobuf==4.24.4 ffmpeg-python==0.2.0 scikit-image opencv_python==4.6.0.66 decord==0.6.0 av==10.0.0 sentence-transformers==2.7.0
pip install -e .

⚡ Quick Start

In our UTD dataset, we provide textual descriptions for all 12 datasets. For a quick start, you can use them to compute common sense representation bias.

📥 Download UTD descriptions into ./UTD_descriptions/ following instructions on our project webpage.
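
The descriptions come as pickle files named {dataset}_{split}_{concept}.pickle, the naming used in the commands below. A quick way to peek at a downloaded file without assuming a fixed schema (a minimal sketch):

import pickle

with open("UTD_descriptions/msrvtt_test_objects.pickle", "rb") as f:
    descriptions = pickle.load(f)

# Print the container type, its size, and a few entries.
print(type(descriptions).__name__, len(descriptions))
if isinstance(descriptions, dict):
    for video_id in list(descriptions)[:3]:
        print(video_id, "->", descriptions[video_id])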

Example: compute common sense bias for MSRVTT (objects, seq_of_f aggregation):

Note: use the --retrieval flag only for text-to-video retrieval datasets.

dataset="msrvtt"
split="test"
concept="objects"
temporal_aggregation_type="seq_of_f"

python utd/utd/compute_common_sense_bias.py \
  --retrieval \
  --dataset ${dataset} --split ${split} \
  --temporal_aggregation_type ${temporal_aggregation_type} \
  --concept ${concept} \
  --text_descriptions UTD_descriptions/${dataset}_${split}_${concept}.pickle \
  --output_path  output/bias/${dataset}_${split}_${concept}_${temporal_aggregation_type}.csv

🎬 Download Datasets

Follow DATASETS.md for instructions on preparing all 12 datasets.

📝 Generating Textual Descriptions

We release the full code to generate textual descriptions (i.e., our UTD-descriptions) and to estimate different types of representation bias.

1️⃣ Generate VLM Descriptions

We use LLaVA-1.6-Mistral-7B to generate detailed frame-level descriptions (objects+composition+activities), which include objects, composition, relationships, and activities.
For example, to process the MSRVTT test split:

dataset='msrvtt'
split='test'
dataset_root='data/msrvtt/'
python utd/utd/vlm_descriptions_llava.py \
  --dataset ${dataset} --split ${split} --dataset_root ${dataset_root} --metadata_root metadata/${dataset} \
  --output_path output/UTD_descriptions/${dataset}_${split}_objects+composition+activities.pickle

📜 See commands for all datasets in scripts/1_run_vlm.sh.

💡 Tip: To process large splits efficiently (e.g., kinetics_700 train), use the --num_data_chunks and --chunk_id arguments to split the input data into chunks and run multiple jobs in parallel, each processing a different chunk.
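
One simple way to fan such a run out from Python is sketched below; it reuses the MSRVTT command above and only adds the chunking flags (a sketch only, assuming you have enough GPUs or scheduler slots to run the chunks concurrently):

import subprocess

NUM_CHUNKS = 4
base_cmd = [
    "python", "utd/utd/vlm_descriptions_llava.py",
    "--dataset", "msrvtt", "--split", "test",
    "--dataset_root", "data/msrvtt/", "--metadata_root", "metadata/msrvtt",
    "--output_path", "output/UTD_descriptions/msrvtt_test_objects+composition+activities.pickle",
    "--num_data_chunks", str(NUM_CHUNKS),
]

# Launch one process per chunk; replace Popen with your cluster's submit command if needed.
procs = [subprocess.Popen(base_cmd + ["--chunk_id", str(chunk_id)]) for chunk_id in range(NUM_CHUNKS)]
for p in procs:
    p.wait()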

2️⃣ Extract Concepts with LLM

We use an LLM to extract individual concepts such as objects, activities, and verbs.

Step 1: Extract objects or activities:

dataset='msrvtt'
split='test'
concept="objects"
#or concept="activities"
python utd/utd/llm_extract_concepts.py  \
  --concept ${concept} \
  --input_description_path output/UTD_descriptions/${dataset}_${split}_objects+composition+activities.pickle \
  --output_path output/UTD_descriptions/${dataset}_${split}_${concept}.pickle \
  --save_raw_llm_output_path output/UTD_descriptions/raw_${dataset}_${split}_${concept}.pickle

Step 2: Extract verbs from raw activities output (without parsing):

dataset='msrvtt'
split='test'
concept="verbs"
python utd/utd/llm_extract_concepts.py  \
  --concept ${concept} \
  --input_description_path output/UTD_descriptions/raw_${dataset}_${split}_activities.pickle \
  --output_path output/UTD_descriptions/${dataset}_${split}_${concept}.pickle

Step 3: Generate concise 15-word summaries of the objects+composition+activities descriptions, which are used for sequence-level reasoning in the objects+composition+activities setup:

dataset='msrvtt'
split='test'
concept='objects+composition+activities_15_words'
python utd/utd/llm_extract_concepts.py  \
  --concept ${concept} \
  --input_description_path output/UTD_descriptions/${dataset}_${split}_objects+composition+activities.pickle \
  --output_path output/UTD_descriptions/${dataset}_${split}_${concept}.pickle

📜 See commands for all datasets in scripts/2_run_llm.sh.

💡 Tip: To process large splits efficiently (e.g., kinetics_700 train), use the --num_data_chunks and --chunk_id arguments to split the input data into chunks and run multiple jobs in parallel, each processing a different chunk.

📈 Compute Representation Bias

1️⃣ Common Sense Bias

We use SFR-Embedding-Mistral for reasoning over textual descriptions.

Available concepts: objects, activities, verbs, objects+composition+activities, objects+composition+activities_15_words

Aggregation types: middle_f, max_score_f, avg_over_f, seq_of_f

Example (MSRVTT, objects, seq_of_f):

Note: use the --retrieval flag only for text-to-video retrieval datasets.

dataset="msrvtt"
split="test"
concept="objects"
temporal_aggregation_type="seq_of_f"

python utd/utd/compute_common_sense_bias.py \
  --retrieval \
  --dataset ${dataset} --split ${split} \
  --temporal_aggregation_type ${temporal_aggregation_type} \
  --concept ${concept} \
  --text_descriptions output/UTD_descriptions/${dataset}_${split}_${concept}.pickle \
  --output_path  output/bias/${dataset}_${split}_${concept}_${temporal_aggregation_type}.csv

📜 See commands for all datasets in scripts/3_common_sense_bias.sh.
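
Conceptually, this bias reflects how often the correct label (or paired caption, for retrieval) can already be identified from the textual descriptions alone, without looking at the video. The toy snippet below illustrates such text-only matching using sentence-transformers as a lightweight stand-in for SFR-Embedding-Mistral; it is not the repository's implementation, which uses the concepts and frame-level aggregation types listed above.

from sentence_transformers import SentenceTransformer, util

# Toy example: can the correct class be guessed from a frame-level object description alone?
# (Stand-in embedding model and made-up data; the repository uses SFR-Embedding-Mistral
# and aggregates scores over the 8 sampled frames per video.)
model = SentenceTransformer("all-MiniLM-L6-v2")

class_names = ["playing guitar", "riding a horse", "surfing", "making pizza"]
frame_description = "a man holding an acoustic guitar on a stage next to a microphone"

class_emb = model.encode(class_names, convert_to_tensor=True, normalize_embeddings=True)
desc_emb = model.encode(frame_description, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(desc_emb, class_emb)[0]
print("predicted from text alone:", class_names[int(scores.argmax())])
# If text-only predictions like this are frequently correct, the sample is object-biased.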

2️⃣ Dataset Bias (Classification Datasets Only)

We compute dataset bias as the performance of a linear classifier trained on textual embeddings extracted from the training set.

Steps:

  1. Extract and save text embeddings
  2. Train a linear classifier on those embeddings

📜 See the full pipeline in scripts/4_dataset_bias.sh.
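
As a rough illustration of these two steps, a linear probe on text embeddings can look like the sketch below; placeholder random arrays and scikit-learn stand in for the actual embeddings and training code in scripts/4_dataset_bias.sh:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: rows are videos, columns are dimensions of the text embeddings
# extracted in step 1 (shapes and class count are arbitrary here).
rng = np.random.default_rng(0)
train_x, train_y = rng.normal(size=(2000, 512)), rng.integers(0, 101, size=2000)
test_x, test_y = rng.normal(size=(500, 512)), rng.integers(0, 101, size=500)

# Dataset bias ~ how well class labels can be predicted from text embeddings alone.
clf = LogisticRegression(max_iter=1000).fit(train_x, train_y)
print("text-only top-1 accuracy:", clf.score(test_x, test_y))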

📄 Acknowledgments and Licenses

The code for LLaVA inference is partly derived from LLaVA, which is licensed under the Apache License 2.0.
The code in utd/videoglue is partly based on Unmasked Teacher, which is licensed under the MIT License. Licenses for third-party code in ./third_party/ are provided in the corresponding subfolders.
All other code is licensed under the MIT License. Full license terms are available in the LICENSE file.

📚 Citation

If you use this code in your research, please cite:

@inproceedings{shvetsova2025utd,
  title     = {Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks},
  author    = {Shvetsova, Nina and Nagrani, Arsha and Schiele, Bernt and Kuehne, Hilde and Rupprecht, Christian},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2025}
}
