SKI Models: SKeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living (AAAI 2025)


This is the official repository of 'SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living' (AAAI 2025).

[Figure: Overview of SKI Models]

Abstract: The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes.
In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.

Installation

This codebase is tested on Ubuntu 20.04.2 LTS with Python 3.8. Follow the steps below to create the environment and install the dependencies.

  • Set up the conda environment (recommended).
# Create a conda environment
conda create -y -n ski_env python=3.8
# Activate the environment
conda activate ski_env
# Install requirements
pip install -r requirements.txt
  • Install Apex to enable mixed-precision training.

NOTE: Make sure the system CUDA version matches the CUDA version PyTorch was built with; otherwise Apex will not install properly.

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
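Before building Apex, you can quickly confirm that the two CUDA versions match. A minimal check, assuming PyTorch is already installed and nvcc is on your PATH:

import subprocess
import torch

# CUDA version PyTorch was built against
print("PyTorch CUDA:", torch.version.cuda)
# CUDA version of the system toolkit, as reported by nvcc
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)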

SKI-VLM

Data Preparation

We use the NTU RGB+D and NTU RGB+D 120 datasets. The zero-shot train and test splits are provided in the labels directory. Based on these splits, create the train and test CSV files in the following format:

path_to_video_1,path_to_video_1_skeleton,label_1
path_to_video_2,path_to_video_2_skeleton,label_2
...
path_to_video_N,path_to_video_N_skeleton,label_N
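As a starting point, the CSV files can be generated with a short script. The sketch below assumes a hypothetical layout (video files under videos/, skeleton files under skeletons/, and a split file with one "<video_id> <label>" pair per line); adjust the paths and file extensions to your setup:

import csv

# Hypothetical split file: one "<video_id> <label>" pair per line.
with open("labels/train_split.txt") as splits, \
     open("train.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for line in splits:
        video_id, label = line.strip().split()
        # Columns: video path, skeleton path, label
        writer.writerow([f"videos/{video_id}.mp4",
                         f"skeletons/{video_id}.npy",
                         label])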

Training

For all experiments, we provide config files in the configs directory. For example, to train SKI-VLM on NTU RGB+D, set the paths in the config file and bash scripts, then run the following commands:

bash scripts/train_p1_SkeletonCLIP.sh
bash scripts/train_p2_SKIViFiCLIP.sh
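For intuition, the two-phase recipe first trains the skeleton-language model SkeletonCLIP (phase 1), then uses its skeleton embeddings to guide the video encoder during vision-language fine-tuning (phase 2). Below is a conceptual sketch of a phase-2 objective; it is not the repository's actual implementation, and the loss weighting, temperature, and function names are assumptions:

import torch
import torch.nn.functional as F

def ski_phase2_loss(video_emb, text_emb, skel_emb, labels, alpha=0.5):
    # Standard CLIP-style contrastive loss between video and text embeddings;
    # labels[i] indexes the text matching video i (arange(batch) in practice)
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / 0.07  # temperature is an assumption
    contrastive = F.cross_entropy(logits, labels)
    # Alignment term pulling video embeddings toward the (frozen) SkeletonCLIP
    # skeleton embeddings, so skeletons are not needed at inference time
    skel_emb = F.normalize(skel_emb.detach(), dim=-1)
    distill = 1 - (video_emb * skel_emb).sum(dim=-1).mean()
    return contrastive + alpha * distill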

Evaluating models

To evaluate a model, use the appropriate config file and the corresponding model weights, then run the command below:

bash scripts/eval_SKIViFiCLIP.sh
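Zero-shot evaluation follows the usual CLIP recipe: embed each unseen class name as a text prompt and pick the class whose text embedding is closest to the video embedding. A generic sketch of that final step, assuming precomputed embeddings rather than the script's exact interface:

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_predict(video_embs, class_text_embs):
    # Cosine similarity between each video embedding (N, D) and every
    # class text embedding (C, D); returns the predicted class index per video
    video_embs = F.normalize(video_embs, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    return (video_embs @ class_text_embs.t()).argmax(dim=-1)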

SKI-LVLM

For SKI-LVLM, we provide a script to evaluate the model on the Charades dataset. Please refer to the SKI-LVLM README for more details.

Citation

If you use our approach (code or methods) in your research, please consider citing:

@inproceedings{sinha2025skimodels,
   title={SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living}, 
   author={Arkaprava Sinha and Dominick Reilly and Francois Bremond and Pu Wang and Srijan Das},
   booktitle={Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
   year={2025}
}

Acknowledgements

We sincerely thank the authors of ViFi-CLIP, Hyperformer, and LLAVIDAL for providing the codebases.
