SKI Models: SKeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living (AAAI 2025)
This is the official repository of 'SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living' (AAAI 2025).
Abstract: The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes.
In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.
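At a high level, SKI models collaboratively train a skeleton-language model alongside the video encoder so that skeleton cues are absorbed into the shared vision-language embedding space, which is why the skeleton branch can be dropped at inference. The snippet below is only an illustrative sketch of such an alignment objective; the loss terms, encoder outputs, and temperature are assumptions for exposition, not the exact objective used in the paper.

# Illustrative alignment between video, skeleton, and text embeddings (all [batch, dim]).
# NOT the paper's exact loss; it only sketches pulling video features toward
# skeleton-language embeddings while keeping a CLIP-style video-text alignment.
import torch
import torch.nn.functional as F

def illustrative_alignment_loss(video_emb, skeleton_emb, text_emb, temperature=0.07):
    video_emb = F.normalize(video_emb, dim=-1)
    skeleton_emb = F.normalize(skeleton_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # CLIP-style video-text contrastive term.
    loss_vt = F.cross_entropy(video_emb @ text_emb.t() / temperature, targets)
    # Distillation-style term: pull video features toward the skeleton-language embeddings.
    loss_sv = 1.0 - F.cosine_similarity(video_emb, skeleton_emb.detach()).mean()
    return loss_vt + loss_sv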
This codebase is tested on Ubuntu 20.04.2 LTS with Python 3.8. Follow the steps below to create the environment and install the dependencies.
- Setup conda environment (recommended).
# Create a conda environment
conda create -y -n ski_env python=3.8
# Activate the environment
conda activate ski_env
# Install requirements
pip install -r requirements.txt
- Install Apex to enable mixed-precision training.
NOTE: Make sure your system CUDA version matches the CUDA version PyTorch was built with, otherwise Apex will not install properly.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
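Before building Apex, you can sanity-check that the system CUDA toolkit matches the CUDA version PyTorch was built with. The check below is a small optional helper, not part of the official setup:

# Optional sanity check: compare PyTorch's CUDA version with the system toolkit (nvcc).
import subprocess
import torch

print("PyTorch built with CUDA:", torch.version.cuda)
try:
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    release = [line for line in out.splitlines() if "release" in line]
    print("System nvcc reports:", release[0].strip() if release else out.strip())
except FileNotFoundError:
    print("nvcc not found on PATH; install a CUDA toolkit matching torch.version.cuda")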
We use the NTU RGB+D and NTU RGB+D 120 datasets. The zero-shot train and test splits are provided in the labels directory. Based on these splits, create the train and test CSV files in the following format:
path_to_video_1,path_to_video_1_skeleton,label_1
path_to_video_2,path_to_video_2_skeleton,label_2
...
path_to_video_N,path_to_video_N_skeleton,label_N
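A small helper like the one below can assemble these CSVs from the split files. The split-file format, file extensions, and directory layout used here are placeholders that you will need to adapt to your own data:

# Hypothetical helper for building the 'video,skeleton,label' CSVs; adapt paths and parsing.
import csv
import os

def write_split_csv(split_file, video_root, skeleton_root, out_csv):
    # Assumes each line of split_file is "<sample_id>,<label>" (an assumption; adjust as needed).
    with open(split_file) as f, open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        for line in f:
            sample_id, label = line.strip().split(",")
            video_path = os.path.join(video_root, sample_id + ".mp4")
            skeleton_path = os.path.join(skeleton_root, sample_id + ".npy")
            writer.writerow([video_path, skeleton_path, label])

# Example usage (paths are placeholders):
# write_split_csv("labels/zero_shot_train.txt", "data/videos", "data/skeletons", "train.csv")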
For all experiments, we provide config files in the configs directory. For example, to train SKI-VLM on NTU, after setting the paths in the config file and the bash scripts, run the following commands:
bash scripts/train_p1_SkeletonCLIP.sh
bash scripts/train_p2_SKIViFiCLIP.sh
To evaluate a model, use the appropriate config file with the corresponding model weights and run the command below:
bash scripts/eval_SKIViFiCLIP.sh
For SKI-LVLM, we provide a script to evaluate the model on the Charades dataset. Please refer to the SKI-LVLM README for more details.
If you use our approach (code or methods) in your research, please consider citing:
@inproceedings{sinha2025skimodels,
title={SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living},
author={Arkaprava Sinha and Dominick Reilly and Francois Bremond and Pu Wang and Srijan Das},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
year={2025}
}
We sincerely thank the authors of ViFi-CLIP, Hyperformer, and LLAVIDAL for providing their codebases.
