SKI Models: SKeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living (AAAI 2025)
This is the official repository of 'SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living' (AAAI 2025).
Abstract: The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes.
In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.
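At a high level, SKI models collaboratively train a skeleton-language model alongside the video encoder so that skeleton cues are absorbed into the shared vision-language embedding space, which is why the skeleton branch can be dropped at inference. The snippet below is only an illustrative sketch of such an alignment objective; the loss terms, encoder outputs, and temperature are assumptions for exposition, not the exact objective used in the paper.

# Illustrative alignment between video, skeleton, and text embeddings (all [batch, dim]).
# NOT the paper's exact loss; it only sketches pulling video features toward
# skeleton-language embeddings while keeping a CLIP-style video-text alignment.
import torch
import torch.nn.functional as F

def illustrative_alignment_loss(video_emb, skeleton_emb, text_emb, temperature=0.07):
    video_emb = F.normalize(video_emb, dim=-1)
    skeleton_emb = F.normalize(skeleton_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # CLIP-style video-text contrastive term.
    loss_vt = F.cross_entropy(video_emb @ text_emb.t() / temperature, targets)
    # Distillation-style term: pull video features toward the skeleton-language embeddings.
    loss_sv = 1.0 - F.cosine_similarity(video_emb, skeleton_emb.detach()).mean()
    return loss_vt + loss_sv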
This codebase is tested on Ubuntu 20.04.2 LTS with Python 3.8. Follow the steps below to create the environment and install the dependencies.
- Setup conda environment (recommended).
# Create a conda environment
conda create -y -n ski_env python=3.8
# Activate the environment
conda activate ski_env
# Install requirements
pip install -r requirements.txt
- Install Apex to enable mixed-precision training.
NOTE: Make sure your system CUDA version matches the CUDA version PyTorch was built with, otherwise Apex will not install properly.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
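Before building Apex, you can sanity-check that the system CUDA toolkit matches the CUDA version PyTorch was built with. The check below is a small optional helper, not part of the official setup:

# Optional sanity check: compare PyTorch's CUDA version with the system toolkit (nvcc).
import subprocess
import torch

print("PyTorch built with CUDA:", torch.version.cuda)
try:
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    release = [line for line in out.splitlines() if "release" in line]
    print("System nvcc reports:", release[0].strip() if release else out.strip())
except FileNotFoundError:
    print("nvcc not found on PATH; install a CUDA toolkit matching torch.version.cuda")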
We use the NTU RGB+D and NTU RGB+D 120 datasets. The zero-shot train and test splits are provided in the labels directory. Based on these splits, create the train and test CSV files in the following format:
path_to_video_1,path_to_video_1_skeleton,label_1
path_to_video_2,path_to_video_2_skeleton,label_2
...
path_to_video_N,path_to_video_N_skeleton,label_N
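A small helper like the one below can assemble these CSVs from the split files. The split-file format, file extensions, and directory layout used here are placeholders that you will need to adapt to your own data:

# Hypothetical helper for building the 'video,skeleton,label' CSVs; adapt paths and parsing.
import csv
import os

def write_split_csv(split_file, video_root, skeleton_root, out_csv):
    # Assumes each line of split_file is "<sample_id>,<label>" (an assumption; adjust as needed).
    with open(split_file) as f, open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        for line in f:
            sample_id, label = line.strip().split(",")
            video_path = os.path.join(video_root, sample_id + ".mp4")
            skeleton_path = os.path.join(skeleton_root, sample_id + ".npy")
            writer.writerow([video_path, skeleton_path, label])

# Example usage (paths are placeholders):
# write_split_csv("labels/zero_shot_train.txt", "data/videos", "data/skeletons", "train.csv")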
For all experiments, we provide config files in the configs directory. For example, to train SKI-VLM on NTU, after setting the paths in the config file and the bash scripts, run the following commands:
bash scripts/train_p1_SkeletonCLIP.sh
bash scripts/train_p2_SKIViFiCLIP.sh
To evaluate a model, use the appropriate config file with the corresponding model weights and run the command below:
bash scripts/eval_SKIViFiCLIP.sh
For SKI-LVLM, we provide a script to evaluate the model on the Charades dataset. Please refer to the SKI-LVLM README for more details.
If you use our approach (code or methods) in your research, please consider citing:
@inproceedings{sinha2025skimodels,
title={SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living},
author={Arkaprava Sinha and Dominick Reilly and Francois Bremond and Pu Wang and Srijan Das},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
year={2025}
}
We sincerely thank the authors of ViFi-CLIP, Hyperformer, and LLAVIDAL for providing their codebases.
