Zeqian Li, Qirui Chen, Tengda Han, Ya Zhang, Yanfeng Wang, Weidi Xie
[Project Page] [arXiv] [Dataset]
```shell
conda create --name align python=3.10.0
conda activate align
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt
```
```shell
wandb login ***
```

- HowToStep: An automatically generated dataset that transforms ASR transcripts into descriptive steps by prompting an LLM, and then aligns the steps to the video through a two-stage determination procedure.
- HowTo100M: WhisperX ASR output and InternVideo & CLIP-L14 visual features for HowTo100M.
- HTM-Align: A manually annotated 80-video subset for narration alignment evaluation.
- HT-Step: A collection of temporal annotations on videos from the HowTo100M dataset for procedural step grounding evaluation.
Some of the main components are:
- [./configs]: Configs for training or inference.
- [./src/data]: Scripts for feature extraction.
- [./src/dataset]: Data loader.
- [./src/model]: Our main model with all its building blocks.
- [./src/trainer]: Our code for training and inference.
- [./src/utils]: Utility functions for training, inference, and visualization.
In [./src/data], we provide feature extraction scripts for extracting visual and textual features using InternVideo-MM-L14.
Modify the paths to the raw videos, HowTo100M, HTM-Align, and HT-Step in each script; the visual or textual features can then be extracted via:
```shell
python ./src/data/extract_textualfeature.py
python ./src/data/extract_visualfeature.py
```
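Extraction scripts like these typically pool frame-level features into one vector per second before saving them. The sketch below illustrates that pooling step only; the function name, the sampling rate, and the feature layout are assumptions for illustration, not the repository's actual API.

```python
import numpy as np

def pool_frame_features(frame_feats: np.ndarray, fps: int = 16) -> np.ndarray:
    """Average-pool per-frame features into one feature vector per second.

    frame_feats: (num_frames, dim) array of frame-level features.
    fps: frames sampled per second (hypothetical value).
    Returns an array of shape (num_seconds, dim); a trailing partial
    second is pooled over however many frames it contains.
    """
    num_frames, dim = frame_feats.shape
    num_seconds = int(np.ceil(num_frames / fps))
    pooled = np.zeros((num_seconds, dim), dtype=frame_feats.dtype)
    for s in range(num_seconds):
        chunk = frame_feats[s * fps:(s + 1) * fps]
        pooled[s] = chunk.mean(axis=0)
    return pooled
```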
Modify the paths in dataset.py, mainly those for the visual/textual features and the annotation files. For the procedural step grounding task, set 'text_shuffle' to True and 'text_pe' to False in htm.yaml; for the narration alignment task, set 'text_shuffle' to False and 'text_pe' to True.
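For the step grounding setting, the relevant part of htm.yaml would look roughly like the fragment below (only the two options named above are shown; all other fields are omitted, and the exact key placement in the file may differ):

```yaml
# Hypothetical fragment of configs/htm.yaml
text_shuffle: True   # True for procedural step grounding; False for narration alignment
text_pe: False       # False for step grounding; True for narration alignment
```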
The training command is:
```shell
python main.py --gpu 0 --config_file configs/htm.yaml --run_name train
```
For inference, you only need to change 'checkpoint' in htm.yaml to the path of the trained model; the settings for 'text_shuffle' and 'text_pe' are the same as during training.
The inference command is:
```shell
python main.py --gpu 0 --config_file configs/htm.yaml --run_name eval
```
[Optional] Evaluating Our Pre-trained Models
We also provide pre-trained models for HTM-Align and HT-Step. The models, together with all training logs, can be downloaded from NSVA.
The results should be:
| Model Name | Task | Evaluation Dataset | Performance |
|---|---|---|---|
| NSVA_narration.pth | Narration Alignment | HTM-Align | 69.8 |
| NSVA_step.pth | Procedural Step Grounding | HT-Step | 47.0 |
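The alignment scores above are commonly reported as R@1: a narration (or step) counts as correctly grounded if the model's predicted timestamp falls inside its ground-truth segment, averaged over the alignable items. A minimal sketch of that metric, assuming a simple list-based data layout (the function name and input format are illustrative, not the repository's evaluation code):

```python
def recall_at_1(preds, gts):
    """R@1 for timestamp-in-segment grounding.

    preds: list of predicted timestamps (seconds), one per narration/step.
    gts:   list of (start, end) ground-truth segments in seconds, or None
           when the item is not alignable (such items are skipped).
    """
    hits, total = 0, 0
    for t, seg in zip(preds, gts):
        if seg is None:
            continue  # unalignable narration: excluded from the metric
        total += 1
        start, end = seg
        if start <= t <= end:
            hits += 1
    return hits / total if total else 0.0
```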
If you are using our code, please consider citing our paper.
```bibtex
@inproceedings{li2024multi,
  title={Multi-Sentence Grounding for Long-term Instructional Video},
  author={Li, Zeqian and Chen, Qirui and Han, Tengda and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
  booktitle={European Conference on Computer Vision},
  pages={200--216},
  year={2024},
  organization={Springer}
}
```

If you have any questions, please feel free to contact lzq0103@sjtu.edu.cn.