SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction [ACL 2025]
This repository contains research code for the paper SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction.
We introduce SHuBERT (Sign Hidden-Unit BERT), a self-supervised contextual representation model learned from approximately 1,000 hours of American Sign Language video. SHuBERT adapts masked token prediction objectives to multi-stream visual sign language input, learning to predict multiple targets corresponding to clustered hand, face, and body pose streams. SHuBERT achieves state-of-the-art performance across multiple tasks including sign language translation, isolated sign language recognition, and fingerspelling detection.
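For intuition, here is a minimal PyTorch sketch of the multi-stream masked cluster prediction idea: per-frame features from each stream are projected into a shared Transformer encoder, masked time steps are replaced by a learned embedding, and a separate head predicts each stream's cluster assignment. This is an illustrative sketch only, not the paper's implementation; the fusion by summed projections, the model dimensions, and the cluster vocabulary size are all assumptions.

```python
import torch
import torch.nn as nn

class MultiStreamMaskedClusterPredictor(nn.Module):
    """Illustrative sketch (not the paper's code): a shared Transformer
    encoder over fused per-frame stream features, with one cluster-
    classification head per stream (e.g., hands, face, body pose)."""

    def __init__(self, stream_dims, d_model=768, num_clusters=500,
                 num_layers=12, num_heads=12):
        super().__init__()
        # One linear projection per input stream into the shared model dim.
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in stream_dims])
        # Learned embedding substituted at masked time steps.
        self.mask_emb = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # One prediction head per stream over that stream's cluster vocabulary.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, num_clusters) for _ in stream_dims])

    def forward(self, streams, mask):
        # streams: list of (B, T, d_i) per-stream features; mask: (B, T) bool.
        x = sum(p(s) for p, s in zip(self.proj, streams))  # fuse streams (assumption)
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        h = self.encoder(x)                                 # (B, T, d_model)
        return [head(h) for head in self.heads]             # per-stream logits
```

Training would then minimize a cross-entropy loss between each head's logits at masked positions and the corresponding stream's cluster labels.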
We provide installation and inference instructions in QUICKSTART.md.
We describe how to prepare the datasets in DATASETS.md.
Please download the SHuBERT weights (as well as the DINO Face and Hand weights) from the link.
We describe how to extract features from the pretrained model in FEATURES.md.
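As a rough illustration (reusing the sketch class above), extracting contextual features from a pretrained checkpoint might look like the following. The stream dimensionalities, checkpoint filename, and state-dict key are hypothetical; FEATURES.md documents the actual interface.

```python
import torch

# Illustrative only: stream dims and checkpoint name are assumptions;
# see FEATURES.md for the repository's actual feature-extraction interface.
T = 64  # number of video frames
stream_dims = [384, 384, 384]  # e.g., hand / face / body-pose feature sizes
streams = [torch.randn(1, T, d) for d in stream_dims]  # dummy per-stream input

model = MultiStreamMaskedClusterPredictor(stream_dims)  # class sketched above
# ckpt = torch.load("shubert.pt", map_location="cpu")   # hypothetical filename
# model.load_state_dict(ckpt["model"])                  # hypothetical key
model.eval()

with torch.no_grad():
    x = sum(p(s) for p, s in zip(model.proj, streams))  # fuse stream features
    features = model.encoder(x)  # (1, T, 768) contextual features
```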
- To launch SHuBERT pretraining on a SLURM cluster, run `sbatch train_shubert.sh`.
TODO
If you find our work useful in your research, please consider citing:
```bibtex
@inproceedings{gueuwou-etal-2025-shubert,
title = "SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction",
author = "Gueuwou, Shester and Du, Xiaodan and Shakhnarovich, Greg and Livescu, Karen and Liu, Alexander H.",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
}
```

This codebase is heavily influenced by the DinoSR and Fairseq repositories.
This project is primarily released under the MIT license.
