This repository contains the official implementation of "The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion".
Language of Motion (LoM) is a framework that models human motion generation as a sequence modeling problem using language models. It decomposes the human body into separate regions (face, hands, upper, and lower body) to effectively capture and generate natural human movements from various modalities such as text and audio.
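For intuition, the compositional idea can be sketched as follows: each body region is tokenized independently, and the per-frame codes can be interleaved into a single sequence for a language model. This is an illustrative sketch only; the region order and token layout are assumptions, not the repository's actual scheme.

```python
# Sketch: interleave per-region token streams frame by frame.
# Region order and token values are illustrative, not LoM's actual layout.

REGIONS = ["face", "hands", "upper", "lower"]

def interleave_regions(tokens_by_region, regions=REGIONS):
    """tokens_by_region: dict region -> list of per-frame token ids."""
    n_frames = len(tokens_by_region[regions[0]])
    assert all(len(tokens_by_region[r]) == n_frames for r in regions)
    seq = []
    for t in range(n_frames):
        for r in regions:
            seq.append(tokens_by_region[r][t])
    return seq

tokens = {"face": [1, 2], "hands": [3, 4], "upper": [5, 6], "lower": [7, 8]}
print(interleave_regions(tokens))  # [1, 3, 5, 7, 2, 4, 6, 8]
```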
- Initial code release
- Inference code for text-to-motion
- Inference code for co-speech gesture generation
- Tokenizer training code
- AMASS and LibriSpeech preprocessing code
- Evaluation benchmark results
- Text-to-motion results in rotation format
- Language model training code
We use Conda for environment management. Follow these steps to set up the development environment:
```bash
# Create and activate the conda environment
conda create --name lom -y python=3.10
conda activate lom

# Install PyTorch with CUDA support
conda install pytorch==2.4.0 torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# Alternative for RTX 5090 users: install the PyTorch nightly build instead
# pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

# Install pip and dependencies
python -m pip install pip==21.3
pip install -r requirements.txt

# Install additional packages
pip install turbot5 -U
# Alternative for RTX 5090 users: upgrade Triton to support the new architecture
# pip install --upgrade "git+https://github.com/openai/triton.git@main#egg=triton&subdirectory=python"
# export TRITON_JIT_CUDA_ARCHITECTURES=$(
# python - <<'EOF'
# import torch
# p = torch.cuda.get_device_properties(0)
# print(f"{p.major}{p.minor}")
# EOF
# )

# Install NLP tools
python -m spacy download en_core_web_sm

# Set up fairseq (required for some components)
mkdir -p third_party
cd third_party
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
cd ../..

# Resolve version conflicts
pip install --upgrade "omegaconf>=2.2,<2.4" "hydra-core>=1.3,<1.4"
```

We use TEMOS for rendering. Install it with our provided script:
```bash
# Execute the setup script to install Blender and its dependencies
chmod +x setup_blender.sh
./setup_blender.sh
```

This script will:
- Download and extract Blender 2.93.18
- Verify the Blender Python path
- Install all necessary Python packages for rendering
Please register an account on the Max Planck Institute for Intelligent Systems (MPI-IS) website to access the necessary SMPLX models. Then download the SMPLX models, Hubert, T5, and T2M metrics computation checkpoints by running the following script:
```bash
chmod +x build_resources.sh
./build_resources.sh
```

After running the script, you will have the following directory structure:
```
model_files/
├── hubert_models/   # Hubert audio tokenizer models
├── smplx_models/    # SMPLX body models
├── FLAME2020/       # FLAME face models
├── t2m_evaluators/  # Text-to-Motion evaluation metrics
└── t5_models/       # T5 language models
```
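To sanity-check the download, a small helper (ours, not part of the repo) can report any expected resource folder that is missing:

```python
# Hypothetical helper (not part of the repo): report missing resource folders.
from pathlib import Path

EXPECTED = ["hubert_models", "smplx_models", "FLAME2020", "t2m_evaluators", "t5_models"]

def missing_resources(root="model_files", expected=EXPECTED):
    """Return the subdirectories of `root` that do not exist yet."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]

if __name__ == "__main__":
    missing = missing_resources()
    print("Missing:", ", ".join(missing) if missing else "none")
```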
Pretrained models are being uploaded gradually! Visit the Hugging Face repository to download them.
Text-to-Motion Generation
```bash
python demo.py --cfg configs/demo_text2motion.yaml --text examples/text2motion.txt --task text2motion --render
```

Co-speech Gesture Generation
```bash
python demo.py --cfg configs/demo_cospeech.yaml --audio examples/2_scott_0_111_111.wav --task cospeech --render
```

After running the demo scripts, the generated motion results (including rendered videos and motion data) will be saved in the ./results directory. For text-to-motion generation, you'll find the motion sequences in .npz format and rendered videos in .mp4 format. For co-speech gesture generation, the results will include synchronized motion and audio in a single video file.
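The .npz outputs can be inspected with NumPy. The array names inside depend on the task and config, so rather than guess at keys, a small helper can simply list whatever each file contains:

```python
# Inspect a generated motion file; array names vary by task/config.
import numpy as np

def summarize_npz(path):
    """Map each array name in a .npz file to its (shape, dtype)."""
    with np.load(path, allow_pickle=True) as data:
        return {k: (data[k].shape, str(data[k].dtype)) for k in data.files}

# Example (path is illustrative):
# for name, (shape, dtype) in summarize_npz("results/sample_0.npz").items():
#     print(name, shape, dtype)
```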
For detailed instructions on data preparation and preprocessing, please refer to the Datasets Guide.
1. Compositional Motion Tokenization (VQ-VAE Training)
This stage trains separate VQ-VAE models for different body regions. In our experiments, a codebook dimension of 256 with a codebook size of 512 yields better performance for the face, hands, and upper body, while the lower body performs better with a codebook dimension of 128 and a codebook size of 512. Accordingly, we provide these configuration files for reference. For detailed training procedures, metrics, and troubleshooting, see the Compositional Motion Tokenization Guide.
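For intuition, VQ-VAE quantization maps each encoder feature to its nearest codebook entry, and the entry's index becomes the discrete motion token. A minimal NumPy sketch with the 512 x 256 shape mentioned above (illustrative only, not the repository's implementation):

```python
# Minimal VQ lookup: assign each feature vector to its nearest codebook entry.
import numpy as np

def quantize(features, codebook):
    """features: (T, D); codebook: (K, D). Returns (codes, quantized)."""
    # Squared L2 distance between every feature and every codebook vector.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    codes = d2.argmin(axis=1)          # (T,) discrete token ids in [0, K)
    return codes, codebook[codes]      # token ids and quantized features

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 256))  # codebook size 512, dim 256
feats = codebook[[3, 10, 3]] + 0.01 * rng.normal(size=(3, 256))
codes, _ = quantize(feats, codebook)
print(codes.tolist())  # [3, 10, 3]
```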
Quick Start Commands:
Face Region:
```bash
python -m train --cfg configs/config_mixed_stage1_vq_face_256_512_ds4_wo_mesh_lr1e-4.yaml --nodebug
```

Upper Body Region:

```bash
python -m train --cfg configs/config_mixed_stage1_vq_upper_256_512_ds4_wo_mesh_lr1e-4.yaml --nodebug
```

Lower Body Region:

```bash
python -m train --cfg configs/config_mixed_stage1_vq_lower_128_512_ds4_wo_mesh_lr1e-4.yaml --nodebug
```

Hand Region:

```bash
python -m train --cfg configs/config_mixed_stage1_vq_hand_256_512_ds4_wo_mesh_lr1e-4.yaml --nodebug
```

Global (the compositional tokenizer does not encode any global information or global translation, so we still need a separate global translation predictor; this part is heavily borrowed from EMAGE):

```bash
python -m train --cfg configs/config_mixed_stage1_vae_global_wo_mesh_lr1e-4.yaml --nodebug
```

Once the compositional tokenizer training is finished, you will have 5 checkpoints: face, hand, upper, lower, and global translation. You can then convert the whole BEAT2 and AMASS datasets with the following:
```bash
python -m scripts.get_compositional_motion_code --cfg configs/config_mixed_stage1_vq_compositional.yaml
```

NOTE: Update the following fields in config_mixed_stage1_vq_compositional.yaml:

- CHECKPOINTS_FACE
- CHECKPOINTS_HAND
- CHECKPOINTS_UPPER
- CHECKPOINTS_LOWER
- code_num
- codebook_size

Replace them with your own checkpoints.
We also provide pretrained checkpoints on Hugging Face. All checkpoints reported here were trained on the AMASS and BEAT2 datasets to ensure stronger performance. If you want to reproduce the results shown in the paper, please use the checkpoint provided by EMAGE that was trained on BEAT2 speaker 2 only; it is used solely for metrics computation to guarantee fairness with other methods.
The result will be saved at: #TOKENS_DS4#
Here you can use the provided script to compare the original sequence with the reconstructed motion.

```bash
python -m scripts.inference_compositional_motion_code --cfg configs/config_mixed_stage1_vq_compositional.yaml
```

The npz files and rendering results will be generated within the output directory.
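When comparing the original and reconstructed sequences, a common sanity metric is the mean per-joint position error (MPJPE). A small NumPy sketch (our helper, not a script shipped with this repo):

```python
# Mean per-joint position error between two (T, J, 3) joint sequences.
import numpy as np

def mpjpe(original, reconstructed):
    """Average Euclidean distance per joint per frame, in the input units."""
    assert original.shape == reconstructed.shape
    return float(np.linalg.norm(original - reconstructed, axis=-1).mean())

gt = np.zeros((4, 55, 3))               # e.g. 55 SMPL-X joints over 4 frames
pred = gt + np.array([0.03, 0.0, 0.0])  # uniform 3 cm offset on x
print(round(mpjpe(gt, pred), 6))  # 0.03
```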
Audio Tokenizer: in this work, we choose HuBERT as our audio tokenizer. We used the original version of HuBERT provided here, which is uncommon at this stage; we recommend the newer version of HuBERT for higher compatibility.
```bash
python -m scripts.get_speech_code_beat2 --beat2_root "/path/to/your/beat2"
python -m scripts.get_speech_code_librispeech --data_path "/path/to/your/librispeech"
```

2. Language Model Pretraining
Pretrained on BEAT2 speaker2 only, used exclusively for fair comparison. This version uses tokenizers and datasets trained only on BEAT2 speaker2:
```bash
python -m train --cfg configs/config_mixed_stage2_speaker2.yaml --nodebug
```

Normal Version - can be trained on large-scale datasets without numerical comparison constraints:

```bash
python -m train --cfg configs/config_mixed_stage2.yaml --nodebug
```

3. Task-Specific Fine-tuning
Text-to-motion

```bash
python -m train --cfg configs/config_mixed_stage3_t2m.yaml --nodebug
```

Audio-to-motion

```bash
python -m train --cfg configs/config_mixed_stage3_a2m.yaml --nodebug
```

Stay tuned for updates on our training procedures and best practices.
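The pretraining and fine-tuning stages above operate on one mixed token stream. One simple way to combine text, audio, and motion codes in a single vocabulary is to offset each modality's code range so the ranges are disjoint; the sizes and scheme below are hypothetical, for illustration only, not the repository's actual token layout:

```python
# Hypothetical shared-vocabulary layout: each modality gets a disjoint id range.
# Sizes are illustrative (e.g. 512 matches the codebook size used above).
MODALITY_SIZES = {"text": 32100, "audio": 500,
                  "face": 512, "hands": 512, "upper": 512, "lower": 512}

def build_offsets(sizes):
    """Assign each modality a starting offset into the shared vocabulary."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # start == total vocabulary size

def to_shared(modality, code, offsets):
    """Map a modality-local code to its shared-vocabulary id."""
    return offsets[modality] + code

offsets, vocab_size = build_offsets(MODALITY_SIZES)
print(vocab_size)                     # 34648
print(to_shared("face", 7, offsets))  # 32600 + 7 = 32607
```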
To evaluate the co-speech metrics, please first update the trained model checkpoint paths in configs/config_mixed_stage3_a2m.yaml:
- TEST.CHECKPOINTS_FACE
- TEST.CHECKPOINTS_HAND
- TEST.CHECKPOINTS_UPPER
- TEST.CHECKPOINTS_LOWER
Then, run the following command:
```bash
python -m test --cfg configs/config_mixed_stage3_a2m.yaml
```

Note: The demo checkpoint named "Instruct_Mixed_A2M_LM.ckpt" is provided for visualization purposes only. When training your own model, you will observe performance curves similar to those shown below. However, the results presented in the paper do not represent the optimal performance achievable with this framework.
If you find our work useful for your research, please consider citing:
```bibtex
@article{chen2024language,
  title={The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion},
  author={Chen, Changan and Zhang, Juze and Lakshmikanth, Shrinidhi K and Fang, Yusu and Shao, Ruizhi and Wetzstein, Gordon and Fei-Fei, Li and Adeli, Ehsan},
  journal={CVPR},
  year={2025}
}
```

This project was partially funded by NIH grant R01AG089169 and UST. The authors would also like to thank Georgios Pavlakos for his valuable discussion, and Chaitanya Patel, Jingyan Zhang, and Bin Li for their feedback on the paper.
