We introduce TriAdapter Multi-Modal Learning (TAMM), a novel two-stage learning approach built on three synergistic adapters. First, our CLIP Image Adapter mitigates the domain gap between 3D-rendered images and natural images by adapting CLIP's visual representations to synthetic image-text pairs. Then, our Dual Adapters decouple the 3D shape representation space into two complementary sub-spaces, one focusing on visual attributes and the other on semantic understanding, ensuring more comprehensive and effective multi-modal pre-training.
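As a rough illustration of what "adapter" means here: each of the three modules can be thought of as a small residual MLP that re-maps frozen backbone features into a new space. The sketch below is only a conceptual example; the class name, hidden width, and residual ratio `alpha` are assumptions on our part, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Illustrative residual MLP adapter (conceptual sketch only).

    Re-maps features from a frozen backbone (e.g., CLIP's image encoder
    or a 3D point-cloud encoder) while a residual mix preserves most of
    the original representation.
    """
    def __init__(self, dim: int, hidden: int = 256, alpha: float = 0.5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )
        self.alpha = alpha  # assumed residual ratio, not taken from the paper

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Blend adapted features with the original ones.
        return self.alpha * self.mlp(feat) + (1.0 - self.alpha) * feat
```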
We are committed to open-sourcing TAMM-related materials, including:
- Evaluation code
- Evaluation data
- Pretraining code
- Pretrained checkpoints
- Downstream task implementations
Clone this repository and install the required packages:
```bash
conda create -n tamm python=3.9
conda activate tamm
conda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.6 -c pytorch -c nvidia
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine
conda install -c dglteam/label/cu113 dgl
pip install huggingface_hub tqdm
```
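After installation, a quick import check can confirm that the core dependencies are wired up correctly (a minimal sanity check we suggest, not part of the repository):

```python
# Run inside the `tamm` environment to verify the installation.
import torch
import MinkowskiEngine as ME
import dgl

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("MinkowskiEngine:", ME.__version__)
print("dgl:", dgl.__version__)
```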
Zero-shot classification accuracy (%):

| Model | Training Data | Objaverse-LVIS Top-1 (Top-5) | ModelNet40 Top-1 (Top-5) | ScanObjectNN Top-1 (Top-5) |
|---|---|---|---|---|
| PointBERT | Ensembled w/o LVIS | 43.5 (72.3) | 86.2 (97.9) | 55.9 (88.2) |
| PointBERT | Ensembled | 51.9 (81.3) | 86.1 (97.8) | 57.0 (86.8) |
| PointBERT | ShapeNet | 13.7 (29.2) | 73.2 (91.8) | 54.3 (83.6) |
- Please refer to here for pre-training dataset preparation, and place the data in the `data` folder so that it looks like this:

```
├── data
│   ├── objaverse_processed
│   │   ├── merged_for_training_all
│   │   │   ├── ...
│   ├── meta_data
```
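Since the environment installs `huggingface_hub`, the processed data is presumably distributed via the Hugging Face Hub. A download sketch might look like the following; the `repo_id` is a placeholder assumption, so substitute the actual dataset repository from the preparation instructions:

```python
from huggingface_hub import snapshot_download

# Placeholder repo_id -- replace with the actual dataset repository.
snapshot_download(
    repo_id="<org>/<tamm-pretraining-data>",
    repo_type="dataset",
    local_dir="data",  # matches the folder layout shown above
)
```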
- Run pre-training stage 1 (training the CLIP Image Adapter) with the following command:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=29001 main.py --config configs/clip_image_adapter_training.yaml
```
- Run pre-training stage 2 (multi-modal pre-training of the 3D encoder with the Dual Adapters) with the following command:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=29001 main.py --config configs/pointbert.yaml
```
Run the zero-shot evaluation with the following command:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=29001 test.py --config configs/Pre-training/pointbert.yaml --resume /path/to/pre-trained-models
```
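Conceptually, the zero-shot protocol follows the CLIP recipe: category names are embedded with the CLIP text encoder, point clouds with the pre-trained 3D encoder, and each shape is assigned the label with the highest cosine similarity. The sketch below illustrates that final step; the function is our own illustration, not the repository's API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(shape_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Assign each shape the category whose text embedding is most similar.

    shape_feats: (N, D) embeddings from the pre-trained 3D encoder.
    text_feats:  (C, D) CLIP text embeddings of the category names.
    Returns:     (N,) predicted category indices.
    """
    shape_feats = F.normalize(shape_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = shape_feats @ text_feats.T  # (N, C) cosine similarities
    return logits.argmax(dim=-1)
```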
TAMM is built upon the awesome OpenCLIP, ULIP, OpenShape, and Uni3D.
```bibtex
@article{zhang2024tamm,
  title={TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding},
  author={Zhang, Zhihao and Cao, Shengcao and Wang, Yu-Xiong},
  journal={arXiv preprint arXiv:2402.18490},
  year={2024}
}
```