Abstract
Recent advances in multi-modal pre-training for 3D point clouds have demonstrated promising results by aligning heterogeneous features across 3D shapes and their corresponding 2D images and language descriptions. However, current straightforward solutions often overlook intricate structural relations among samples, potentially limiting the full capabilities of multi-modal learning. To address this issue, we introduce Multi-modal Relation Distillation (MRD), a tri-modal pre-training framework designed to effectively distill well-established large Vision-Language Models (VLMs) into 3D backbones. MRD captures both the intra-relations within each modality and the cross-relations between different modalities, producing more discriminative 3D shape representations. Notably, MRD achieves significant improvements on downstream zero-shot classification and cross-modal retrieval tasks, delivering new state-of-the-art performance.
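To make the idea of intra- and cross-modal relation distillation concrete, the following is a minimal PyTorch sketch under simplified assumptions; it is not the authors' implementation. It assumes batch-aligned image and text embeddings from a frozen CLIP-style teacher and point-cloud embeddings from a trainable 3D backbone as the student; all names (relation_matrix, cross_relation_matrix, relation_distillation_loss, tau) are hypothetical. Intra-modal relations are modeled as row-softmaxed pairwise similarity matrices within a batch, cross-modal relations as similarities between two modalities, and the student's relation structure is pulled toward the teacher's with a KL divergence.

```python
# Hypothetical sketch of relation distillation for tri-modal pre-training.
import torch
import torch.nn.functional as F


def relation_matrix(feats: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Row-wise softmax over pairwise cosine similarities within a batch."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / tau          # (B, B) intra-modal similarities
    return F.softmax(sim, dim=-1)


def cross_relation_matrix(a: torch.Tensor, b: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Row-wise softmax over similarities between two modalities."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    return F.softmax(a @ b.t() / tau, dim=-1)


def relation_distillation_loss(student_rel: torch.Tensor,
                               teacher_rel: torch.Tensor) -> torch.Tensor:
    """KL divergence pushing student relations toward teacher relations."""
    return F.kl_div(student_rel.clamp_min(1e-8).log(), teacher_rel,
                    reduction="batchmean")


# Toy usage with random features standing in for a batch of aligned
# (point cloud, image, text) triplets.
B, D = 8, 512
pc_feat  = torch.randn(B, D, requires_grad=True)    # student 3D features
img_feat = torch.randn(B, D)                         # frozen teacher image features
txt_feat = torch.randn(B, D)                         # frozen teacher text features

# Intra-modal relations: the student's 3D relation structure mimics the
# teacher's image-side and text-side relation structures.
loss_intra = (relation_distillation_loss(relation_matrix(pc_feat), relation_matrix(img_feat)) +
              relation_distillation_loss(relation_matrix(pc_feat), relation_matrix(txt_feat)))

# Cross-modal relations: 3D-to-text relations mimic image-to-text relations.
loss_cross = relation_distillation_loss(cross_relation_matrix(pc_feat, txt_feat),
                                        cross_relation_matrix(img_feat, txt_feat))

loss = loss_intra + loss_cross
loss.backward()
```

In practice such relation terms would typically be combined with the standard contrastive alignment losses used by prior tri-modal frameworks; the sketch isolates only the relation-distillation component.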
H. Wang: During this joint project, he was also an intern with PICO ARCH.
Acknowledgements
This work is partly supported by the National Natural Science Foundation of China (No. 62176012 and 62022011), the Research Program of State Key Laboratory of Software Development Environment, and the Fundamental Research Funds for the Central Universities, with additional in-kind contributions from PICO ARCH.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, H. et al. (2025). Multi-modal Relation Distillation for Unified 3D Representation Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15091. Springer, Cham. https://doi.org/10.1007/978-3-031-73414-4_21
DOI: https://doi.org/10.1007/978-3-031-73414-4_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73413-7
Online ISBN: 978-3-031-73414-4
eBook Packages: Computer Science, Computer Science (R0), Springer Nature Proceedings Computer Science

