Abstract
Recent advances in multi-modal pre-training for 3D point clouds have demonstrated promising results by aligning heterogeneous features across 3D shapes and their corresponding 2D images and language descriptions. However, current straightforward solutions often overlook intricate structural relations among samples, potentially limiting the full capabilities of multi-modal learning. To address this issue, we introduce Multi-modal Relation Distillation (MRD), a tri-modal pre-training framework designed to effectively distill well-established large Vision-Language Models (VLMs) into 3D backbones. MRD captures both the intra-relations within each modality and the cross-relations between different modalities, producing more discriminative 3D shape representations. Notably, MRD achieves significant improvements on downstream zero-shot classification and cross-modal retrieval tasks, delivering new state-of-the-art performance.
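To make the idea of intra- and cross-modal relation distillation concrete, the following is a minimal PyTorch sketch under simplified assumptions; it is not the authors' implementation. It assumes batch-aligned image and text embeddings from a frozen CLIP-style teacher and point-cloud embeddings from a trainable 3D backbone as the student; all names (relation_matrix, cross_relation_matrix, relation_distillation_loss, tau) are hypothetical. Intra-modal relations are modeled as row-softmaxed pairwise similarity matrices within a batch, cross-modal relations as similarities between two modalities, and the student's relation structure is pulled toward the teacher's with a KL divergence.

```python
# Hypothetical sketch of relation distillation for tri-modal pre-training.
import torch
import torch.nn.functional as F


def relation_matrix(feats: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Row-wise softmax over pairwise cosine similarities within a batch."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / tau          # (B, B) intra-modal similarities
    return F.softmax(sim, dim=-1)


def cross_relation_matrix(a: torch.Tensor, b: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Row-wise softmax over similarities between two modalities."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    return F.softmax(a @ b.t() / tau, dim=-1)


def relation_distillation_loss(student_rel: torch.Tensor,
                               teacher_rel: torch.Tensor) -> torch.Tensor:
    """KL divergence pushing student relations toward teacher relations."""
    return F.kl_div(student_rel.clamp_min(1e-8).log(), teacher_rel,
                    reduction="batchmean")


# Toy usage with random features standing in for a batch of aligned
# (point cloud, image, text) triplets.
B, D = 8, 512
pc_feat  = torch.randn(B, D, requires_grad=True)    # student 3D features
img_feat = torch.randn(B, D)                         # frozen teacher image features
txt_feat = torch.randn(B, D)                         # frozen teacher text features

# Intra-modal relations: the student's 3D relation structure mimics the
# teacher's image-side and text-side relation structures.
loss_intra = (relation_distillation_loss(relation_matrix(pc_feat), relation_matrix(img_feat)) +
              relation_distillation_loss(relation_matrix(pc_feat), relation_matrix(txt_feat)))

# Cross-modal relations: 3D-to-text relations mimic image-to-text relations.
loss_cross = relation_distillation_loss(cross_relation_matrix(pc_feat, txt_feat),
                                        cross_relation_matrix(img_feat, txt_feat))

loss = loss_intra + loss_cross
loss.backward()
```

In practice such relation terms would typically be combined with the standard contrastive alignment losses used by prior tri-modal frameworks; the sketch isolates only the relation-distillation component.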
H. Wang: During this joint project, he was also an intern with PICO ARCH.
Acknowledgements
This work is partly supported by the National Natural Science Foundation of China (No. 62176012 and 62022011), the Research Program of State Key Laboratory of Software Development Environment, and the Fundamental Research Funds for the Central Universities, with additional in-kind contributions from PICO ARCH.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, H. et al. (2025). Multi-modal Relation Distillation for Unified 3D Representation Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15091. Springer, Cham. https://doi.org/10.1007/978-3-031-73414-4_21
DOI: https://doi.org/10.1007/978-3-031-73414-4_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73413-7
Online ISBN: 978-3-031-73414-4
eBook Packages: Computer Science, Computer Science (R0), Springer Nature Proceedings Computer Science

