
Multi-modal Relation Distillation for Unified 3D Representation Learning

  • Conference paper in Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15091)

Abstract

Recent advancements in multi-modal pre-training for 3D point clouds have demonstrated promising results by aligning heterogeneous features across 3D shapes and their corresponding 2D images and language descriptions. However, current straightforward solutions often overlook intricate structural relations among samples, potentially limiting the full capabilities of multi-modal learning. To address this issue, we introduce Multi-modal Relation Distillation (MRD), a tri-modal pre-training framework designed to effectively distill reputable large Vision-Language Models (VLMs) into 3D backbones. MRD aims to capture both the intra-relations within each modality and the cross-relations between different modalities, producing more discriminative 3D shape representations. Notably, MRD achieves significant improvements in downstream zero-shot classification tasks and cross-modality retrieval tasks, delivering new state-of-the-art performance.
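The exact MRD objective is specified in the paper; as a rough illustration of what tri-modal relation distillation can look like, the sketch below aligns a student 3D backbone's batch-wise similarity structure with that of a frozen VLM teacher, covering both intra-modal and cross-modal relations. The helper names (`relation_logits`, `kl_relation`, `relation_distillation_loss`) and the temperature value are assumptions for this sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def relation_logits(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Batch-wise cosine-similarity logits between two sets of embeddings of shape (B, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return a @ b.t() / tau


def kl_relation(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence between row-wise relation distributions, with the teacher as target."""
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )


def relation_distillation_loss(pc_emb, img_emb, txt_emb, tau: float = 0.07) -> torch.Tensor:
    """Illustrative tri-modal relation-distillation loss (not the paper's exact formulation).

    pc_emb           : student 3D embeddings from the point-cloud backbone, shape (B, D)
    img_emb, txt_emb : frozen teacher VLM image / text embeddings, shape (B, D)
    """
    img_emb, txt_emb = img_emb.detach(), txt_emb.detach()
    losses = [
        # Intra-relations: the 3D-3D similarity structure should mimic the
        # teacher's image-image and text-text structures.
        kl_relation(relation_logits(pc_emb, pc_emb, tau), relation_logits(img_emb, img_emb, tau)),
        kl_relation(relation_logits(pc_emb, pc_emb, tau), relation_logits(txt_emb, txt_emb, tau)),
        # Cross-relations: treating the 3D embedding as a stand-in for the image,
        # its relations to text should mimic the teacher's image-text relations,
        kl_relation(relation_logits(pc_emb, txt_emb, tau), relation_logits(img_emb, txt_emb, tau)),
        # ...and its relations to images should mimic the teacher's text-image relations.
        kl_relation(relation_logits(pc_emb, img_emb, tau), relation_logits(txt_emb, img_emb, tau)),
    ]
    return torch.stack(losses).mean()
```

In a typical tri-modal setup, a relation term like this would be combined with the standard contrastive alignment losses against the frozen image and text embeddings; the specific formulation and weighting used by MRD are those given in the paper.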

H. Wang—During this joint project, he is also an intern with PICO ARCH.



Acknowledgements

This work is partly supported by the National Natural Science Foundation of China (No. 62176012 and 62022011), the Research Program of State Key Laboratory of Software Development Environment, and the Fundamental Research Funds for the Central Universities, with additional in-kind contributions from PICO ARCH.


Corresponding author

Correspondence to Di Huang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1366 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, H. et al. (2025). Multi-modal Relation Distillation for Unified 3D Representation Learning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15091. Springer, Cham. https://doi.org/10.1007/978-3-031-73414-4_21
