{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T19:24:00Z","timestamp":1776885840859,"version":"3.51.2"},"reference-count":102,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T00:00:00Z","timestamp":1721347200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Bavarian State Ministry of Science and the Arts"},{"name":"ERC Starting Grant","award":["SpatialSem (101076253)"],"award-info":[{"award-number":["SpatialSem (101076253)"]}]},{"name":"German Research Foundation (DFG) Grant","award":["Learning How to Interact with Scenes through Part-Based Understanding"],"award-info":[{"award-number":["Learning How to Interact with Scenes through Part-Based Understanding"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Graph."],"published-print":{"date-parts":[[2024,7,19]]},"abstract":"<jats:p>Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive yet imperfect annotations of CAD models associated with real images, and encounter challenges due to the inherent ambiguities in the task - both in depth-scale ambiguity in monocular perception, as well as inexact matches of CAD database models to real observations. We thus propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We learn a probabilistic model through diffusion, modeling likely distributions of shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, requiring only a few hypotheses to characterize ambiguities in depth\/scale and inexact shape matches. Our approach is trained only on synthetic data, leveraging monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach can even surpass the supervised state-of-the-art on the Scan2CAD dataset by 5.9% with 8 hypotheses.<\/jats:p>","DOI":"10.1145\/3658236","type":"journal-article","created":{"date-parts":[[2024,7,19]],"date-time":"2024-07-19T14:47:57Z","timestamp":1721400477000},"page":"1-15","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":16,"title":["DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image"],"prefix":"10.1145","volume":"43","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0458-8107","authenticated-orcid":false,"given":"Daoyi","family":"Gao","sequence":"first","affiliation":[{"name":"Technical University of Munich, Munich, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8568-4960","authenticated-orcid":false,"given":"David","family":"Rozenberszki","sequence":"additional","affiliation":[{"name":"Technical University of Munich, Munich, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7998-3737","authenticated-orcid":false,"given":"Stefan","family":"Leutenegger","sequence":"additional","affiliation":[{"name":"Technical University of Munich, Munich, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6241-8782","authenticated-orcid":false,"given":"Angela","family":"Dai","sequence":"additional","affiliation":[{"name":"Technical University of Munich, Munich, Germany"}]}],"member":"320","published-online":{"date-parts":[[2024,7,19]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00272"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00264"},{"key":"e_1_2_2_3_1","volume-title":"Proceedings, Part XXII 16","author":"Avetisyan Armen","year":"2020","unstructured":"Armen Avetisyan, Tatiana Khanova, Christopher Choy, Denver Dash, Angela Dai, and Matthias Nie\u00dfner. 2020. Scenecad: Predicting object alignments and layouts in rgb-d scans. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXII 16. Springer, 596--612."},{"key":"e_1_2_2_4_1","volume-title":"IronDepth: Iterative Refinement of Single-View Depth using Surface Normal and its Uncertainty. ArXiv abs\/2210.03676","author":"Bae Gwangbin","year":"2022","unstructured":"Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. 2022. IronDepth: Iterative Refinement of Single-View Depth using Surface Normal and its Uncertainty. ArXiv abs\/2210.03676 (2022). https:\/\/api.semanticscholar.org\/CorpusID:252762221"},{"key":"e_1_2_2_5_1","volume-title":"Label-Efficient Semantic Segmentation with Diffusion Models. ArXiv abs\/2112.03126","author":"Baranchuk Dmitry","year":"2021","unstructured":"Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. 2021. Label-Efficient Semantic Segmentation with Diffusion Models. ArXiv abs\/2112.03126 (2021). https:\/\/api.semanticscholar.org\/CorpusID:244908617"},{"key":"e_1_2_2_6_1","volume-title":"Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). https:\/\/openreview.net\/forum?id=tjZjv_qh_CE","author":"Baruch Gilad","year":"2021","unstructured":"Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. 2021. ARKitScenes - A Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile RGB-D Data. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). https:\/\/openreview.net\/forum?id=tjZjv_qh_CE"},{"key":"e_1_2_2_7_1","volume-title":"Weakly-Supervised End-to-End CAD Retrieval to Scan Objects. ArXiv abs\/2203.12873","author":"Beyer Tim","year":"2022","unstructured":"Tim Beyer and Angela Dai. 2022. Weakly-Supervised End-to-End CAD Retrieval to Scan Objects. ArXiv abs\/2203.12873 (2022). https:\/\/api.semanticscholar.org\/CorpusID:247627889"},{"key":"e_1_2_2_8_1","volume-title":"AdaBins: Depth Estimation Using Adaptive Bins. 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Bhat S.","year":"2020","unstructured":"S. Bhat, Ibraheem Alhashim, and Peter Wonka. 2020. AdaBins: Depth Estimation Using Adaptive Bins. 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020), 4008--4017. https:\/\/api.semanticscholar.org\/CorpusID:227227779"},{"key":"e_1_2_2_9_1","volume-title":"Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288","author":"Bhat Shariq Farooq","year":"2023","unstructured":"Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias M\u00fcller. 2023. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288 (2023)."},{"key":"e_1_2_2_10_1","first-page":"15309","article-title":"Retrieval-augmented diffusion models","volume":"35","author":"Blattmann Andreas","year":"2022","unstructured":"Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas M\u00fcller, and Bj\u00f6rn Ommer. 2022. Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems 35 (2022), 15309--15324.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_11_1","volume-title":"End-to-End Object Detection with Transformers. ArXiv abs\/2005.12872","author":"Carion Nicolas","year":"2020","unstructured":"Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-End Object Detection with Transformers. ArXiv abs\/2005.12872 (2020). https:\/\/api.semanticscholar.org\/CorpusID:218889832"},{"key":"e_1_2_2_12_1","volume-title":"Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012","author":"Chang Angel X","year":"2015","unstructured":"Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. 2015. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)."},{"key":"e_1_2_2_13_1","volume-title":"Cohen","author":"Chen Wenhu","year":"2022","unstructured":"Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W. Cohen. 2022. Re-Imagen: Retrieval-Augmented Text-to-Image Generator. ArXiv abs\/2209.14491 (2022). https:\/\/api.semanticscholar.org\/CorpusID:252596087"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00433"},{"key":"e_1_2_2_15_1","volume-title":"DiffusionSDF: Conditional Generative Modeling of Signed Distance Functions. ArXiv abs\/2211.13757","author":"Chou Gene","year":"2022","unstructured":"Gene Chou, Yuval Bahat, and Felix Heide. 2022. DiffusionSDF: Conditional Generative Modeling of Signed Distance Functions. ArXiv abs\/2211.13757 (2022). https:\/\/api.semanticscholar.org\/CorpusID:254017862"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00215"},{"key":"e_1_2_2_17_1","volume-title":"Proceedings, Part VIII 14","author":"Choy Christopher B","year":"2016","unstructured":"Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 2016. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part VIII 14. Springer, 628--644."},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00831"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.261"},{"key":"e_1_2_2_20_1","volume-title":"2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)","author":"Deng Congyue","year":"2022","unstructured":"Congyue Deng, Chiyu Max Jiang, C. Qi, Xinchen Yan, Yin Zhou, Leonidas J. Guibas, and Drago Anguelov. 2022. NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors. 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), 20637--20647. https:\/\/api.semanticscholar.org\/CorpusID:254366717"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.21105\/joss.04901"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00816"},{"key":"e_1_2_2_23_1","volume-title":"CG-HOI: Contact-Guided 3D Human-Object Interaction Generation. arXiv preprint arXiv:2311.16097","author":"Diller Christian","year":"2023","unstructured":"Christian Diller and Angela Dai. 2023. CG-HOI: Contact-Guided 3D Human-Object Interaction Generation. arXiv preprint arXiv:2311.16097 (2023)."},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01061"},{"key":"e_1_2_2_25_1","volume-title":"Hyperdiffusion: Generating implicit neural fields with weight-space diffusion. arXiv preprint arXiv:2303.17015","author":"Erko\u00e7 Ziya","year":"2023","unstructured":"Ziya Erko\u00e7, Fangchang Ma, Qi Shan, Matthias Nie\u00dfner, and Angela Dai. 2023. Hyperdiffusion: Generating implicit neural fields with weight-space diffusion. arXiv preprint arXiv:2303.17015 (2023)."},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.264"},{"key":"e_1_2_2_27_1","volume-title":"2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)","author":"Fang Yuxin","year":"2022","unstructured":"Yuxin Fang, Wen Wang, Binhui Xie, Quan-Sen Sun, Ledell Yu Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. 2022. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale. 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), 19358--19369. https:\/\/api.semanticscholar.org\/CorpusID:253510587"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/358669.358692"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01075"},{"key":"e_1_2_2_30_1","volume-title":"Yolox: Exceeding yolo series in","author":"Ge Zheng","year":"2021","unstructured":"Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. 2021. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021)."},{"key":"e_1_2_2_31_1","doi-asserted-by":"crossref","unstructured":"Golnaz Ghiasi Xiuye Gu Yin Cui and Tsung-Yi Lin. 2022. Scaling Open-Vocabulary Image Segmentation with Image-Level Labels. In ECCV.","DOI":"10.1007\/978-3-031-20059-5_31"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00988"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00847"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00399"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00509"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.322"},{"key":"e_1_2_2_37_1","volume-title":"Unsupervised Semantic Correspondence Using Stable Diffusion. arXiv preprint arXiv:2305.15581","author":"Hedlin Eric","year":"2023","unstructured":"Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. 2023. Unsupervised Semantic Correspondence Using Stable Diffusion. arXiv preprint arXiv:2305.15581 (2023)."},{"key":"e_1_2_2_38_1","unstructured":"Jonathan Ho Ajay Jain and P. Abbeel. 2020. Denoising Diffusion Probabilistic Models. ArXiv abs\/2006.11239 (2020). https:\/\/api.semanticscholar.org\/CorpusID:219955663"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA46639.2022.9811799"},{"key":"e_1_2_2_40_1","volume-title":"Proceedings, Part XIII 16","author":"Ishimtsev Vladislav","year":"2020","unstructured":"Vladislav Ishimtsev, Alexey Bokhovkin, Alexey Artemov, Savva Ignatyev, Matthias Niessner, Denis Zorin, and Evgeny Burnaev. 2020. Cad-deform: Deformable fitting of cad models to 3d scans. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XIII 16. Springer, 599--628."},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.260"},{"key":"e_1_2_2_42_1","first-page":"1","article-title":"Acquiring 3d indoor environments with variability and repetition","volume":"31","author":"Kim Young Min","year":"2012","unstructured":"Young Min Kim, Niloy J Mitra, Dong-Ming Yan, and Leonidas Guibas. 2012. Acquiring 3d indoor environments with variability and repetition. ACM Transactions on Graphics (TOG) 31, 6 (2012), 1--11.","journal-title":"ACM Transactions on Graphics (TOG)"},{"key":"e_1_2_2_43_1","volume-title":"Variational Diffusion Models. ArXiv abs\/2107.00630","author":"Kingma Diederik P.","year":"2021","unstructured":"Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. 2021. Variational Diffusion Models. ArXiv abs\/2107.00630 (2021). https:\/\/api.semanticscholar.org\/CorpusID:235694314"},{"key":"e_1_2_2_44_1","doi-asserted-by":"crossref","unstructured":"Alexander Kirillov Eric Mintun Nikhila Ravi Hanzi Mao Chloe Rolland Laura Gustafson Tete Xiao Spencer Whitehead Alexander C Berg Wan-Yen Lo et al. 2023. Segment anything. arXiv preprint arXiv:2304.02643 (2023).","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"e_1_2_2_45_1","volume-title":"Minh Hoai Nguyen, and Minhyuk Sung","author":"Koo Juil","year":"2023","unstructured":"Juil Koo, Seungwoo Yoo, Minh Hoai Nguyen, and Minhyuk Sung. 2023. SALAD: Part-Level Latent Diffusion for 3D Shape Generation and Manipulation. ArXiv abs\/2303.12236 (2023). https:\/\/api.semanticscholar.org\/CorpusID:257663544"},{"key":"e_1_2_2_46_1","volume-title":"Proceedings, Part III 16","author":"Kuo Weicheng","year":"2020","unstructured":"Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, and Angela Dai. 2020. Mask2cad: 3d shape prediction by learning to segment and retrieve. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part III 16. Springer, 260--277."},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01236"},{"key":"e_1_2_2_48_1","volume-title":"SPARC: Sparse Render-and-Compare for CAD model alignment in a single RGB image. arXiv preprint arXiv:2210.01044","author":"Langer Florian","year":"2022","unstructured":"Florian Langer, Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. 2022. SPARC: Sparse Render-and-Compare for CAD model alignment in a single RGB image. arXiv preprint arXiv:2210.01044 (2022)."},{"key":"e_1_2_2_49_1","volume-title":"Sparse Multi-Object Render-and-Compare. arXiv preprint arXiv:2310.11184","author":"Langer Florian","year":"2023","unstructured":"Florian Langer, Ignas Budvytis, and Roberto Cipolla. 2023. Sparse Multi-Object Render-and-Compare. arXiv preprint arXiv:2310.11184 (2023)."},{"key":"e_1_2_2_50_1","volume-title":"Language-driven Semantic Segmentation. ArXiv abs\/2201.03546","author":"Li Boyi","year":"2022","unstructured":"Boyi Li, Kilian Q. Weinberger, Serge J. Belongie, Vladlen Koltun, and Ren\u00e9 Ranftl. 2022c. Language-driven Semantic Segmentation. ArXiv abs\/2201.03546 (2022). https:\/\/api.semanticscholar.org\/CorpusID:245836975"},{"key":"e_1_2_2_51_1","volume-title":"2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Li Muheng","year":"2022","unstructured":"Muheng Li, Yueqi Duan, Jie Zhou, and Jiwen Lu. 2022b. Diffusion-SDF: Text-to-Shape via Voxelized Diffusion. 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), 12642--12651. https:\/\/api.semanticscholar.org\/CorpusID:254366593"},{"key":"e_1_2_2_52_1","volume-title":"Computer graphics forum","author":"Li Yangyan","unstructured":"Yangyan Li, Angela Dai, Leonidas Guibas, and Matthias Nie\u00dfner. 2015. Database-assisted object retrieval for real-time 3d reconstruction. In Computer graphics forum, Vol. 34. Wiley Online Library, 435--446."},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11633-023-1458-0"},{"key":"e_1_2_2_54_1","volume-title":"Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP. 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Liang Feng","year":"2022","unstructured":"Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, P\u00e9ter Vajda, and Diana Marculescu. 2022. Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP. 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), 7061--7070. https:\/\/api.semanticscholar.org\/CorpusID:252780581"},{"key":"e_1_2_2_55_1","volume-title":"2023 IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV)","author":"Lin Kai-En","year":"2022","unstructured":"Kai-En Lin, Yen-Chen Lin, Wei-Sheng Lai, Tsung-Yi Lin, Yichang Shih, and Ravi Ramamoorthi. 2022. Vision Transformer for NeRF-Based View Synthesis from a Single Input Image. 2023 IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV) (2022), 806--815. https:\/\/api.semanticscholar.org\/CorpusID:250450901"},{"key":"e_1_2_2_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.106"},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00187"},{"key":"e_1_2_2_58_1","volume-title":"Towards High-Fidelity Single-view Holistic Reconstruction of Indoor Scenes. ArXiv abs\/2207.08656","author":"Liu Haolin","year":"2022","unstructured":"Haolin Liu, Yujian Zheng, Guanying Chen, Shuguang Cui, and Xiaoguang Han. 2022. Towards High-Fidelity Single-view Holistic Reconstruction of Indoor Scenes. ArXiv abs\/2207.08656 (2022). https:\/\/api.semanticscholar.org\/CorpusID:250627520"},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298965"},{"key":"e_1_2_2_60_1","volume-title":"Aleksander Holynski, and Trevor Darrell.","author":"Luo Grace","year":"2023","unstructured":"Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, and Trevor Darrell. 2023. Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence. In Advances in Neural Information Processing Systems."},{"key":"e_1_2_2_61_1","volume-title":"3D-LMNet: Latent Embedding Matching for Accurate and Diverse 3D Point Cloud Reconstruction from a Single Image. ArXiv abs\/1807.07796","author":"Mandikal Priyanka","year":"2018","unstructured":"Priyanka Mandikal, L. NavaneetK., Mayank Agarwal, and R. Venkatesh Babu. 2018. 3D-LMNet: Latent Embedding Matching for Accurate and Diverse 3D Point Cloud Reconstruction from a Single Image. ArXiv abs\/1807.07796 (2018). https:\/\/api.semanticscholar.org\/CorpusID:49905039"},{"key":"e_1_2_2_62_1","volume-title":"Vid2cad: Cad model alignment using multi-view constraints from videos","author":"Maninis Kevis-Kokitsi","year":"2022","unstructured":"Kevis-Kokitsi Maninis, Stefan Popov, Matthias Nie\u00dfner, and Vittorio Ferrari. 2022. Vid2cad: Cad model alignment using multi-view constraints from videos. IEEE transactions on pattern analysis and machine intelligence 45, 1 (2022), 1320--1327."},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00459"},{"key":"e_1_2_2_64_1","volume-title":"3D-LDM: Neural Implicit 3D Shape Generation with Latent Diffusion Models. ArXiv abs\/2212.00842","author":"Nam Gimin","year":"2022","unstructured":"Gimin Nam, Mariem Khlifi, Andrew Rodriguez, Alberto Tono, Linqi Zhou, and Paul Guerrero. 2022. 3D-LDM: Neural Implicit 3D Shape Generation with Latent Diffusion Models. ArXiv abs\/2212.00842 (2022). https:\/\/api.semanticscholar.org\/CorpusID:254220714"},{"key":"e_1_2_2_65_1","doi-asserted-by":"publisher","DOI":"10.1145\/2366145.2366156"},{"key":"e_1_2_2_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW56347.2022.00501"},{"key":"e_1_2_2_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00013"},{"key":"e_1_2_2_68_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.01006"},{"key":"e_1_2_2_69_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.00387"},{"key":"e_1_2_2_70_1","volume-title":"Proceedings, Part III 16","author":"Peng Songyou","year":"2020","unstructured":"Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. 2020. Convolutional occupancy networks. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part III 16. Springer, 523--540."},{"key":"e_1_2_2_71_1","volume-title":"SharpNet: Fast and Accurate Recovery of Occluding Contours in Monocular Depth Estimation. 2019 IEEE\/CVF International Conference on Computer Vision Workshop (ICCVW)","author":"Ramamonjisoa Michael","year":"2019","unstructured":"Michael Ramamonjisoa and Vincent Lepetit. 2019. SharpNet: Fast and Accurate Recovery of Occluding Contours in Monocular Depth Estimation. 2019 IEEE\/CVF International Conference on Computer Vision Workshop (ICCVW) (2019), 2109--2118. https:\/\/api.semanticscholar.org\/CorpusID:160009795"},{"key":"e_1_2_2_72_1","volume-title":"Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1, 2","author":"Ramesh Aditya","year":"2022","unstructured":"Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1, 2 (2022), 3."},{"key":"e_1_2_2_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.01196"},{"key":"e_1_2_2_74_1","volume-title":"Accelerating 3d deep learning with pytorch3d. arXiv preprint arXiv:2007.08501","author":"Ravi Nikhila","year":"2020","unstructured":"Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. 2020. Accelerating 3d deep learning with pytorch3d. arXiv preprint arXiv:2007.08501 (2020)."},{"key":"e_1_2_2_75_1","doi-asserted-by":"crossref","unstructured":"Robin Rombach Andreas Blattmann Dominik Lorenz Patrick Esser and Bj\u00f6rn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_2_2_76_1","doi-asserted-by":"crossref","unstructured":"Olga Russakovsky Jia Deng Hao Su Jonathan Krause Sanjeev Satheesh Sean Ma Zhiheng Huang Andrej Karpathy Aditya Khosla Michael Bernstein et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115 (2015) 211--252.","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_2_2_77_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2023.3277785"},{"key":"e_1_2_2_78_1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2366145.2366155","article-title":"An interactive approach to semantic modeling of indoor scenes with an rgbd camera","volume":"31","author":"Shao Tianjia","year":"2012","unstructured":"Tianjia Shao, Weiwei Xu, Kun Zhou, Jingdong Wang, Dongping Li, and Baining Guo. 2012. An interactive approach to semantic modeling of indoor scenes with an rgbd camera. ACM Transactions on Graphics (TOG) 31, 6 (2012), 1--11.","journal-title":"ACM Transactions on Graphics (TOG)"},{"key":"e_1_2_2_79_1","volume-title":"Knn-diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849","author":"Sheynin Shelly","year":"2022","unstructured":"Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. 2022. Knn-diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849 (2022)."},{"key":"e_1_2_2_80_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.02000"},{"key":"e_1_2_2_81_1","volume-title":"Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ArXiv abs\/1503.03585","author":"Sohl-Dickstein Jascha Narain","year":"2015","unstructured":"Jascha Narain Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ArXiv abs\/1503.03585 (2015). https:\/\/api.semanticscholar.org\/CorpusID:14888175"},{"key":"e_1_2_2_82_1","volume-title":"Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems 32","author":"Song Yang","year":"2019","unstructured":"Yang Song and Stefano Ermon. 2019. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems 32 (2019)."},{"key":"e_1_2_2_83_1","volume-title":"Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole.","author":"Song Yang","year":"2020","unstructured":"Yang Song, Jascha Narain Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2020. Score-Based Generative Modeling through Stochastic Differential Equations. ArXiv abs\/2011.13456 (2020). https:\/\/api.semanticscholar.org\/CorpusID:227209335"},{"key":"e_1_2_2_84_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-018-1126-y"},{"key":"e_1_2_2_85_1","volume-title":"Cheng Perng Phoo, and Bharath Hariharan","author":"Tang Luming","year":"2023","unstructured":"Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. 2023. Emergent Correspondence from Image Diffusion. arXiv preprint arXiv:2306.03881 (2023)."},{"key":"e_1_2_2_86_1","volume-title":"Human motion diffusion model. arXiv preprint arXiv:2209.14916","author":"Tevet Guy","year":"2022","unstructured":"Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. 2022. Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)."},{"key":"e_1_2_2_87_1","volume-title":"Proceedings, Part VII 16","author":"Uy Mikaela Angelina","year":"2020","unstructured":"Mikaela Angelina Uy, Jingwei Huang, Minhyuk Sung, Tolga Birdal, and Leonidas Guibas. 2020. Deformation-aware 3d model embedding and retrieval. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part VII 16. Springer, 397--413."},{"key":"e_1_2_2_88_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01154"},{"key":"e_1_2_2_89_1","doi-asserted-by":"publisher","DOI":"10.1109\/BESC48373.2019.8963264"},{"key":"e_1_2_2_90_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00275"},{"key":"e_1_2_2_91_1","volume-title":"Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images. ArXiv abs\/1804.01654","author":"Wang Nanyang","year":"2018","unstructured":"Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, W. Liu, and Yu-Gang Jiang. 2018. Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images. ArXiv abs\/1804.01654 (2018). https:\/\/api.semanticscholar.org\/CorpusID:4633214"},{"key":"e_1_2_2_92_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA46639.2022.9811568"},{"key":"e_1_2_2_93_1","volume-title":"Point transformer v3: Simpler, faster, stronger. arXiv preprint arXiv:2312.10035","author":"Wu Xiaoyang","year":"2023","unstructured":"Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. 2023. Point transformer v3: Simpler, faster, stronger. arXiv preprint arXiv:2312.10035 (2023)."},{"key":"e_1_2_2_94_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00289"},{"key":"e_1_2_2_95_1","volume-title":"2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Yu Alex","year":"2020","unstructured":"Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. 2020. pixelNeRF: Neural Radiance Fields from One or Few Images. 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020), 4576--4585. https:\/\/api.semanticscholar.org\/CorpusID:227254854"},{"key":"e_1_2_2_96_1","first-page":"27469","article-title":"Category-level 6d object pose estimation in the wild: A semi-supervised learning approach and a new dataset","volume":"35","author":"Ze Yanjie","year":"2022","unstructured":"Yanjie Ze and Xiaolong Wang. 2022. Category-level 6d object pose estimation in the wild: A semi-supervised learning approach and a new dataset. Advances in Neural Information Processing Systems 35 (2022), 27469--27483.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_97_1","volume-title":"LION: Latent Point Diffusion Models for 3D Shape Generation. ArXiv abs\/2210.06978","author":"Zeng Xiaohui","year":"2022","unstructured":"Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. 2022. LION: Latent Point Diffusion Models for 3D Shape Generation. ArXiv abs\/2210.06978 (2022). https:\/\/api.semanticscholar.org\/CorpusID:252872881"},{"key":"e_1_2_2_98_1","doi-asserted-by":"publisher","DOI":"10.1145\/3618342"},{"key":"e_1_2_2_99_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00872"},{"key":"e_1_2_2_100_1","volume-title":"Generative Category-level Object Pose Estimation via Diffusion Models. Advances in Neural Information Processing Systems 36","author":"Zhang Jiyao","year":"2024","unstructured":"Jiyao Zhang, Mingdong Wu, and Hao Dong. 2024. Generative Category-level Object Pose Estimation via Diffusion Models. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_2_2_101_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00068"},{"key":"e_1_2_2_102_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00577"}],"container-title":["ACM Transactions on Graphics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3658236","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3658236","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:04:16Z","timestamp":1750291456000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3658236"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,19]]},"references-count":102,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,7,19]]}},"alternative-id":["10.1145\/3658236"],"URL":"https:\/\/doi.org\/10.1145\/3658236","relation":{},"ISSN":["0730-0301","1557-7368"],"issn-type":[{"value":"0730-0301","type":"print"},{"value":"1557-7368","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,19]]},"assertion":[{"value":"2024-07-19","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}