Abstract
Personalized 3D avatars require an animatable representation of digital humans. Creating such a representation instantly from monocular videos offers scalability to a broad class of users and wide-scale applications. In this paper, we present a fast, simple, yet effective method for creating animatable 3D digital humans from monocular videos. Our method exploits the efficiency of Gaussian splatting to model both 3D geometry and appearance. However, we observe that naively optimizing Gaussian splats yields inaccurate geometry, which in turn leads to poor animations.
This work demonstrates the need for accurate 3D mesh-based modelling of the human body when animatable avatars are digitized with Gaussian splats, and achieves it through a novel pipeline that rests on three key components: (a) implicit modelling of surface displacements and of the colors' spherical harmonics; (b) binding of 3D Gaussians to the respective triangular faces of the body template; and (c) a novel technique for rendering normals, followed by their auxiliary supervision. Exhaustive experiments on three benchmark datasets show that, under limited training time, our method achieves state-of-the-art results: it trains an order of magnitude faster than its closest competitor while delivering superior rendering and 3D reconstruction quality under pose changes. Our source code will be made publicly available.
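To make the pipeline concrete, the following is a minimal PyTorch sketch of two of the components above: binding one Gaussian to the barycenter of each triangular face of the body template, and a masked L1 loss on rendered normals for auxiliary supervision. All names (`bind_gaussians_to_faces`, `normal_supervision_loss`) and tensor shapes are our illustrative assumptions, not the paper's implementation, which additionally predicts surface displacements and spherical-harmonic colors with an implicit network.

```python
import torch


def bind_gaussians_to_faces(vertices: torch.Tensor, faces: torch.Tensor):
    """Place one 3D Gaussian at the barycenter of each triangle of the
    posed body template (e.g. SMPL), and compute per-face normals.

    vertices: (V, 3) posed template vertices
    faces:    (F, 3) integer triangle indices
    Returns (F, 3) Gaussian centers and (F, 3) unit face normals; the
    normals can both orient the splats and serve as the quantity
    rendered for auxiliary normal supervision.
    """
    tris = vertices[faces]               # (F, 3, 3) triangle corner positions
    centers = tris.mean(dim=1)           # barycenters -> Gaussian means
    e1 = tris[:, 1] - tris[:, 0]         # triangle edge vectors
    e2 = tris[:, 2] - tris[:, 0]
    normals = torch.linalg.cross(e1, e2) # unnormalized face normals
    return centers, torch.nn.functional.normalize(normals, dim=-1)


def normal_supervision_loss(rendered: torch.Tensor,
                            reference: torch.Tensor,
                            mask: torch.Tensor) -> torch.Tensor:
    """Masked L1 loss between normals rendered from the splats and
    reference normals (e.g. from an off-the-shelf normal estimator).

    rendered, reference: (H, W, 3); mask: (H, W) foreground mask in {0, 1}.
    """
    per_pixel = (rendered - reference).abs().sum(dim=-1)
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1)


# Example with SMPL's dimensions (6890 vertices, 13776 faces):
verts = torch.rand(6890, 3)
faces = torch.randint(0, 6890, (13776, 3))
centers, normals = bind_gaussians_to_faces(verts, faces)
```

Anchoring splats to faces in this way keeps the geometry tied to the animatable template, so posing the mesh transports the Gaussians with it instead of letting them drift freely during optimization.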
Acknowledgements
We would like to express our sincere gratitude to Mr. Kobid Upadhyay for his invaluable assistance in creating the figure and rendering the 3D models in Blender. His contributions have significantly enhanced the clarity and quality of our visual presentations.
We would also like to thank Alternative Technology (https://alternative.com.np) for providing us with an RTX 4090 for experimentation, which greatly facilitated our research.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Paudel, P., Khanal, A., Paudel, D.P., Tandukar, J., Chhatkuli, A. (2025). iHuman: Instant Animatable Digital Humans From Monocular Videos. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15133. Springer, Cham. https://doi.org/10.1007/978-3-031-73226-3_18
DOI: https://doi.org/10.1007/978-3-031-73226-3_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73225-6
Online ISBN: 978-3-031-73226-3