Abstract
Full-body egocentric pose estimation from head and hand poses alone has become an active area of research to power articulated avatar representations on headset-based platforms. However, existing methods over-rely on the indoor motion-capture spaces in which their datasets were recorded, while simultaneously assuming continuous joint motion capture and uniform body dimensions. We propose EgoPoser to overcome these limitations with four main contributions. 1) EgoPoser robustly models body pose from hand position and orientation input that is only intermittently available, i.e., when the hands are inside the headset's field of view. 2) We rethink input representations for headset-based ego-pose estimation and introduce a novel global motion decomposition method that predicts full-body pose independent of global position. 3) We capture longer motion time series through a SlowFast module design that improves pose estimation while maintaining computational efficiency. 4) EgoPoser generalizes across the varying body shapes of different users. We experimentally evaluate our method and show that it outperforms state-of-the-art methods both qualitatively and quantitatively while maintaining a high inference speed of over 600 fps. EgoPoser establishes a robust baseline for future work where full-body pose estimation no longer needs to rely on outside-in capture and can scale to large-scale and unseen environments.
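The abstract names three concrete mechanisms: field-of-view-aware intermittent hand input, a position-invariant global motion decomposition, and two-rate (SlowFast) temporal sampling. The NumPy sketch below illustrates these ideas under our own assumptions; it is not the authors' implementation, and the function names, tensor shapes, the -z viewing convention, and the fast_len/stride/half_fov_deg parameters are all illustrative.

```python
import numpy as np

def decompose_global_motion(head_pos, hand_pos):
    """Split global input into a position-invariant part plus a velocity cue.
    head_pos: (T, 3) world positions; hand_pos: (T, 2, 3) for two hands."""
    anchor = head_pos.copy()
    anchor[:, 1] = 0.0                       # drop horizontal position, keep height
    rel_head = head_pos - anchor             # (T, 3): only the head height survives
    rel_hands = hand_pos - anchor[:, None]   # (T, 2, 3): hands relative to the head column
    head_vel = np.diff(head_pos, axis=0, prepend=head_pos[:1])  # (T, 3) global motion cue
    return rel_head, rel_hands, head_vel

def slowfast_windows(features, fast_len=40, stride=4):
    """Two-rate temporal sampling: a dense recent window ('fast') plus a
    coarsely subsampled long history ('slow') at similar compute cost."""
    fast = features[-fast_len:]
    slow = features[::stride]
    return slow, fast

def mask_out_of_view(hands_in_head, half_fov_deg=55.0):
    """Zero out hand input whenever a hand leaves an assumed conical field of
    view around the head's viewing direction (-z here), mimicking the
    intermittent availability of inside-out hand tracking."""
    fwd = np.array([0.0, 0.0, -1.0])
    d = hands_in_head / (np.linalg.norm(hands_in_head, axis=-1, keepdims=True) + 1e-8)
    cos_ang = d @ fwd                        # (T, 2) cosine to the viewing direction
    visible = cos_ang > np.cos(np.deg2rad(half_fov_deg))
    return hands_in_head * visible[..., None], visible
```

In a full pipeline, the slow and fast streams would be encoded separately and fused before a pose decoder; here they are only returned to show the sampling pattern.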
Acknowledgement
We sincerely thank Andreas Fender for his help with data recording, testing, and manuscript proofreading.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Jiang, J., Streli, P., Meier, M., Holz, C. (2025). EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15060. Springer, Cham. https://doi.org/10.1007/978-3-031-72627-9_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72626-2
Online ISBN: 978-3-031-72627-9
eBook Packages: Computer Science; Computer Science (R0); Springer Nature Proceedings Computer Science