EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

Conference paper in Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15060)


Abstract

Full-body egocentric pose estimation from head and hand poses alone has become an active area of research to power articulate avatar representations on headset-based platforms. However, existing methods over-rely on the indoor motion-capture spaces in which datasets were recorded, while simultaneously assuming continuous joint motion capture and uniform body dimensions. We propose EgoPoser to overcome these limitations with four main contributions. 1) EgoPoser robustly models body pose from intermittent hand position and orientation tracking only when inside a headset’s field of view. 2) We rethink input representations for headset-based ego-pose estimation and introduce a novel global motion decomposition method that predicts full-body pose independent of global positions. 3) We enhance pose estimation by capturing longer motion time series through an efficient SlowFast module design that maintains computational efficiency. 4) EgoPoser generalizes across various body shapes for different users. We experimentally evaluate our method and show that it outperforms state-of-the-art methods both qualitatively and quantitatively while maintaining a high inference speed of over 600 fps. EgoPoser establishes a robust baseline for future work where full-body pose estimation no longer needs to rely on outside-in capture and can scale to large-scale and unseen environments.
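To make the first two contributions concrete, below is a minimal, hypothetical Python sketch, not the authors' implementation: it expresses hand inputs relative to the head so the features are independent of global position, and it masks hand inputs that fall outside the headset's field of view to emulate intermittent tracking. All names (in_fov, make_input) and the 55-degree half-angle threshold are assumptions for illustration.

import numpy as np

def in_fov(head_pos, head_fwd, hand_pos, half_angle_deg=55.0):
    # True if the hand lies within a viewing cone around the head's
    # forward direction; the half-angle is an assumed value.
    v = hand_pos - head_pos
    dist = np.linalg.norm(v)
    if dist < 1e-8:
        return True
    return np.dot(v / dist, head_fwd) >= np.cos(np.deg2rad(half_angle_deg))

def make_input(head_pos, head_fwd, left_hand, right_hand):
    # Head-relative hand features with per-hand visibility flags;
    # out-of-view hands are zeroed rather than extrapolated.
    feats = []
    for hand in (left_hand, right_hand):
        visible = in_fov(head_pos, head_fwd, hand)
        rel = (hand - head_pos) if visible else np.zeros(3)
        feats.append(np.concatenate([rel, [float(visible)]]))
    return np.concatenate(feats)

head = np.array([0.0, 1.6, 0.0])
fwd = np.array([0.0, 0.0, 1.0])  # head looks along +z
x = make_input(head, fwd,
               left_hand=np.array([0.2, 1.2, 0.4]),    # in view: kept
               right_hand=np.array([0.2, 1.2, -0.5]))  # behind head: masked
print(x)  # left-hand offset with flag 1.0, right hand zeroed with flag 0.0

In the same illustrative spirit, a SlowFast-style temporal split (after the SlowFast video-recognition design the abstract references) can capture a long motion history without a proportional compute cost by pairing a strided long window with a dense recent window; the module in the paper may differ in its details.

def slowfast_windows(seq, slow_len=40, stride=4, fast_len=10):
    # seq: per-frame feature vectors, newest last. The "slow" branch
    # subsamples a long history; the "fast" branch keeps recent frames dense.
    slow = seq[-slow_len * stride::stride]
    fast = seq[-fast_len:]
    return slow, fast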



Acknowledgement

We sincerely thank Andreas Fender for his help with data recording, testing, and manuscript proofreading.

Author information


Corresponding author

Correspondence to Jiaxi Jiang.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Jiang, J., Streli, P., Meier, M., Holz, C. (2025). EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15060. Springer, Cham. https://doi.org/10.1007/978-3-031-72627-9_16
