{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,2]],"date-time":"2026-01-02T07:49:23Z","timestamp":1767340163169,"version":"3.41.2"},"reference-count":57,"publisher":"Wiley","issue":"7","license":[{"start":{"date-parts":[[2024,11,7]],"date-time":"2024-11-07T00:00:00Z","timestamp":1730937600000},"content-version":"vor","delay-in-days":37,"URL":"http:\/\/onlinelibrary.wiley.com\/termsAndConditions#vor"}],"content-domain":{"domain":["onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["Computer Graphics Forum"],"published-print":{"date-parts":[[2024,10]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The emergence of learning\u2010based motion in\u2010betweening techniques offers animators a more efficient way to animate characters. However, existing non\u2010generative methods either struggle to support long transition generation or produce results that lack diversity. Meanwhile, diffusion models have shown promising results in synthesizing diverse and high\u2010quality motions driven by text and keyframes. However, in these methods, keyframes often serve as a guide rather than a strict constraint and can sometimes be ignored when keyframes are sparse. To address these issues, we propose a lightweight yet effective diffusion\u2010based motion in\u2010betweening framework that generates animations conforming to keyframe constraints. We incorporate keyframe constraints into the training phase to enhance robustness in handling various constraint densities. Moreover, we employ relative positional encoding to improve the model's generalization on long\u2010range in\u2010betweening tasks. This approach enables the model to learn from short animations while generating realistic in\u2010betweening motions spanning thousands of frames. We conduct extensive experiments to validate our framework using the newly proposed metrics K\u2010FID, K\u2010Diversity, and K\u2010Error, designed to evaluate generative in\u2010betweening methods. Results demonstrate that our method outperforms existing diffusion\u2010based methods across various lengths and keyframe densities. We also show that our method can be applied to text\u2010driven motion synthesis, offering fine\u2010grained control over the generated results.<\/jats:p>","DOI":"10.1111\/cgf.15260","type":"journal-article","created":{"date-parts":[[2024,11,8]],"date-time":"2024-11-08T07:05:03Z","timestamp":1731049503000},"update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Robust Diffusion\u2010based Motion In\u2010betweening"],"prefix":"10.1111","volume":"43","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6715-6339","authenticated-orcid":false,"given":"Jia","family":"Qin","sequence":"first","affiliation":[{"name":"State Key Laboratory of CAD&amp;CG Zhejiang University"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-7063-745X","authenticated-orcid":false,"given":"Peng","family":"Yan","sequence":"additional","affiliation":[{"name":"State Key Laboratory of CAD&amp;CG Zhejiang University"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-7756-4498","authenticated-orcid":false,"given":"Bo","family":"An","sequence":"additional","affiliation":[{"name":"Zhejiang University"}]}],"member":"311","published-online":{"date-parts":[[2024,11,7]]},"reference":[{"key":"e_1_2_7_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/3DV.2019.00084"},{"key":"e_1_2_7_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/1276377.1276387"},{"key":"e_1_2_7_4_2","doi-asserted-by":"crossref","unstructured":"ChenX. JiangB. LiuW. HuangZ. FuB. ChenT. YuG.: Executing your commands via motion diffusion in latent space. InProceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2023) pp.18000\u201318010. 1 3 9","DOI":"10.1109\/CVPR52729.2023.01726"},{"key":"e_1_2_7_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3306346.3322938"},{"key":"e_1_2_7_6_2","doi-asserted-by":"crossref","unstructured":"ChenR. ShiM. HuangS. TanP. KomuraT. ChenX.: Taming diffusion probabilistic models for character control. InACM SIGGRAPH 2024 Conference Papers(2024) pp.1\u201310. 3","DOI":"10.1145\/3641519.3657440"},{"key":"e_1_2_7_7_2","doi-asserted-by":"crossref","unstructured":"CaiY. WangY. ZhuY. ChamT.-J. CaiJ. YuanJ. LiuJ. ZhengC. YanS. DingH. et al.: A unified 3d human motion synthesis model via conditional variational auto-encoder. InProceedings of the IEEE\/CVF International Conference on Computer Vision(2021) pp.11645\u201311655. 2","DOI":"10.1109\/ICCV48922.2021.01144"},{"key":"e_1_2_7_8_2","doi-asserted-by":"crossref","unstructured":"DabralR. MughalM. H. GolyanikV. TheobaltC.: Mofusion: A framework for denoising-diffusion-based motion synthesis. InProceedings of the IEEE\/CVF conference on computer vision and pattern recognition(2023) pp.9760\u20139770. 2 9","DOI":"10.1109\/CVPR52729.2023.00941"},{"key":"e_1_2_7_9_2","first-page":"8780","article-title":"Diffusion models beat gans on image synthesis","volume":"34","author":"Dhariwal P.","year":"2021","journal-title":"Advances in neural information processing systems"},{"key":"e_1_2_7_10_2","unstructured":"DuanY. ShiT. ZouZ. LinY. QianZ. ZhangB. YuanY.: Single-shot motion completion with transformer.arXiv preprint arXiv:2103.00776(2021). 1 2"},{"key":"e_1_2_7_11_2","unstructured":"GhoshA. CheemaN. OguzC. TheobaltC. SlusallekP.: Synthesis of compositional animations from textual descriptions. InProceedings of the IEEE\/CVF international conference on computer vision(2021) pp.1396\u20131406. 9"},{"key":"e_1_2_7_12_2","doi-asserted-by":"crossref","unstructured":"GleicherM.: Motion editing with spacetime constraints. InProceedings of the 1997 symposium on Interactive 3D graphics(1997) pp.139\u2013ff. 2","DOI":"10.1145\/253284.253321"},{"key":"e_1_2_7_13_2","doi-asserted-by":"crossref","unstructured":"GuoC. ZuoX. WangS. ZouS. SunQ. DengA. GongM. ChengL.: Action2motion: Conditioned generation of 3d human motions. InProceedings of the 28th ACM International Conference on Multimedia(2020) pp.2021\u20132029. 2 5","DOI":"10.1145\/3394171.3413635"},{"key":"e_1_2_7_14_2","doi-asserted-by":"crossref","unstructured":"GuoC. ZouS. ZuoX. WangS. JiW. LiX. ChengL.: Generating diverse and natural 3d human motions from text. InProceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)(June2022) pp.5152\u20135161. 5","DOI":"10.1109\/CVPR52688.2022.00509"},{"key":"e_1_2_7_15_2","doi-asserted-by":"crossref","unstructured":"GuoC. ZouS. ZuoX. WangS. JiW. LiX. ChengL.: Generating diverse and natural 3d human motions from text. InProceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2022) pp.5152\u20135161. 9","DOI":"10.1109\/CVPR52688.2022.00509"},{"key":"e_1_2_7_16_2","unstructured":"HoJ. ChanW. SahariaC. WhangJ. GaoR. GritsenkoA. KingmaD. P. PooleB. NorouziM. FleetD. J. et al.: Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303(2022). 2"},{"key":"e_1_2_7_17_2","doi-asserted-by":"crossref","unstructured":"HernandezA. GallJ. Moreno-NoguerF.: Human motion prediction via spatio-temporal inpainting. InProceedings of the IEEE\/CVF International Conference on Computer Vision(2019) pp.7134\u20137143. 2","DOI":"10.1109\/ICCV.2019.00723"},{"key":"e_1_2_7_18_2","first-page":"6840","article-title":"Denoising diffusion probabilistic models","volume":"33","author":"Ho J.","year":"2020","journal-title":"Advances in neural information processing systems"},{"key":"e_1_2_7_19_2","doi-asserted-by":"crossref","unstructured":"HarveyF. G. PalC.: Recurrent transition networks for character locomotion. InSIGGRAPH Asia 2018 Technical Briefs(2018) pp.1\u20134. 1 2","DOI":"10.1145\/3283254.3283277"},{"key":"e_1_2_7_20_2","first-page":"8633","article-title":"Video diffusion models","volume":"35","author":"Ho J.","year":"2022","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_7_21_2","unstructured":"HuangC.-Z. A. VaswaniA. UszkoreitJ. ShazeerN. SimonI. HawthorneC. DaiA. M. HoffmanM. D. DinculescuM. EckD.: Music transformer.arXiv preprint arXiv:1809.04281(2018). 4"},{"key":"e_1_2_7_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3386569.3392480"},{"key":"e_1_2_7_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/3DV50981.2020.00102"},{"key":"e_1_2_7_24_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v37i7.25996"},{"key":"e_1_2_7_25_2","unstructured":"KongZ. PingW. HuangJ. ZhaoK. CatanzaroB.: Diffwave: A versatile diffusion model for audio synthesis.arXiv preprint arXiv:2009.09761(2020). 2"},{"key":"e_1_2_7_26_2","unstructured":"KarunratanakulK. PreechakulK. SuwajanakornS. TangS.: Guided motion diffusion for controllable human motion synthesis. InProceedings of the IEEE\/CVF International Conference on Computer Vision(2023) pp.2151\u20132162. 1 3 9"},{"key":"e_1_2_7_27_2","doi-asserted-by":"crossref","unstructured":"LehrmannA. M. GehlerP. V. NowozinS.: Efficient nonlinear markov models for human motion. InProceedings of the IEEE conference on computer vision and pattern recognition(2014) pp.1314\u20131321. 2","DOI":"10.1109\/CVPR.2014.171"},{"key":"e_1_2_7_28_2","doi-asserted-by":"crossref","unstructured":"LuoS. HuW.: Diffusion probabilistic models for 3d point cloud generation. InProceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2021) pp.2837\u20132845. 2","DOI":"10.1109\/CVPR46437.2021.00286"},{"key":"e_1_2_7_29_2","unstructured":"MaJ. BaiS. ZhouC.: Pretrained diffusion models for unified human motion synthesis.arXiv preprint arXiv:2212.02837(2022). 2"},{"key":"e_1_2_7_30_2","unstructured":"MahmoodN. GhorbaniN. TrojeN. F. Pons-MollG. BlackM. J.: Amass: Archive of motion capture as surface shapes. InProceedings of the IEEE\/CVF international conference on computer vision(2019) pp.5442\u20135451. 5"},{"key":"e_1_2_7_31_2","unstructured":"OreshkinB. N. ValkanasA. HarveyF. G. M\u00e9nardL.-S. BocqueletF. CoatesM. J.: Motion in-betweening via deep \u0394-interpolator.IEEE Transactions on Visualization and Computer Graphics(2023). 2"},{"key":"e_1_2_7_32_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-019-01245-6"},{"key":"e_1_2_7_33_2","unstructured":"PooleB. JainA. BarronJ. T. MildenhallB.: Dreamfusion: Text-to-3d using 2d diffusion.arXiv preprint arXiv:2209.14988(2022). 2"},{"key":"e_1_2_7_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3550454.3555454"},{"key":"e_1_2_7_35_2","unstructured":"RombachR. BlattmannA. LorenzD. EsserP. OmmerB.: High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE\/CVF conference on computer vision and pattern recognition(2022) pp.10684\u201310695. 2"},{"issue":"2","key":"e_1_2_7_36_2","first-page":"3","article-title":"Hierarchical text-conditional image generation with clip latents","volume":"1","author":"Ramesh A.","year":"2022","journal-title":"arXiv preprint arXiv:2204.06125"},{"key":"e_1_2_7_37_2","first-page":"8748","volume-title":"International conference on machine learning","author":"Radford A.","year":"2021"},{"key":"e_1_2_7_38_2","unstructured":"RaabS. LeibovitchI. TevetG. ArarM. BermanoA. H. Cohen-OrD.: Single motion diffusion. InThe Twelfth International Conference on Learning Representations (ICLR)(2024). 5"},{"key":"e_1_2_7_39_2","first-page":"36479","article-title":"Photorealistic text-to-image diffusion models with deep language understanding","volume":"35","author":"Saharia C.","year":"2022","journal-title":"Advances in neural information processing systems"},{"key":"e_1_2_7_40_2","unstructured":"ShahamT. R. DekelT. MichaeliT.: Singan: Learning a generative model from a single natural image. InProceedings of the IEEE\/CVF international conference on computer vision(2019) pp.4570\u20134580. 5"},{"key":"e_1_2_7_41_2","first-page":"2256","volume-title":"International conference on machine learning","author":"Sohl-Dickstein J.","year":"2015"},{"key":"e_1_2_7_42_2","article-title":"Generative modeling by estimating gradients of the data distribution","volume":"32","author":"Song Y.","year":"2019","journal-title":"Advances in neural information processing systems"},{"issue":"4","key":"e_1_2_7_43_2","first-page":"4713","article-title":"Image super-resolution via iterative refinement","volume":"45","author":"Saharia C.","year":"2022","journal-title":"IEEE transactions on pattern analysis and machine intelligence"},{"key":"e_1_2_7_44_2","unstructured":"SongJ. MengC. ErmonS.: Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502(2020). 2 9"},{"key":"e_1_2_7_45_2","unstructured":"SongY. Sohl-DicksteinJ. KingmaD. P. KumarA. ErmonS. PooleB.: Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456(2020). 2"},{"key":"e_1_2_7_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3606921"},{"key":"e_1_2_7_47_2","unstructured":"ShafirY. TevetG. KaponR. BermanoA. H.: Human motion diffusion model. InThe Twelfth International Conference on Learning Representations (ICLR)(2024). 3"},{"key":"e_1_2_7_48_2","unstructured":"TevetG. RaabS. GordonB. ShafirY. Cohen-OrD. BermanoA. H.: Human motion diffusion model. InThe Eleventh International Conference on Learning Representations (ICLR)(2023). 1 3 6 8 9"},{"key":"e_1_2_7_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3528223.3530090"},{"key":"e_1_2_7_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2007.1167"},{"key":"e_1_2_7_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/378456.378507"},{"key":"e_1_2_7_52_2","doi-asserted-by":"crossref","unstructured":"WitkinA. PopovicZ.: Motion warping. InProceedings of the 22nd annual conference on Computer graphics and interactive techniques(1995) pp.105\u2013108. 2","DOI":"10.1145\/218380.218422"},{"key":"e_1_2_7_53_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i6.28401"},{"key":"e_1_2_7_54_2","unstructured":"YuanY. SongJ. IqbalU. VahdatA. KautzJ.: Phys-diff: Physics-guided human motion diffusion model. InProceedings of the IEEE\/CVF International Conference on Computer Vision(2023) pp.16010\u201316021. 1 3 9"},{"key":"e_1_2_7_55_2","unstructured":"ZhouY. BarnesC. LuJ. YangJ. LiH.: On the continuity of rotation representations in neural networks. InProceedings of the IEEE\/CVF conference on computer vision and pattern recognition(2019) pp.5745\u20135753. 3"},{"key":"e_1_2_7_56_2","unstructured":"ZhangM. CaiZ. PanL. HongF. GuoX. YangL. LiuZ.: Motiondiffuse: Text-driven human motion generation with diffusion model.IEEE Transactions on Pattern Analysis and Machine Intelligence(2024). 1 3"},{"key":"e_1_2_7_57_2","doi-asserted-by":"crossref","unstructured":"ZhouL. DuY. WuJ.: 3d shape generation and completion through point-voxel diffusion. InProceedings of the IEEE\/CVF international conference on computer vision(2021) pp.5826\u20135835. 2","DOI":"10.1109\/ICCV48922.2021.00577"},{"key":"e_1_2_7_58_2","doi-asserted-by":"crossref","unstructured":"ZhouZ. WangB.: Ude: A unified driving engine for human motion generation. InProceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition(2023) pp.5632\u20135641. 2","DOI":"10.1109\/CVPR52729.2023.00545"}],"container-title":["Computer Graphics Forum"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1111\/cgf.15260","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,11,14]],"date-time":"2024-11-14T08:09:26Z","timestamp":1731571766000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1111\/cgf.15260"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10]]},"references-count":57,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2024,10]]}},"alternative-id":["10.1111\/cgf.15260"],"URL":"https:\/\/doi.org\/10.1111\/cgf.15260","archive":["Portico"],"relation":{},"ISSN":["0167-7055","1467-8659"],"issn-type":[{"type":"print","value":"0167-7055"},{"type":"electronic","value":"1467-8659"}],"subject":[],"published":{"date-parts":[[2024,10]]},"assertion":[{"value":"2024-11-07","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"e15260"}}