{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,21]],"date-time":"2025-11-21T18:19:37Z","timestamp":1763749177619,"version":"3.41.0"},"reference-count":54,"publisher":"Association for Computing Machinery (ACM)","issue":"9","license":[{"start":{"date-parts":[[2024,8,16]],"date-time":"2024-08-16T00:00:00Z","timestamp":1723766400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Shenzhen Science and Technology Program","award":["JCYJ20210324124205016"],"award-info":[{"award-number":["JCYJ20210324124205016"]}]},{"name":"Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Artificial Intelligence Research Institute, Shenzhen MSU-BIT University","award":["61902333"],"award-info":[{"award-number":["61902333"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,9,30]]},"abstract":"<jats:p>Widely adopted digital cameras and smartphones have generated a large number of videos, which have brought a tremendous workload to video editors. Recently, a variety of automatic\/semi-automatic video editing methods have been proposed to tackle these issues in some specific areas. However, for the production of meeting recordings, the existing studies highly depend on extra equipment in the conference venues, such as the infrared camera or special microphone, which are not practical. In this article, we design and implement Meetor, a human-centered automatic video editing system for meeting recordings. 
The Meetor mainly contains three parts: an audio-based video synchronization algorithm, human-centered video content flaw detection algorithms, and an automatic video editing algorithm. Two main experiments are conducted from both objective and subjective aspects to evaluate the performance of the Meetor. The experimental results on a testbed illustrate that the proposed algorithms could achieve state-of-the-art (SOTA) performance in video content flaw detection. In addition, the conducted user study demonstrates that Meetor could generate meeting recordings with a satisfactory quality compared with professional video editors. Moreover, we also present a practical application of the Meetor in a university campus prototype, in which the Meetor is applied in the automatic editing of lecture recordings. All in all, the proposed Meetor can be utilized in practical applications to release the workload of professional video editors.<\/jats:p>","DOI":"10.1145\/3648681","type":"journal-article","created":{"date-parts":[[2024,2,19]],"date-time":"2024-02-19T12:20:53Z","timestamp":1708345253000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Meetor: A Human-Centered Automatic Video Editing System for Meeting Recordings"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6438-3790","authenticated-orcid":false,"given":"Haihan","family":"Duan","sequence":"first","affiliation":[{"name":"Shenzhen MSU-BIT University, Shenzhen, China"},{"name":"The Chinese University of Hong Kong, Shenzhen, China"},{"name":"Mohamed bin Zayed University of Artificial Intelligence, Masdar City, United Arab Emirates"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3374-0289","authenticated-orcid":false,"given":"Junhua","family":"Liao","sequence":"additional","affiliation":[{"name":"Sichuan University, Chengdu, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9379-2232","authenticated-orcid":false,"given":"Lehao","family":"Lin","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7690-8547","authenticated-orcid":false,"given":"Abdulmotaleb","family":"El Saddik","sequence":"additional","affiliation":[{"name":"Mohamed bin Zayed University of Artificial Intelligence, Masdar City, United Arab Emirates"},{"name":"University of Ottawa, Ottawa, Canada"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4658-0034","authenticated-orcid":false,"given":"Wei","family":"Cai","sequence":"additional","affiliation":[{"name":"The Chinese University of Hong Kong, Shenzhen, China"}]}],"member":"320","published-online":{"date-parts":[[2024,8,16]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"crossref","unstructured":"Usman Ali and Muhammad Tariq Mahmood. 2018. Analysis of blur measure operators for single image blur segmentation. Appl. Sci. 8 5 (2018) 807.","DOI":"10.3390\/app8050807"},{"key":"e_1_3_2_3_2","doi-asserted-by":"crossref","unstructured":"Ido Arev Hyun Soo Park Yaser Sheikh Jessica Hodgins and Ariel Shamir. 2014. Automatic editing of footage from multiple social cameras. ACM Trans. Graph. 33 4 (2014) 1\u201311.","DOI":"10.1145\/2601097.2601198"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/SYSMART.2016.7894491"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/2501988.2502052"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3479238"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3534088.3534349"},{"key":"e_1_3_2_8_2","doi-asserted-by":"crossref","unstructured":"Abdulmotaleb El Saddik. 2018. Digital twins: The convergence of multimedia technologies. IEEE Multim. 
25 2 (2018) 87\u201392.","DOI":"10.1109\/MMUL.2018.023121167"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2008.4517545"},{"key":"e_1_3_2_10_2","doi-asserted-by":"crossref","unstructured":"Martin A. Fischler and Robert C. Bolles. 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24 6 (1981) 381\u2013395.","DOI":"10.1145\/358669.358692"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/354401.354415"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10599-4_11"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00735"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.243"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.42"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01258-8_38"},{"key":"e_1_3_2_18_2","doi-asserted-by":"crossref","unstructured":"Saad M. Khan and Mubarak Shah. 2008. Tracking multiple occluding people by localizing on multiple scene planes. IEEE Trans. Pattern Anal. Mach. Intell. 31 3 (2008) 505\u2013519.","DOI":"10.1109\/TPAMI.2008.102"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.5555\/189359.189391"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/2072298.2072339"},{"key":"e_1_3_2_21_2","doi-asserted-by":"crossref","unstructured":"Mackenzie Leake Abe Davis Anh Truong and Maneesh Agrawala. 2017. Computational video editing for dialogue-driven scenes. ACM Trans. Graph. 36 4 (2017) 130\u20131.","DOI":"10.1145\/3072959.3073653"},{"key":"e_1_3_2_22_2","unstructured":"Florent Lefevre Vincent Bombardier Nicolas Krommenacker Patrick Charpentier and Bertrand Petat. 2018. Automatic video stream selection method by on-air microphone detection. 
International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI\u201918)."},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413725"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/365024.365310"},{"key":"e_1_3_2_25_2","doi-asserted-by":"crossref","unstructured":"Shijie Liu Xiaohua Tong Fengxiang Wang Wenzheng Sun Chengcheng Guo Zhen Ye Yanmin Jin Huan Xie and Peng Chen. 2016. Attitude jitter detection based on remotely sensed images and dense ground controls: A case study for Chinese ZY-3 satellite. IEEE J. Select. Topics Appl. Earth Observ. Rem. Sens. 9 12 (2016) 5760\u20135766.","DOI":"10.1109\/JSTARS.2016.2550482"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.1999.790410"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.190"},{"key":"e_1_3_2_28_2","doi-asserted-by":"crossref","unstructured":"Xiongkuo Min Guangtao Zhai Jiantao Zhou Mylene C. Q. Farias and Alan Conrad Bovik. 2020. Study of subjective and objective quality assessment of audio-visual signals. IEEE Trans. Image Process. 29 (2020) 6054\u20136068.","DOI":"10.1109\/TIP.2020.2988148"},{"key":"e_1_3_2_29_2","first-page":"3258","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition","author":"Ouyang Wanli","year":"2012","unstructured":"Wanli Ouyang and Xiaogang Wang. 2012. A discriminative deep model for pedestrian detection with occlusion handling. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3258\u20133265."},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.257"},{"key":"e_1_3_2_31_2","unstructured":"Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury Gregory Chanan Trevor Killeen Zeming Lin Natalia Gimelshein Luca Antiga Alban Desmaison Andreas K\u00f6pf Edward Yang Zach DeVito Martin Raison Alykhan Tejani Sasank Chilamkurthy Benoit Steiner Lu Fang Junjie Bai and Soumith Chintala. 2019. 
Pytorch: An imperative style high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019)."},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR.2000.903548"},{"key":"e_1_3_2_33_2","unstructured":"International Telecommunication Union. 1998. Recommendation ITU-R BT.1359-1: Relative Timing of Sound and Vision for Broadcasting. https:\/\/www.itu.int\/dms_pubrec\/itu-r\/rec\/bt\/R-REC-BT.1359-1-199811-I!!PDF-E.pdf. Accessed: 03-02-2024."},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/1357054.1357095"},{"key":"e_1_3_2_35_2","first-page":"1","volume-title":"IEEE International Symposium on Broadband Multimedia Systems and Broadcasting","author":"Rassool Reza","year":"2017","unstructured":"Reza Rassool. 2017. VMAF reproducibility: Validating a perceptual practical video quality metric. In IEEE International Symposium on Broadband Multimedia Systems and Broadcasting. IEEE, 1\u20132."},{"key":"e_1_3_2_36_2","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_3_2_37_2","first-page":"5","volume-title":"9th International Workshop on Video Processing and Consumer Electronics","author":"Staelens Nicolas","year":"2015","unstructured":"Nicolas Staelens, Margaret H. Pinson, Philip Corriveau, Filip De Turck, and Piet Demeester. 2015. Measuring video quality in the network: From quality of service to user experience. In 9th International Workshop on Video Processing and Consumer Electronics. 5\u20136."},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2011.5995364"},{"key":"e_1_3_2_39_2","first-page":"303","volume-title":"11th ACM International Conference on Multimedia","author":"Takemae Yoshinao","year":"2003","unstructured":"Yoshinao Takemae, Kazuhiro Otsuka, and Naoki Mukawa. 2003. 
Video cut editing rule based on participants\u2019 gaze in multiparty conversation. In 11th ACM International Conference on Multimedia. 303\u2013306."},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00281"},{"key":"e_1_3_2_41_2","doi-asserted-by":"crossref","unstructured":"Xiaohua Tong Zhen Ye Yusheng Xu Xinming Tang Shijie Liu Lingyun Li Huan Xie Fengxiang Wang Tianpeng Li and Zhonghua Hong. 2014. Framework of jitter detection and compensation for high resolution satellites. Rem. Sens. 6 5 (2014) 3944\u20133964.","DOI":"10.3390\/rs6053944"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00675"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1145\/2984511.2984569"},{"key":"e_1_3_2_45_2","first-page":"671","volume-title":"International Conference on Advances in Computer Entertainment","author":"Tsuchida Shuhei","year":"2017","unstructured":"Shuhei Tsuchida, Satoru Fukayama, and Masataka Goto. 2017. Automatic system for editing dance videos recorded using multiple cameras. In International Conference on Advances in Computer Entertainment. Springer, 671\u2013688."},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4020-6710-5_3"},{"key":"e_1_3_2_47_2","doi-asserted-by":"crossref","unstructured":"Miao Wang Guo-Wei Yang Shi-Min Hu Shing-Tung Yau and Ariel Shamir. 2019. Write-a-video: Computational video montage from themed text. ACM Trans. Graph. 38 6 (2019) 1\u201313.","DOI":"10.1145\/3355089.3356520"},{"key":"e_1_3_2_48_2","doi-asserted-by":"crossref","unstructured":"Mi Wang Ying Zhu Jun Pan Bo Yang and Quansheng Zhu. 2016. Satellite jitter detection and compensation using multispectral imagery. Rem. Sens. Lett. 
7 6 (2016) 513\u2013522.","DOI":"10.1080\/2150704X.2016.1160298"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00513"},{"key":"e_1_3_2_50_2","doi-asserted-by":"crossref","unstructured":"Alper Yilmaz Xin Li and Mubarak Shah. 2004. Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Trans. Pattern Anal. Mach. Intell. 26 11 (2004) 1531\u20131536.","DOI":"10.1109\/TPAMI.2004.96"},{"key":"e_1_3_2_51_2","doi-asserted-by":"crossref","unstructured":"Kai Zeng Yaonan Wang Jianxu Mao Junyang Liu Weixing Peng and Nankai Chen. 2018. A local metric for defocus blur detection based on CNN feature learning. IEEE Trans. Image Process. 28 5 (2018) 2107\u20132115.","DOI":"10.1109\/TIP.2018.2881830"},{"key":"e_1_3_2_52_2","doi-asserted-by":"crossref","unstructured":"Yingxue Zhang Yingbin Wang Feiyang Liu Zizheng Liu Yiming Li Daiqin Yang and Zhenzhong Chen. 2018. Subjective panoramic video quality assessment database for coding applications. IEEE Trans. Broadcast. 
64 2 (2018) 461\u2013473.","DOI":"10.1109\/TBC.2018.2811627"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00325"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01246-5_9"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/1995966.1996009"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3648681","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3648681","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:50:20Z","timestamp":1750287020000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3648681"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,16]]},"references-count":54,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2024,9,30]]}},"alternative-id":["10.1145\/3648681"],"URL":"https:\/\/doi.org\/10.1145\/3648681","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2024,8,16]]},"assertion":[{"value":"2022-12-08","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-02-09","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-08-16","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}