{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:07:52Z","timestamp":1760144872703,"version":"build-2065373602"},"reference-count":83,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2024,5,24]],"date-time":"2024-05-24T00:00:00Z","timestamp":1716508800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Group-activity scene graph (GASG) generation is a challenging task in computer vision, aiming to anticipate and describe relationships between subjects and objects in video sequences. Traditional video scene graph generation (VidSGG) methods focus on retrospective analysis, limiting their predictive capabilities. To enrich the scene-understanding capabilities, we introduced a GASG dataset extending the JRDB dataset with nuanced annotations involving appearance, interaction, position, relationship, and situation attributes. This work also introduces an innovative approach, a Hierarchical Attention\u2013Flow (HAtt-Flow) mechanism, rooted in flow network theory to enhance GASG performance. Flow\u2013attention incorporates flow conservation principles, fostering competition for sources and allocation for sinks, effectively preventing the generation of trivial attention. Our proposed approach offers a unique perspective on attention mechanisms, where conventional \u201cvalues\u201d and \u201ckeys\u201d are transformed into sources and sinks, respectively, creating a novel framework for attention-based models. Through extensive experiments, we demonstrate the effectiveness of our Hatt-Flow model and the superiority of our proposed flow\u2013attention mechanism. 
This work represents a significant advancement in predictive video scene understanding, providing valuable insights and techniques for applications that require real-time relationship prediction in video data.<\/jats:p>","DOI":"10.3390\/s24113372","type":"journal-article","created":{"date-parts":[[2024,5,24]],"date-time":"2024-05-24T06:59:04Z","timestamp":1716533944000},"page":"3372","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group-Activity Scene Graph Generation in Videos"],"prefix":"10.3390","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0550-1176","authenticated-orcid":false,"given":"Naga Venkata Sai Raviteja","family":"Chappa","sequence":"first","affiliation":[{"name":"Department of EECS, University of Arkansas, Fayetteville, AR 72701, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1517-1382","authenticated-orcid":false,"given":"Pha","family":"Nguyen","sequence":"additional","affiliation":[{"name":"Department of EECS, University of Arkansas, Fayetteville, AR 72701, USA"}]},{"given":"Thi Hoang Ngan","family":"Le","sequence":"additional","affiliation":[{"name":"Department of EECS, University of Arkansas, Fayetteville, AR 72701, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1913-6488","authenticated-orcid":false,"given":"Page Daniel","family":"Dobbs","sequence":"additional","affiliation":[{"name":"Department of Health, Human Performance and Recreation, University of Arkansas, Fayetteville, AR 72701, USA"}]},{"given":"Khoa","family":"Luu","sequence":"additional","affiliation":[{"name":"Department of EECS, University of Arkansas, Fayetteville, AR 72701, USA"}]}],"member":"1968","published-online":{"date-parts":[[2024,5,24]]},"reference":[{"key":"ref_1","unstructured":"Gupta, S., and Malik, J. (2015). Visual semantic role labeling. 
arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Gkioxari, G., Girshick, R., Doll\u00e1r, P., and He, K. (2018, January 18\u201322). Detecting and recognizing human-object interactions. Proceedings of the IEEE Conference on Computer Vision and Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00872"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Kato, K., Li, Y., and Gupta, A. (2018, January 8\u201314). Compositional learning for human object interaction. Proceedings of the European Conference on Computer Vision, Munich, Germany.","DOI":"10.1007\/978-3-030-01264-9_15"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Chao, Y.W., Liu, Y., Liu, X., Zeng, H., and Deng, J. (2018, January 12\u201315). Learning to detect human-object interactions. Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA.","DOI":"10.1109\/WACV.2018.00048"},{"key":"ref_5","unstructured":"Wang, T., Anwer, R.M., Khan, M.H., Khan, F.S., Pang, Y., Shao, L., and Laaksonen, J. (November, January 27). Deep contextual attention for human-object interaction detection. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Li, Y.L., Zhou, S., Huang, X., Xu, L., Ma, Z., Fang, H.S., Wang, Y., and Lu, C. (2019, January 16\u201320). Transferable interactiveness knowledge for human-object interaction detection. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00370"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Zhou, T., Wang, W., Qi, S., Ling, H., and Shen, J. (2020, January 16\u201320). Cascaded human-object interaction recognition. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00432"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Wang, T., Yang, T., Danelljan, M., Khan, F.S., Zhang, X., and Sun, J. (2020, January 14\u201316). Learning human-object interaction detection using interaction points. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00417"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Hou, Z., Peng, X., Qiao, Y., and Tao, D. (2020, January 23\u201328). Visual compositional learning for human-object interaction detection. Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58555-6_35"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Li, Y.L., Liu, X., Lu, H., Wang, S., Liu, J., Li, J., and Lu, C. (2020, January 14\u201319). Detailed 2d\u20133d joint representation for human-object interaction. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01018"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Gao, C., Xu, J., Zou, Y., and Huang, J.B. (2020, January 23\u201328). Drg: Dual relation graph for human-object interaction detection. Proceedings of the 16th European Conference ECCV, Glasgow, UK.","DOI":"10.1007\/978-3-030-58610-2_41"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Kim, B., Choi, T., Kang, J., and Kim, H.J. (2020, January 23\u201328). Uniondet: Union-level detector towards real-time human-object interaction detection. Proceedings of the 16th European Conference ECCV, Glasgow, UK.","DOI":"10.1007\/978-3-030-58555-6_30"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Liu, Y., Chen, Q., and Zisserman, A. (2020, January 23\u201328). Amplifying key cues for human-object-interaction detection. 
Proceedings of the 16th European Conference ECCV, Glasgow, UK.","DOI":"10.1007\/978-3-030-58568-6_15"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Tamura, M., Ohashi, H., and Yoshinaga, T. (2021, January 19\u201325). Qpic: Query-based pairwise human-object interaction detection with image-wide contextual information. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Virtual.","DOI":"10.1109\/CVPR46437.2021.01027"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Hou, Z., Yu, B., Qiao, Y., Peng, X., and Tao, D. (2021, January 19\u201325). Affordance transfer learning for human-object interaction detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Virtual.","DOI":"10.1109\/CVPR46437.2021.00056"},{"key":"ref_16","first-page":"17209","article-title":"Mining the benefits of two-stage and one-stage hoi detection","volume":"34","author":"Zhang","year":"2021","journal-title":"NeurIPS"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Wang, S., Duan, Y., Ding, H., Tan, Y.P., Yap, K.H., and Yuan, J. (2022, January 19\u201324). Learning Transferable Human-Object Interaction Detector with Natural Language Supervision. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00101"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Zhang, F.Z., Campbell, D., and Gould, S. (2022, January 19\u201324). Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.01947"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Kim, B., Lee, J., Kang, J., Kim, E.S., and Kim, H.J. (2021, January 19\u201325). Hotr: End-to-end human-object interaction detection with transformers. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Virtual.","DOI":"10.1109\/CVPR46437.2021.00014"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Zou, C., Wang, B., Hu, Y., Liu, J., Wu, Q., Zhao, Y., Li, B., Zhang, C., Zhang, C., and Wei, Y. (2021, January 19\u201325). End-to-end human object interaction detection with hoi transformer. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Virtual.","DOI":"10.1109\/CVPR46437.2021.01165"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Amer, M.R., Xie, D., Zhao, M., Todorovic, S., and Zhu, S.C. (2012, January 7\u201313). Cost-sensitive top-down\/bottom-up inference for multiscale activity recognition. Proceedings of the ECCV: 12th European Conference on Computer Vision, Florence, Italy.","DOI":"10.1007\/978-3-642-33765-9_14"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Amer, M.R., Todorovic, S., Fern, A., and Zhu, S.C. (2013, January 1\u20138). Monte carlo tree search for scheduling activity recognition. Proceedings of the ICCV International Conference on Computer Vision, Sydney, Australia.","DOI":"10.1109\/ICCV.2013.171"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Amer, M.R., Lei, P., and Todorovic, S. (2014, January 6\u201312). Hirf: Hierarchical random field for collective activity recognition in videos. Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10599-4_37"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"800","DOI":"10.1109\/TPAMI.2015.2465955","article-title":"Sum product networks for activity recognition","volume":"38","author":"Amer","year":"2015","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1549","DOI":"10.1109\/TPAMI.2011.228","article-title":"Discriminative latent models for recognizing contextual group activities","volume":"34","author":"Lan","year":"2011","journal-title":"IEEE Trans. Anal. Mach. Intell."},{"key":"ref_26","unstructured":"Lan, T., Sigal, L., and Mori, G. (2012, January 16\u201321). Social roles in hierarchical models for human activity recognition. Proceedings of the 2012 IEEE Conference on Computer Vision and Recognition, Providence, RI, USA."},{"key":"ref_27","unstructured":"Shu, T., Xie, D., Rothrock, B., Todorovic, S., and Chun Zhu, S. (2015, January 7\u201312). Joint inference of groups, events and human roles in aerial videos. Proceedings of the IEEE\/CVF Conference on Computer Vision and Recognition, Boston, MA, USA."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Wang, Z., Shi, Q., Shen, C., and Van Den Hengel, A. (2013, January 23\u201328). Bilinear programming for human activity recognition with unknown mrf graphs. Proceedings of the IEEE\/CVF Conference on Computer Vision and Recognition, Portland, OR, USA.","DOI":"10.1109\/CVPR.2013.221"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Zhang, H., Kyaw, Z., Chang, S.F., and Chua, T.S. (2017, January 21\u201326). Visual translation embedding network for visual relation detection. Proceedings of the IEEE\/CVF Conference on Computer Vision and Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.331"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"3820","DOI":"10.1109\/TPAMI.2020.2992222","article-title":"Contextual translation embedding for visual relationship detection and scene graph generation","volume":"43","author":"Hung","year":"2020","journal-title":"IEEE Trans. Anal. Mach. Intell."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Tang, K., Niu, Y., Huang, J., Shi, J., and Zhang, H. (2020, January 13\u201319). 
Unbiased scene graph generation from biased training. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Virtual.","DOI":"10.1109\/CVPR42600.2020.00377"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Desai, A., Wu, T.Y., Tripathi, S., and Vasconcelos, N. (2021, January 11\u201317). Learning of Visual Relations: The Devil is in the Tails. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Virtual.","DOI":"10.1109\/ICCV48922.2021.01512"},{"key":"ref_33","unstructured":"Liang, Y., Bai, Y., Zhang, W., Qian, X., Zhu, L., and Mei, T. (November, January 27). Vrr-vg: Refocusing visually-relevant relationships. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Khandelwal, S., Suhail, M., and Sigal, L. (2021, January 11\u201317). Segmentation-grounded Scene Graph Generation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Virtual.","DOI":"10.1109\/ICCV48922.2021.01558"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Bagautdinov, T., Alahi, A., Fleuret, F., Fua, P., and Savarese, S. (2017, January 21\u201326). Social scene understanding: End-to-end multi-person action localization and collective activity recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.365"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Deng, Z., Vahdat, A., Hu, H., and Mori, G. (2016, January 26\u201327). Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.516"},{"key":"ref_37","unstructured":"Ibrahim, M.S., Muralidharan, S., Deng, Z., Vahdat, A., and Mori, G. (July, January 26). 
A hierarchical deep temporal model for group activity recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Ibrahim, M.S., and Mori, G. (2018, January 8\u201314). Hierarchical relational networks for group activity recognition and retrieval. Proceedings of the IEEE\/CVF Conference on Computer Vision, Munich, Germany.","DOI":"10.1007\/978-3-030-01219-9_44"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Li, X., and Choo Chuah, M. (2017, January 22\u201329). Sbgar: Semantics based group activity recognition. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.313"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Qi, M., Qin, J., Li, A., Wang, Y., Luo, J., and Van Gool, L. (2018, January 8\u201314). Stagnet: An attentive semantic rnn for group activity recognition. Proceedings of the European Conference on Computer Vision, Munich, Germany.","DOI":"10.1007\/978-3-030-01249-6_7"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"1110","DOI":"10.1109\/TPAMI.2019.2942030","article-title":"Hierarchical long short-term concurrent memory for human interaction recognition","volume":"43","author":"Shu","year":"2019","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Wang, M., Ni, B., and Yang, X. (2017, January 21\u201326). Recurrent modeling of interaction context for collective activity recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.783"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Yan, R., Tang, J., Shu, X., Li, Z., and Tian, Q. (2018, January 22\u201326). Participation-contributed temporal dynamic model for group activity recognition. 
Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea.","DOI":"10.1145\/3240508.3240572"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Chappa, N.V., Nguyen, P., Nelson, A.H., Seo, H.S., Li, X., Dobbs, P.D., and Luu, K. (2023). SoGAR: Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition. arXiv.","DOI":"10.2139\/ssrn.4504147"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Chappa, N.V., Nguyen, P., Nelson, A.H., Seo, H.S., Li, X., Dobbs, P.D., and Luu, K. (2023, January 18\u201322). Spartan: Self-supervised spatiotemporal transformers approach to group activity recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPRW59228.2023.00544"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Gavrilyuk, K., Sanford, R., Javan, M., and Snoek, C.G. (2020, January 13\u201319). Actor-transformers for group activity recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00092"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Han, M., Zhang, D.J., Wang, Y., Yan, R., Yao, L., Chang, X., and Qiao, Y. (2022, January 19\u201324). Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA.","DOI":"10.1109\/CVPR52688.2022.00300"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Tamura, M., Vishwakarma, R., and Vennelakanti, R. (2022, January 23\u201327). Hunting Group Clues with Transformers for Social Group Activity Recognition. Proceedings of the Computer Vision\u2013ECCV 2022: 17th European Conference, Tel Aviv, Israel. 
Proceedings, Part IV.","DOI":"10.1007\/978-3-031-19772-7_2"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D., Bernstein, M., and Fei-Fei, L. (2015, January 7\u201312). Image retrieval using scene graphs. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298990"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Dai, B., Zhang, Y., and Lin, D. (2017, January 21\u201326). Detecting visual relationships with deep relational networks. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.352"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Zhang, J., Elhoseiny, M., Cohen, S., Chang, W., and Elgammal, A. (2017, January 21\u201326). Relationship proposal networks. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.555"},{"key":"ref_52","unstructured":"Kolesnikov, A., Kuznetsova, A., Lampert, C., and Ferrari, V. (November, January 27). Detecting visual relationships using box attention. Proceedings of the International Conference on Computer Vision Workshops, Seoul, Republic of Korea."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Qi, M., Li, W., Yang, Z., Wang, Y., and Luo, J. (2019, January 16\u201320). Attentive relational networks for mapping images to scene graphs. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00408"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Xu, D., Zhu, Y., Choy, C.B., and Fei-Fei, L. (2017, January 21\u201326). Scene graph generation by iterative message passing. 
Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.330"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Zellers, R., Yatskar, M., Thomson, S., and Choi, Y. (2018, January 18\u201322). Neural motifs: Scene graph parsing with global context. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00611"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Tang, K., Zhang, H., Wu, B., Luo, W., and Liu, W. (2019, January 16\u201320). Learning to Compose Dynamic Tree Structures for Visual Contexts. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00678"},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Yang, J., Lu, J., Lee, S., Batra, D., and Parikh, D. (2018, January 8\u201314). Graph r-cnn for scene graph generation. Proceedings of the European Conference on Computer Vision (ECCV 2018), Munich, Germany.","DOI":"10.1007\/978-3-030-01246-5_41"},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Lin, X., Ding, C., Zeng, J., and Tao, D. (2020, January 8\u201314). Gps-net: Graph property sensing network for scene graph generation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00380"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Chen, T., Yu, W., Chen, R., and Lin, L. (2019, January 16\u201320). Knowledge-embedded routing network for scene graph generation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00632"},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Li, Y., Ouyang, W., Zhou, B., Shi, J., Zhang, C., and Wang, X. (2018, January 8\u201314). 
Factorizable net: An efficient subgraph-based framework for scene graph generation. Proceedings of the European Conference on Computer Vision (ECCV 2018), Munich, Germany.","DOI":"10.1007\/978-3-030-01246-5_21"},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Suhail, M., Mittal, A., Siddiquie, B., Broaddus, C., Eledath, J., Medioni, G., and Sigal, L. (2021, January 19\u201325). Energy-Based Learning for Scene Graph Generation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Virtual.","DOI":"10.1109\/CVPR46437.2021.01372"},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Gu, J., Zhao, H., Lin, Z., Li, S., Cai, J., and Ling, M. (2019, January 16\u201320). Scene graph generation with external knowledge and image reconstruction. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00207"},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Zareian, A., Karaman, S., and Chang, S.F. (2020, August 23\u201328). Bridging knowledge graphs to generate scene graphs. Proceedings of the ECCV: 16th European Conference, Glasgow, UK.","DOI":"10.1007\/978-3-030-58592-1_36"},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Zareian, A., Wang, Z., You, H., and Chang, S. (2020, January 23\u201328). Learning Visual Commonsense for Robust Scene Graph Generation. Proceedings of the ECCV: 16th European Conference, Glasgow, UK.","DOI":"10.1007\/978-3-030-58592-1_38"},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Lu, C., Krishna, R., Bernstein, M., and Fei-Fei, L. (2016, January 11\u201314). Visual relationship detection with language priors. 
Proceedings of the European Conference on Computer Vision (ECCV): 14th European Conference, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46448-0_51"},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"Zhong, Y., Shi, J., Yang, J., Xu, C., and Li, Y. (2021, January 11\u201317). Learning to generate scene graph from natural language supervision. Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV 2021), Virtual.","DOI":"10.1109\/ICCV48922.2021.00184"},{"key":"ref_67","doi-asserted-by":"crossref","unstructured":"Ye, K., and Kovashka, A. (2021, January 19\u201325). Linguistic Structures as Weak Supervision for Visual Scene Graph Generation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Virtual.","DOI":"10.1109\/CVPR46437.2021.00819"},{"key":"ref_68","unstructured":"Nguyen, P., Quach, K.G., Kitani, K., and Luu, K. (2023, January 10\u201316). Type-to-track: Retrieve any object via prompt-based tracking. Proceedings of the NeurIPS 2023: 37th Annual Conference on Neural Information Processing Systems, New Orleans, LA, USA."},{"key":"ref_69","doi-asserted-by":"crossref","unstructured":"Nguyen, P., Truong, T.D., Huang, M., Liang, Y., Le, N., and Luu, K. (2022, January 16\u201319). Self-supervised domain adaptation in crowd counting. Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France.","DOI":"10.1109\/ICIP46576.2022.9897440"},{"key":"ref_70","unstructured":"Nguyen, T.T., Nguyen, P., and Luu, K. (2023). HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding. 
arXiv."},{"key":"ref_71","doi-asserted-by":"crossref","first-page":"108646","DOI":"10.1016\/j.patcog.2022.108646","article-title":"Non-volume preserving-based fusion to group-level emotion recognition on crowd videos","volume":"128","author":"Quach","year":"2022","journal-title":"Pattern Recognit."},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Shang, X., Ren, T., Guo, J., Zhang, H., and Chua, T.S. (2017, January 23\u201327). Video visual relation detection. Proceedings of the 25th ACM international conference on Multimedia, Mountain View, CA, USA.","DOI":"10.1145\/3123266.3123380"},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Teng, Y., Wang, L., Li, Z., and Wu, G. (2021, January 11\u201317). Target adaptive context aggregation for video scene graph generation. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Virtual.","DOI":"10.1109\/ICCV48922.2021.01343"},{"key":"ref_74","unstructured":"Dong, L., Gao, G., Zhang, X., Chen, L., and Wen, Y. (2019). Baconian: A Unified Open-source Framework for Model-Based Reinforcement Learning. arXiv."},{"key":"ref_75","unstructured":"Li, X., Guo, D., Liu, H., and Sun, F. (2022, January 8\u201311). Embodied semantic scene graph generation. Proceedings of the Conference on Robot Learning, London, UK."},{"key":"ref_76","doi-asserted-by":"crossref","unstructured":"Yang, J., Peng, W., Li, X., Guo, Z., Chen, L., Li, B., Ma, Z., Zhou, K., Zhang, W., and Loy, C.C. (2023, January 18\u201322). Panoptic video scene graph generation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Recognition, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.01791"},{"key":"ref_77","doi-asserted-by":"crossref","unstructured":"Caba Heilbron, F., Escorcia, V., Ghanem, B., and Carlos Niebles, J. (2015, January 7\u201312). Activitynet: A large-scale video benchmark for human activity understanding. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"ref_78","unstructured":"Xu, J., Mei, T., Yao, T., and Rui, Y. (July, January 26). Msr-vtt: A large video description dataset for bridging video and language. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"Pan, Y., Yao, T., Li, H., and Mei, T. (2017, January 21\u201326). Video captioning with transferred semantic attributes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.111"},{"key":"ref_80","doi-asserted-by":"crossref","unstructured":"Yang, J., Ang, Y.Z., Guo, Z., Zhou, K., Zhang, W., and Liu, Z. (2022, January 23\u201327). Panoptic scene graph generation. Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel.","DOI":"10.1007\/978-3-031-19812-0_11"},{"key":"ref_81","first-page":"6748","article-title":"Jrdb: A dataset and benchmark of egocentric robot visual perception of humans in built environments","volume":"45","author":"Patel","year":"2021","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_82","unstructured":"Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., and Xing, E. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv."},{"key":"ref_83","unstructured":"Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozi\u00e8re, B., Goyal, N., Hambro, E., and Azhar, F. (2023). Llama: Open and efficient foundation language models. 
arXiv."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/11\/3372\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T14:48:02Z","timestamp":1760107682000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/24\/11\/3372"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,24]]},"references-count":83,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2024,6]]}},"alternative-id":["s24113372"],"URL":"https:\/\/doi.org\/10.3390\/s24113372","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2024,5,24]]}}}