{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T02:49:30Z","timestamp":1760150970357,"version":"build-2065373602"},"reference-count":28,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2022,2,10]],"date-time":"2022-02-10T00:00:00Z","timestamp":1644451200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>As an important field of computer vision, object detection has been studied extensively in recent years. However, existing object detection methods merely utilize the visual information of the image and fail to mine the high-level semantic information of the object, which leads to great limitations. To take full advantage of multi-source information, a knowledge update-based multimodal object recognition model is proposed in this paper. Specifically, our method initially uses Faster R-CNN to regionalize the image, then applies a transformer-based multimodal encoder to encode visual region features (region-based image features) and textual features (semantic relationships between words) corresponding to pictures. After that, a graph convolutional network (GCN) inference module is introduced to establish a relational network in which the points denote visual and textual region features, and the edges represent their relationships. In addition, based on an external knowledge base, our method further enhances the region-based relationship expression capability through a knowledge update module. In summary, the proposed algorithm not only learns the accurate relationship between objects in different regions of the image, but also benefits from the knowledge update through an external relational database. 
Experimental results verify the effectiveness of the proposed knowledge update module and the independent reasoning ability of our model.<\/jats:p>","DOI":"10.3390\/s22041338","type":"journal-article","created":{"date-parts":[[2022,2,10]],"date-time":"2022-02-10T06:02:15Z","timestamp":1644472935000},"page":"1338","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Cross-Modal Object Detection Based on a Knowledge Update"],"prefix":"10.3390","volume":"22","author":[{"given":"Yueqing","family":"Gao","sequence":"first","affiliation":[{"name":"School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing 100044, China"},{"name":"The 54th Research Institute of China Electronics Technology Group Corporation, Shijiazhuang 050081, China"}]},{"given":"Huachun","family":"Zhou","sequence":"additional","affiliation":[{"name":"School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing 100044, China"}]},{"given":"Lulu","family":"Chen","sequence":"additional","affiliation":[{"name":"Center for Future Multimedia and School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China"},{"name":"The 54th Research Institute of China Electronics Technology Group Corporation, Shijiazhuang 050081, China"}]},{"given":"Yuting","family":"Shen","sequence":"additional","affiliation":[{"name":"National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China"},{"name":"University of Chinese Academy of Sciences, Beijing 100039, China"},{"name":"The 54th Research Institute of China Electronics Technology Group Corporation, Shijiazhuang 050081, China"}]},{"given":"Ce","family":"Guo","sequence":"additional","affiliation":[{"name":"The 54th Research Institute of China Electronics Technology Group Corporation, Shijiazhuang 050081, China"}]},{"given":"Xinyu","family":"Zhang","sequence":"additional","affiliation":[{"name":"Center for Future Multimedia and School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China"}]}],"member":"1968","published-online":{"date-parts":[[2022,2,10]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1109\/MSP.2017.2738401","article-title":"Deep multi-modal learning: A survey on recent advances and trends","volume":"34","author":"Ramachandram","year":"2017","journal-title":"IEEE Signal Processing Mag."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1007\/s13735-019-00187-6","article-title":"Characterization and classification of semantic image-text relations","volume":"9","author":"Otto","year":"2020","journal-title":"Int. J. Multimed. Inf. Retr."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1093\/biomet\/28.3-4.321","article-title":"Relations between two sets of variates","volume":"28","author":"Harold","year":"1936","journal-title":"Biometrika"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Li, D., Dimitrova, N., Li, M., and Sethi, I.K. (2003, January 2\u20138). Multimedia content processing through cross-modal association. Proceedings of the Eleventh ACM International Conference on Multimedia, Berkeley, CA, USA.","DOI":"10.1145\/957013.957143"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Sha, A., Wang, B., Wu, X., Zhang, L., Hu, B., and Zhang, J.Q. (August, January 28). 
Semi-Supervised Classification for Hyperspectral Images Using Edge-Conditioned Graph Convolutional Networks. Proceedings of the IGARSS 2019\u20142019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan.","DOI":"10.1109\/IGARSS.2019.8898688"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks","volume":"39","author":"Ren","year":"2017","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_7","unstructured":"Dai, J., Li, Y., He, K., and Sun, J. (2016, January 5\u201310). R-FCN: Object Detection via Region-based Fully Convolutional Networks. Proceedings of the Neural Information Processing Systems, Barcelona, Spain."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Redmon, J., and Farhadi, A. (2017, January 21\u201326). YOLO9000: Better, Faster, Stronger. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.690"},{"key":"ref_9","unstructured":"Wang, K., Yin, Q., Wang, W., Wu, S., and Wang, L. (2016). A comprehensive survey on cross-modal retrieval. arXiv."},{"key":"ref_10","unstructured":"Mitchell, M., Dodge, J., Goyal, A., Yamaguchi, K., Stratos, K., Han, X., Mensch, A., Berg, A., Berg, T., and Daum\u00e9, H. (2012, January 23\u201327). Midge: Generating image descriptions from computer vision detections. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. (2010). Every picture tells a story: Generating sentences from images. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-642-15561-1_2"},{"key":"ref_12","unstructured":"Li, S., Kulkarni, G., Berg, T., Berg, A., and Choi, Y. (2011, January 23\u201324). Composing simple-image descriptions using web-scale ngrams. Proceedings of the Fifteenth Conference on Computational Natural Language Learning, Portland, OR, USA."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"853","DOI":"10.1613\/jair.3994","article-title":"Framingimage description as a ranking task: Data, models and evaluation metrics","volume":"47","author":"Hodosh","year":"2013","journal-title":"J. Artif. Intell. Res."},{"key":"ref_14","unstructured":"Mao, J., Xu, W., Yang, Y., Wang, J., and Yuille, A.L. (2014). Explain images with multimodal recurrent neural networks. arXiv."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7\u201312). Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern-Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Wu, Y., Wang, S., Song, G., and Huang, Q. (2019, January 21\u201325). Learning fragment self-attention embeddings for image-text matching. 
Proceedings of the 27th ACM International Conference on Multimedia, Nice, France.","DOI":"10.1145\/3343031.3350940"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Wang, Y., Yang, H., Qian, X., Ma, L., Lu, J., Li, B., and Fan, X. (2019). Position focused attention network for image-text matching. arXiv.","DOI":"10.24963\/ijcai.2019\/526"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Li, K., Zhang, Y., Li, K., Li, Y., and Fu, Y. (2019, January 27\u201328). Visual semantic reasoning for image-text matching. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00475"},{"key":"ref_20","first-page":"5998","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Processing Syst."},{"key":"ref_21","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Tandon, N., De Melo, G., Suchanek, F., and Weikum, G. (2014, January 24\u201328). Webchild: Harvesting and organizing commonsense knowledge from the web. Proceedings of the 7th ACM International Conference on Web Search and Data Mining, New York, NY, USA.","DOI":"10.1145\/2556195.2556245"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. (2007, January 11\u201315). DBpedia: A nucleus for a web of open data. Proceedings of the 6th International the Semantic Web and 2nd Asian Conference on Asian Semantic Web Conference, Busan, Korea.","DOI":"10.1007\/978-3-540-76298-0_52"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Speer, R., Chin, J., and Havasi, C. (2017, January 4\u20139). Concept Net 5.5: An open multilingual graph of general knowledge. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA.","DOI":"10.1609\/aaai.v31i1.11164"},{"key":"ref_25","unstructured":"Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. (2014). Microsoft coco: Common objects in context. European Conference on Computer Vision, Springer.","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"ref_27","unstructured":"(2022, February 01). Available online: https:\/\/www.sohu.com\/a\/420471132_100062867."},{"key":"ref_28","unstructured":"(2022, February 01). 
Available online: http:\/\/www.mianfeiwendang.com\/doc\/f4560357b98ac18802bcd855\/5."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/4\/1338\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T22:17:23Z","timestamp":1760134643000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1424-8220\/22\/4\/1338"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,2,10]]},"references-count":28,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2022,2]]}},"alternative-id":["s22041338"],"URL":"https:\/\/doi.org\/10.3390\/s22041338","relation":{},"ISSN":["1424-8220"],"issn-type":[{"type":"electronic","value":"1424-8220"}],"subject":[],"published":{"date-parts":[[2022,2,10]]}}}
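The abstract above describes a pipeline of Faster R-CNN region extraction, transformer-based multimodal encoding, GCN relational reasoning over a region/text graph, and a knowledge update from an external base (e.g., ConceptNet, ref_24). As a rough illustration of the GCN reasoning step only, here is a minimal PyTorch sketch; all module names, feature dimensions, and the fully connected adjacency are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): region features and token
# features become nodes of one graph, and a single GCN layer propagates
# relational information between them.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(norm(A) @ H @ W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Row-normalize the adjacency so each node aggregates a weighted
        # mean of its neighbors' features.
        a = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return F.relu(self.linear(a @ h))


# Hypothetical inputs: 36 Faster R-CNN region features and 20 token
# features, assumed already projected into a shared 256-d space by an
# upstream multimodal encoder.
regions = torch.randn(36, 256)
tokens = torch.randn(20, 256)
nodes = torch.cat([regions, tokens], dim=0)

# Fully connected edges here for simplicity; the paper's knowledge update
# module would instead reweight edges using relations retrieved from an
# external knowledge base between detected object labels.
adj = torch.ones(nodes.size(0), nodes.size(0))

gcn = GCNLayer(256, 256)
updated = gcn(nodes, adj)  # relational reasoning over visual + textual nodes
print(updated.shape)       # torch.Size([56, 256])
```

In this sketch the adjacency matrix is the natural place to inject external knowledge: replacing the uniform weights with scores derived from knowledge-base relations would bias message passing toward semantically related region/text pairs, which is the role the abstract assigns to the knowledge update module.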