{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T17:07:50Z","timestamp":1774631270650,"version":"3.50.1"},"reference-count":54,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2024,9,30]],"date-time":"2024-09-30T00:00:00Z","timestamp":1727654400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"funder":[{"name":"NSF","award":["#2213701, #2217003, #2133267, #2122320, #2324864, #2328972"],"award-info":[{"award-number":["#2213701, #2217003, #2133267, #2122320, #2324864, #2328972"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2024,9,30]]},"abstract":"<jats:p>\n            Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD\/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic with AI Engine processors optimized for AI\/ML. An array of 400 AI Engine processors executing at 1 GHz can provide up to 6.4 TFLOPS performance for 32-bit floating-point (FP32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. We observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises:\n            <jats:italic>How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes?<\/jats:italic>\n          <\/jats:p>\n          <jats:p>\n            We identify the biggest system throughput bottleneck resulting from the mismatch between the massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose\n            <jats:italic>multiple diverse MM accelerator architectures<\/jats:italic>\n            working concurrently on different layers within one application. CHARM includes analytical models that guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework on four different deep learning applications in FP32, INT16, and INT8 data types, including BERT, ViT, NCF, and MLP, on the AMD\/Xilinx Versal ACAP VCK190 evaluation board. 
Our experiments show that we achieve 1.46 TFLOPS, 1.61 TFLOPS, 1.74 TFLOPS, and 2.94 TFLOPS inference throughput for BERT, ViT, NCF, and MLP in FP32 data type, respectively, which obtain 5.29\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            , 32.51\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            , 1.00\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            , and 1.00\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            throughput gains compared to one monolithic accelerator. CHARM achieves the maximum throughput of 1.91 TOPS, 1.18 TOPS, 4.06 TOPS, and 5.81 TOPS in the INT16 data type for the four applications. The maximum throughput achieved by CHARM in the INT8 data type is 3.65 TOPS, 1.28 TOPS, 10.19 TOPS, and 21.58 TOPS, respectively. We have open-sourced our tools, including detailed step-by-step guides to reproduce all the results presented in this article and to enable other users to learn and leverage CHARM framework and tools in their end-to-end systems:\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"url\" xlink:href=\"https:\/\/github.com\/arc-research-lab\/CHARM\">https:\/\/github.com\/arc-research-lab\/CHARM<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3686163","type":"journal-article","created":{"date-parts":[[2024,8,5]],"date-time":"2024-08-05T15:51:50Z","timestamp":1722873110000},"page":"1-31","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3659-339X","authenticated-orcid":false,"given":"Jinming","family":"Zhuang","sequence":"first","affiliation":[{"name":"University of Pittsburgh, Pittsburgh, PA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0751-8227","authenticated-orcid":false,"given":"Jason","family":"Lau","sequence":"additional","affiliation":[{"name":"University of California, Los Angeles, Los Angeles, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6646-8146","authenticated-orcid":false,"given":"Hanchen","family":"Ye","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Urbana, IL, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7655-4080","authenticated-orcid":false,"given":"Zhuoping","family":"Yang","sequence":"additional","affiliation":[{"name":"University of Pittsburgh, Pittsburgh, PA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-3429-4692","authenticated-orcid":false,"given":"Shixin","family":"Ji","sequence":"additional","affiliation":[{"name":"University of Pittsburgh, Pittsburgh, PA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4918-478X","authenticated-orcid":false,"given":"Jack","family":"Lo","sequence":"additional","affiliation":[{"name":"Advanced Micro 
Devices Inc, San Jose, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6668-4562","authenticated-orcid":false,"given":"Kristof","family":"Denolf","sequence":"additional","affiliation":[{"name":"Advanced Micro Devices Inc, San Jose, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2956-8428","authenticated-orcid":false,"given":"Stephen","family":"Neuendorffer","sequence":"additional","affiliation":[{"name":"Advanced Micro Devices Inc, San Jose, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7498-0206","authenticated-orcid":false,"given":"Alex","family":"Jones","sequence":"additional","affiliation":[{"name":"University of Pittsburgh, Pittsburgh, PA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4029-4034","authenticated-orcid":false,"given":"Jingtong","family":"Hu","sequence":"additional","affiliation":[{"name":"University of Pittsburgh, Pittsburgh, PA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6788-9823","authenticated-orcid":false,"given":"Yiyu","family":"Shi","sequence":"additional","affiliation":[{"name":"University of Notre Dame, Notre Dame, IN, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3016-0270","authenticated-orcid":false,"given":"Deming","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Urbana, IL, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2887-6963","authenticated-orcid":false,"given":"Jason","family":"Cong","sequence":"additional","affiliation":[{"name":"University of California, Los Angeles, Los Angeles, CA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0493-1844","authenticated-orcid":false,"given":"Peipei","family":"Zhou","sequence":"additional","affiliation":[{"name":"University of Pittsburgh, Pittsburgh, PA, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,9,30]]},"reference":[{"key":"e_1_3_1_2_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 30."},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3038912.3052569"},{"key":"e_1_3_1_4_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_1_5_2","unstructured":"Yu Emma Wang Gu-Yeon Wei and David Brooks. 2019. Benchmarking TPU GPU and CPU platforms for deep learning. arXiv:1907.10701. Retrieved from https:\/\/arxiv.org\/abs\/1907.10701"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-16-7487-7_7"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3110993"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3307650.3322231"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.Companion.2012.351"},{"key":"e_1_3_1_11_2","unstructured":"AMD\/Xilinx. n.d. Versal Adaptive Compute Acceleration Platform."},{"key":"e_1_3_1_12_2","unstructured":"AMD. 2022. 
IP Overlays of Deep learning Processing Unit 2022."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001177"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/JETCAS.2019.2910232"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750389"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2019.00035"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICFPT51103.2020.00011"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00012"},{"key":"e_1_3_1_19_2","first-page":"1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Matteis Tiziano De","year":"2020","unstructured":"Tiziano De Matteis, Johannes de Fine Licht, and Torsten Hoefler. 2020. FBLAS: Streaming linear algebra on FPGA. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1\u201313."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373087.3375296"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689060"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3174243.3174258"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439292"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3490422.3502357"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3489517.3530420"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2018.00028"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2016.50"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439304"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240801"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3400302.3415609"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3037697.3037702"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304014"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00016"},{"key":"e_1_3_1_34_2","unstructured":"Nvidia. Website. Retrieved from http:\/\/nvdla.org\/"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2014.12"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/2333660.2333747"},{"key":"e_1_3_1_37_2","unstructured":"AMD\/Xilinx. 2022. Versal AI Core Series VCK190 Evaluation Kit."},{"key":"e_1_3_1_38_2","unstructured":"AMD\/Xilinx. 2022. AI Engine Technology."},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2011.2110592"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3530775"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/SASP.2009.5226333"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2011.29"},{"key":"e_1_3_1_43_2","first-page":"1","article-title":"High-level synthesis: Productivity, performance, and software constraints","author":"Liang Yun","year":"2012","unstructured":"Yun Liang, Kyle Rupnow, Yinan Li, Dongbo Min, Minh N Do, and Deming Chen. 2012. High-level synthesis: Productivity, performance, and software constraints. 
Journal of Electrical and Computer Engineering 2012 (2012), Article 1, 1 page.","journal-title":"Journal of Electrical and Computer Engineering"},{"key":"e_1_3_1_44_2","first-page":"204","volume-title":"Proceedings of the IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","author":"Chi Yuze","year":"2021","unstructured":"Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, and Jason Cong. 2021. Extending high-level synthesis for task-parallel programs. In Proceedings of the IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 204\u2013213."},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240838"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/DAC56929.2023.10248016"},{"key":"e_1_3_1_47_2","unstructured":"AMD\/Xilinx. n.d. Adaptive Data Flow API."},{"key":"e_1_3_1_48_2","unstructured":"AMD\/Xilinx. n.d. Board evaluation and management Tool."},{"key":"e_1_3_1_49_2","unstructured":"AMD\/Xilinx. n.d. Xilinx Board Utility (Xbutil)."},{"key":"e_1_3_1_50_2","unstructured":"AMD\/Xilinx. n.d. AI Engine API and Intrinsics User Guide."},{"key":"e_1_3_1_51_2","unstructured":"AMD\/Xilinx. n.d. Versal\u2122 ACAP AI Engine System C simulator."},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL57034.2022.00040"},{"key":"e_1_3_1_53_2","unstructured":"Gurobi. Website. Retrieved from https:\/\/www.gurobi.com\/"},{"key":"e_1_3_1_54_2","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Proceedings of the Advances in Neural Information Processing Systems."},{"key":"e_1_3_1_55_2","unstructured":"Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv:2307.08691. 
Retrieved from https:\/\/arxiv.org\/abs\/2307.08691"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3686163","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3686163","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:17:50Z","timestamp":1750295870000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3686163"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,30]]},"references-count":54,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,9,30]]}},"alternative-id":["10.1145\/3686163"],"URL":"https:\/\/doi.org\/10.1145\/3686163","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,9,30]]},"assertion":[{"value":"2023-08-31","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-06-16","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-09-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
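The throughput figures quoted in the abstract can be cross-checked with quick arithmetic. The sketch below is illustrative only, not code from the CHARM repository: it derives the per-core compute rate implied by the stated 6.4 TFLOPS peak for 400 AI Engines at 1 GHz (16 FLOPs/cycle per core, i.e., 8 FP32 MACs per cycle, an inference from those figures rather than a quoted specification), then uses the reported FP32 throughputs and gains to back out what one monolithic accelerator would achieve.

```python
# Illustrative back-of-the-envelope check of the throughput figures quoted in
# the CHARM 2.0 abstract. CORES, FREQ_HZ, and the 6.4 TFLOPS peak come from the
# abstract; FLOPS_PER_CYCLE is derived from them, not quoted from a spec sheet.
CORES = 400            # AI Engine processors in the array
FREQ_HZ = 1.0e9        # 1 GHz clock
FLOPS_PER_CYCLE = 16   # 6.4 TFLOPS / (400 cores * 1 GHz) = 16, i.e., 8 FP32 MACs/cycle

peak_tflops = CORES * FREQ_HZ * FLOPS_PER_CYCLE / 1e12
assert abs(peak_tflops - 6.4) < 1e-9  # matches the abstract's stated FP32 peak

# Reported FP32 inference throughput (TFLOPS) and gain over one monolithic
# accelerator, as listed in the abstract.
reported = {
    "BERT": (1.46, 5.29),
    "ViT":  (1.61, 32.51),
    "NCF":  (1.74, 1.00),
    "MLP":  (2.94, 1.00),
}

for app, (charm_tflops, gain) in reported.items():
    mono_tflops = charm_tflops / gain  # implied monolithic-accelerator throughput
    print(f"{app:4s}: CHARM {charm_tflops:.2f} TFLOPS "
          f"({100 * charm_tflops / peak_tflops:4.1f}% of peak) | "
          f"monolithic ~{mono_tflops:.2f} TFLOPS "
          f"({100 * mono_tflops / peak_tflops:4.1f}% of peak)")
```

For BERT the implied monolithic throughput is about 0.28 TFLOPS, roughly 4.3% of peak, consistent with the abstract's observation that small BERT MM layers reach less than 5% of theoretical peak on a monolithic design; for ViT it is about 0.05 TFLOPS, under 1% of peak, which is exactly the mismatch between massive compute resources and small MM layers that composing multiple differently sized accelerators is meant to resolve.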