{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:22:25Z","timestamp":1750220545315,"version":"3.41.0"},"reference-count":45,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2021,12,1]],"date-time":"2021-12-01T00:00:00Z","timestamp":1638316800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2022,6,30]]},"abstract":"<jats:p>Applications such as large-scale sparse linear algebra and graph analytics are challenging to accelerate on FPGAs due to the short irregular memory accesses, resulting in low cache hit rates. Nonblocking caches reduce the bandwidth required by misses by requesting each cache line only once, even when there are multiple misses corresponding to it. However, such reuse mechanism is traditionally implemented using an associative lookup. This limits the number of misses that are considered for reuse to a few tens, at most. In this article, we present an efficient pipeline that can process and store thousands of outstanding misses in cuckoo hash tables in on-chip SRAM with minimal stalls. This brings the same bandwidth advantage as a larger cache for a fraction of the area budget, because outstanding misses do not need a data array, which can significantly speed up irregular memory-bound latency-insensitive applications. In addition, we extend nonblocking caches to generate variable-length bursts to memory, which increases the bandwidth delivered by DRAMs and their controllers. The resulting miss-optimized memory system provides up to 25% speedup with 24\u00d7 area reduction on 15 large sparse matrix-vector multiplication benchmarks evaluated on an embedded and a datacenter FPGA system.<\/jats:p>","DOI":"10.1145\/3466823","type":"journal-article","created":{"date-parts":[[2021,12,1]],"date-time":"2021-12-01T13:42:41Z","timestamp":1638366161000},"page":"1-33","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Request, Coalesce, Serve, and Forget: Miss-Optimized Memory Systems for Bandwidth-Bound Cache-Unfriendly Applications on FPGAs"],"prefix":"10.1145","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8050-0042","authenticated-orcid":false,"given":"Mikhail","family":"Asiatici","sequence":"first","affiliation":[{"name":"School of Computer and Communication Sciences Ecole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL), Lausanne, Switzerland"}]},{"given":"Paolo","family":"Ienne","sequence":"additional","affiliation":[{"name":"School of Computer and Communication Sciences Ecole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL), Lausanne, Switzerland"}]}],"member":"320","published-online":{"date-parts":[[2021,12]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/1950413.1950421"},{"key":"e_1_3_2_3_2","volume-title":"AWS Shell Interface Specification","author":"Inc. Amazon.com,","year":"2020","unstructured":"Amazon.com, Inc.2020. AWS Shell Interface Specification. Retrieved from https:\/\/github.com\/aws\/aws-fpga\/blob\/master\/hdk\/docs\/AWS_Shell_Interface_Specification.md."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.69"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.5555\/1987535.1987544"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/MDAT.2017.2750900"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.5555\/520549.822749"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/782814.782836"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.5555\/3195638.3195694"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3289602.3293899"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3061639.3062208"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/2049662.2049663"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/192007.192029"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1007\/s00224-004-1195-x"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2014.6927454"},{"volume-title":"Hybrid Memory Cube Controller IP Core User Guide","year":"2016","key":"e_1_3_2_16_2","unstructured":"Intel Inc. 2016. Hybrid Memory Cube Controller IP Core User Guide. Intel Inc."},{"volume-title":"Acceleration Stack for Intel Xeon CPU with FPGAs Core Cache Interface (CCI-P) Reference Manual","year":"2018","key":"e_1_3_2_17_2","unstructured":"Intel Inc. 2018. Acceleration Stack for Intel Xeon CPU with FPGAs Core Cache Interface (CCI-P) Reference Manual. Intel Inc."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/1394608.1382172"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.3390\/electronics8050584"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.5555\/1543376"},{"volume-title":"DDR3 SDRAM Standard JESD79-3F","year":"2012","key":"e_1_3_2_21_2","unstructured":"JEDEC. 2012. DDR3 SDRAM Standard JESD79-3F. Retrieved from https:\/\/www.jedec.org\/standards-documents\/docs\/jesd-79-3d."},{"volume-title":"DDR4 SDRAM Standard JESD79-4B","year":"2017","key":"e_1_3_2_22_2","unstructured":"JEDEC. 2017. DDR4 SDRAM Standard JESD79-4B. Retrieved from https:\/\/www.jedec.org\/standards-documents\/docs\/jesd79-4a."},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/2989081.2989131"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.5555\/2039367"},{"key":"e_1_3_2_25_2","first-page":"751","volume-title":"Proceedings of the 45th Annual Allerton Conference on Communication, Control, and Computing","volume":"75","author":"Kirsch Adam","year":"2007","unstructured":"Adam Kirsch and Michael Mitzenmacher. 2007. Using a queue to de-amortize cuckoo hashing in hardware. In Proceedings of the 45th Annual Allerton Conference on Communication, Control, and Computing, Vol. 75. 751\u2013758."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830830"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.5555\/800052.801868"},{"key":"e_1_3_2_28_2","volume-title":"Performance Impacts of Non-blocking Caches in Out-of-order Processors","author":"Li Sheng","year":"2011","unstructured":"Sheng Li, Ke Chen, Jay B. Brockman, and Norman P. Jouppi. 2011. Performance Impacts of Non-blocking Caches in Out-of-order Processors. HPL Tech Report."},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/2593069.2593105"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.micpro.2016.12.007"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/237578.237594"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/1394608.1382128"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/342001.339668"},{"key":"e_1_3_2_34_2","article-title":"Method and apparatus for out of order memory scheduling","author":"Rotithor Hemant G.","year":"2006","unstructured":"Hemant G. Rotithor, Randy B. Osborne, and Nagi Aboulenein. 2006. Method and apparatus for out of order memory scheduling. US Patent US7127574.","journal-title":"US Patent US7127574"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2007.346206"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/1450135.1450150"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2006.44"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/2847255"},{"key":"e_1_3_2_39_2","volume-title":"Understanding Latency Hiding on GPUs","author":"Volkov Vasily","year":"2016","unstructured":"Vasily Volkov. 2016. Understanding Latency Hiding on GPUs. Ph.D. Dissertation. UC Berkeley."},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/2989081.2989128"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPT.2017.8280127"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689073"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.5555\/645728.667682"},{"key":"e_1_3_2_44_2","volume-title":"AXI Register Slice v2.1 (PG373)","author":"Inc. Xilinx","year":"2020","unstructured":"Xilinx Inc.2020. AXI Register Slice v2.1 (PG373)."},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/2847263.2847283"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3020078.3021734"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3466823","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3466823","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:28:09Z","timestamp":1750195689000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3466823"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,12]]},"references-count":45,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,6,30]]}},"alternative-id":["10.1145\/3466823"],"URL":"https:\/\/doi.org\/10.1145\/3466823","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"type":"print","value":"1936-7406"},{"type":"electronic","value":"1936-7414"}],"subject":[],"published":{"date-parts":[[2021,12]]},"assertion":[{"value":"2021-01-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-05-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-12-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}