{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,18]],"date-time":"2025-12-18T14:09:32Z","timestamp":1766066972620,"version":"3.41.2"},"reference-count":57,"publisher":"Wiley","issue":"3","license":[{"start":{"date-parts":[[2019,2,27]],"date-time":"2019-02-27T00:00:00Z","timestamp":1551225600000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/onlinelibrary.wiley.com\/termsAndConditions#vor"}],"content-domain":{"domain":["onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["Concurrency and Computation"],"published-print":{"date-parts":[[2020,2,10]]},"abstract":"<jats:title>Summary<\/jats:title><jats:p>This paper explores key differences of MPI match lists for several important United States Department of Energy\u00a0(DOE) applications and proxy applications. This understanding is critical in determining the most promising hardware matching design for any given high\u2010speed network. The results of MPI match list studies for the major open\u2010source MPI implementations, MPICH and Open MPI, are presented, and we modify an MPI simulator, LogGOPSim, to provide match list statistics. These results are discussed in the context of several different potential design approaches to MPI matching\u2013capable hardware. The data illustrate the requirements for different hardware designs in terms of performance and memory capacity. This paper's contributions are the collection and analysis of data to help inform hardware designers of common MPI requirements and highlight the difficulties in determining these requirements by only examining a single MPI implementation.<\/jats:p>","DOI":"10.1002\/cpe.5150","type":"journal-article","created":{"date-parts":[[2019,2,27]],"date-time":"2019-02-27T21:05:50Z","timestamp":1551301550000},"update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Hardware MPI message matching: Insights into MPI matching behavior to inform design"],"prefix":"10.1002","volume":"32","author":[{"given":"Kurt","family":"Ferreira","sequence":"first","affiliation":[{"name":"Center for Computing Research Sandia National Laboratories  Albuquerque New Mexico"}]},{"given":"Ryan E.","family":"Grant","sequence":"additional","affiliation":[{"name":"Center for Computing Research Sandia National Laboratories  Albuquerque New Mexico"}]},{"given":"Michael J.","family":"Levenhagen","sequence":"additional","affiliation":[{"name":"Center for Computing Research Sandia National Laboratories  Albuquerque New Mexico"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2232-3201","authenticated-orcid":false,"given":"Scott","family":"Levy","sequence":"additional","affiliation":[{"name":"Center for Computing Research Sandia National Laboratories  Albuquerque New Mexico"}]},{"given":"Taylor","family":"Groves","sequence":"additional","affiliation":[{"name":"National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory  Berkeley California"}]}],"member":"311","published-online":{"date-parts":[[2019,2,27]]},"reference":[{"key":"e_1_2_8_2_1","doi-asserted-by":"crossref","unstructured":"UnderwoodKD BrightwellR.The impact of MPI queue usage on message latency. Paper presented at: International Conference on Parallel Processing (ICPP);2004;Montreal Canada.","DOI":"10.1109\/ICPP.2004.1327915"},{"key":"e_1_2_8_3_1","unstructured":"BrightwellR GoudyS UnderwoodK.A preliminary analysis of the MPI queue characteristics of several applications. Paper presented at: 2005 International Conference on Parallel Processing (ICPP);2005;Oslo Norway."},{"key":"e_1_2_8_4_1","doi-asserted-by":"crossref","unstructured":"BrightwellR PedrettiK FerreiraK.Instrumentation and analysis of MPI queue times on the SeaStar high\u2010performance network. In: Proceedings of the 17th International Conference on Computer Communications and Networks (ICCCN);2008;St. Thomas VI.","DOI":"10.1109\/ICCCN.2008.ECP.116"},{"key":"e_1_2_8_5_1","unstructured":"KellerR GrahamRL.Characteristics of the unexpected message queue of MPI applications. Paper presented at: European MPI Users' Group Meeting;2010;Stuttgart Germany."},{"key":"e_1_2_8_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2013.07.003"},{"key":"e_1_2_8_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-41321-1_15"},{"key":"e_1_2_8_8_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342014552085"},{"key":"e_1_2_8_9_1","unstructured":"GeoffrayP.Myrinet express (MX): is your interconnect smart?In: Proceedings of the High Performance Computing and Grid in Asia Pacific Region Seventh International Conference (HPCASIA);2004;Tokyo Japan."},{"key":"e_1_2_8_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/40.988689"},{"key":"e_1_2_8_11_1","unstructured":"BrightwellR UnderwoodKD.An analysis of NIC resource usage for offloading MPI. Paper presented at: 18th International Parallel and Distributed Processing Symposium (IPDPS);2004;Santa Fe NM."},{"key":"e_1_2_8_12_1","unstructured":"UnderwoodKD HemmertKS RodriguesA MurphyR BrightwellR.A hardware acceleration unit for MPI queue processing. Paper presented at: 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS);2005;Denver CO."},{"key":"e_1_2_8_13_1","doi-asserted-by":"crossref","unstructured":"HemmertKS UnderwoodKD RodriguesA.An architecture to perform NIC based MPI matching. Paper presented at: 2007 IEEE International Conference on Cluster Computing;2007;Austin TX.","DOI":"10.1109\/CLUSTR.2007.4629234"},{"key":"e_1_2_8_14_1","unstructured":"Understanding MPI tag matching and rendezvous offloads (ConnectX\u20105).https:\/\/community.mellanox.com\/docs\/DOC-2583. Accessed July 25 2018."},{"key":"e_1_2_8_15_1","doi-asserted-by":"crossref","unstructured":"DerradjiS Palfer\u2010SollierT PanzieraJP PoudesA AtosFW.The BXI interconnect architecture. In: Proceedings of the 2015 IEEE 23rd Annual Symposium on High\u2010Performance Interconnects (HOTI);2015;Santa Clara CA.","DOI":"10.1109\/HOTI.2015.15"},{"key":"e_1_2_8_16_1","doi-asserted-by":"publisher","DOI":"10.2172\/1365498"},{"key":"e_1_2_8_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2006.65"},{"volume-title":"The Portals 3.3 Message Passing Interface Document Revision 2.1","year":"2006","author":"Riesen R","key":"e_1_2_8_18_1"},{"key":"e_1_2_8_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/40.342015"},{"key":"e_1_2_8_20_1","unstructured":"The Open MPI Development Team.Open MPI.https:\/\/www.open-mpi.org.2017. Accessed March 28 2017."},{"key":"e_1_2_8_21_1","unstructured":"Team MPICH Development.MPICH.https:\/\/www.mpich.org.2017. Accessed March 30 2017."},{"key":"e_1_2_8_22_1","doi-asserted-by":"crossref","unstructured":"HoeflerT SchneiderT LumsdaineA.LogGOPSim: simulating large\u2010scale applications in the LogGOPS model. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC);2010;Chicago IL.","DOI":"10.1145\/1851476.1851564"},{"key":"e_1_2_8_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/11752578_29"},{"key":"e_1_2_8_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/0167-8191(96)00024-5"},{"key":"e_1_2_8_25_1","unstructured":"Cray XC40 programming environment user's guide.https:\/\/pubs.cray.com\/pdf-attachments\/attachment?pubId=00463350-DA&attachmentId=pub_00463350-DA.pdf. Accessed July 25 2018"},{"volume-title":"Cray\u00ae XC\u2122 Series Network","year":"2012","author":"Alverson B","key":"e_1_2_8_26_1"},{"key":"e_1_2_8_27_1","unstructured":"Message Passing Interface Forum.MPI: a message\u2010passing interface standard version 3.1.2015.http:\/\/mpi-forum.org\/docs\/mpi-3.1\/mpi31-report.pdf"},{"key":"e_1_2_8_28_1","unstructured":"KeppitiyagamaC WagnerA.Asynchronous MPI messaging on Myrinet. In: Proceedings 15th International Parallel and Distributed Processing Symposium (IPDPS);2001;San Francisco CA."},{"key":"e_1_2_8_29_1","doi-asserted-by":"crossref","unstructured":"CullerD KarpR PattersonD et al.LogP: towards a realistic model of parallel computation. In: Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP);1993;San Diego CA.","DOI":"10.1145\/155332.155333"},{"key":"e_1_2_8_30_1","doi-asserted-by":"crossref","unstructured":"HoeflerT SchneiderT LumsdaineA.Characterizing the influence of system noise on large\u2010scale applications by simulation. In: Proceedings of the 2010 ACM\/IEEE International Conference for High Performance Computing Networking Storage and Analysis (SC);2010;New Orleans LA.","DOI":"10.1109\/SC.2010.12"},{"key":"e_1_2_8_31_1","doi-asserted-by":"crossref","unstructured":"LevyS ToppB FerreiraKB ArnoldD HoeflerT WidenerP.Using simulation to evaluate the performance of resilience strategies at scale. Paper presented at: International Workshop on Performance Modeling Benchmarking and Simulation of High Performance Computer Systems;2013;Denver CO.","DOI":"10.1007\/978-3-319-10214-6_5"},{"key":"e_1_2_8_32_1","doi-asserted-by":"crossref","unstructured":"LevyS FerreiraKB WidenerP BridgesPG MondragonOH.How I learned to stop worrying and love in Situ analytics: leveraging latent synchronization in MPI collective algorithms. In: Proceedings of the 23rd European MPI Users' Group Meeting ACM (EuroMPI);2016;Edinburgh UK.","DOI":"10.1145\/2966884.2966920"},{"key":"e_1_2_8_33_1","doi-asserted-by":"crossref","unstructured":"MondragonOH BridgesPG LevyS FerreiraKB WidenerP.Understanding performance interference in next\u2010generation HPC systems. In: Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis (SC);2016;Salt Lake City UT.","DOI":"10.1109\/SC.2016.32"},{"key":"e_1_2_8_34_1","unstructured":"BrightwellR SkjellumA.MPICH on the T3D: a case study of high performance message passing. In: Proceedings of the Second MPI Developers Conference (MPIDC);1996;Notre Dame IN."},{"key":"e_1_2_8_35_1","doi-asserted-by":"crossref","unstructured":"GhazimirsaeedSM GrantRE AfsahiA.A dedicated message matching mechanism for collective communications. In: Proceedings of the 47th International Conference on Parallel Processing Companion (ICPP);2018;Eugene OR.","DOI":"10.1145\/3229710.3229712"},{"key":"e_1_2_8_36_1","unstructured":"Cray XC40 specifications.https:\/\/www.cray.com\/sites\/default\/files\/resources\/cray_xc40_specifications.pdf. Accessed July 25 2018."},{"key":"e_1_2_8_37_1","doi-asserted-by":"publisher","DOI":"10.1006\/jcph.1995.1039"},{"key":"e_1_2_8_38_1","unstructured":"Sandia National Laboratories.LAMMPS molecular dynamics simulator.2013.http:\/\/lammps.sandia.gov"},{"key":"e_1_2_8_39_1","unstructured":"Lawrence Livermore National Laboratory.Co\u2010design at Lawrence Livermore National Laboratory: Livermore unstructured lagrangian explicit shock hydrodynamics (LULESH).2015.http:\/\/codesign.llnl.gov\/lulesh.php. Retrieved June 10 2015."},{"key":"e_1_2_8_40_1","unstructured":"Indiana University.HPCG benchmark.http:\/\/www.hpcg-benchmark.org. Retrieved September2017."},{"key":"e_1_2_8_41_1","unstructured":"Laboratories Sandia National Tennessee Knoxville University.MIMD lattice computation (MILC) collaboration.2017.http:\/\/physics.indiana.edu\/~sg\/milc.html. Retrieved September 2017."},{"key":"e_1_2_8_42_1","doi-asserted-by":"publisher","DOI":"10.1088\/1742-6596\/78\/1\/012001"},{"key":"e_1_2_8_43_1","unstructured":"Mellanox.Fabric collective accelerator. Retrieved September2007.http:\/\/www.mellanox.com\/related-docs\/prod_acceleration_software\/FCA.pdf"},{"key":"e_1_2_8_44_1","unstructured":"Mellanox.Scalable hierarchical aggregation protocol. Retrieved September2007.http:\/\/www.mellanox.com\/related-docs\/prod_acceleration_software\/Mellanox_SHARP_SW_API_Guide.pdf"},{"key":"e_1_2_8_45_1","unstructured":"Cisco.Cisco nexus 9000 series.2017.https:\/\/www.cisco.com\/c\/en\/us\/td\/docs\/switches\/datacenter\/nexus9000\/sw\/6-x\/security\/configuration\/guide\/b_Cisco_Nexus_9000_Series_NX-OS_Security_Configuration_Guide\/b_Cisco_Nexus_9000_Series_NX-OS_Security_Configuration_Guide_chapter_01010.pdf"},{"key":"e_1_2_8_46_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2017.08.003"},{"key":"e_1_2_8_47_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342014548772"},{"key":"e_1_2_8_48_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2018.05.005"},{"key":"e_1_2_8_49_1","doi-asserted-by":"crossref","unstructured":"SchonbeinW DosanjhMG GrantRE BridgesPG.Measuring multithreaded message matching misery. In: Proceedings of the International European Conference on Parallel and Distributed Computing;2018;Turin Italy.","DOI":"10.1007\/978-3-319-96983-1_34"},{"key":"e_1_2_8_50_1","doi-asserted-by":"crossref","unstructured":"RaffenettiK AmerA OdenL et al.Why is MPI so slow?: analyzing the fundamental limits in implementing MPI\u20103.1. In: Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis (SC);2017;Denver CO.","DOI":"10.1145\/3126908.3126963"},{"key":"e_1_2_8_51_1","doi-asserted-by":"crossref","unstructured":"DosanjhMG GhazimirsaeedSM GrantRE et al.The case for semi\u2010permanent cache occupancy: understanding the impact of data locality on network processing. In: Proceedings of the 47th International Conference on Parallel Processing (ICPP);2018;Eugene OR.","DOI":"10.1145\/3225058.3225130"},{"key":"e_1_2_8_52_1","doi-asserted-by":"crossref","unstructured":"HjelmN DosanjhMG GrantRE GrovesT BridgesP ArnoldD.Improving MPI multi\u2010threaded RMA communication performance. In: Proceedings of the 47th International Conference on Parallel Processing (ICPP);2018;Eugene OR.","DOI":"10.1145\/3225058.3225114"},{"key":"e_1_2_8_53_1","doi-asserted-by":"crossref","unstructured":"BridgesPG DosanjhMG GrantR SkjellumA FarmerS BrightwellR.Preparing for exascale: modeling MPI for many\u2010core systems using fine\u2010grain queues. In: Proceedings of the 3rd Workshop on Exascale MPI (ExaMPI);2015;Austin TX.","DOI":"10.1145\/2831129.2831134"},{"key":"e_1_2_8_54_1","doi-asserted-by":"crossref","unstructured":"StarkDT BarrettRF GrantRE OlivierSL PedrettiKT VaughanCT.Early experiences co\u2010scheduling work and communication tasks for hybrid MPI+X applications. Paper presented at: 2014 Workshop on Exascale MPI at Supercomputing Conference;2014;New Orleans LA.","DOI":"10.1109\/ExaMPI.2014.6"},{"key":"e_1_2_8_55_1","doi-asserted-by":"crossref","unstructured":"BarrettRF StarkDT VaughanCT GrantRE OlivierSL PedrettiKT.Toward an evolutionary task parallel integrated MPI+X programming model. In: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM);2015;San Francisco CA.","DOI":"10.1145\/2712386.2712388"},{"key":"e_1_2_8_56_1","doi-asserted-by":"crossref","unstructured":"DosanjhMG GrantRE BridgesPG BrightwellR.Re\u2010evaluating network onload vs. offload for the many\u2010core era. Paper presented at: 2015 IEEE International Conference on Cluster Computing;2015;Chicago IL.","DOI":"10.1109\/CLUSTER.2015.55"},{"key":"e_1_2_8_57_1","doi-asserted-by":"crossref","unstructured":"SchneiderT HoeflerT GrantRE BarrettB BrightwellR.Protocols for fully offloaded collective operations on accelerated network adapters. Paper presented at: 2013 42nd International Conference on Parallel Processing;2013;Lyon France.","DOI":"10.1109\/ICPP.2013.73"},{"key":"e_1_2_8_58_1","doi-asserted-by":"crossref","unstructured":"HoeflerT Di\u00a0GirolamoS TaranovK GrantRE BrightwellR.spin: high\u2010performance streaming processing in the network. In: Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis (SC);2017;Denver CO.","DOI":"10.1145\/3126908.3126970"}],"container-title":["Concurrency and Computation: Practice and Experience"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/api.wiley.com\/onlinelibrary\/tdm\/v1\/articles\/10.1002%2Fcpe.5150","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/cpe.5150","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/full-xml\/10.1002\/cpe.5150","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/cpe.5150","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,2]],"date-time":"2023-09-02T09:28:07Z","timestamp":1693646887000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/cpe.5150"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,2,27]]},"references-count":57,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2020,2,10]]}},"alternative-id":["10.1002\/cpe.5150"],"URL":"https:\/\/doi.org\/10.1002\/cpe.5150","archive":["Portico"],"relation":{},"ISSN":["1532-0626","1532-0634"],"issn-type":[{"type":"print","value":"1532-0626"},{"type":"electronic","value":"1532-0634"}],"subject":[],"published":{"date-parts":[[2019,2,27]]},"assertion":[{"value":"2017-10-10","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-10-25","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-02-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"e5150"}}