{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T08:48:14Z","timestamp":1773305294689,"version":"3.50.1"},"reference-count":35,"publisher":"Wiley","issue":"7","license":[{"start":{"date-parts":[[2015,5,18]],"date-time":"2015-05-18T00:00:00Z","timestamp":1431907200000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/onlinelibrary.wiley.com\/termsAndConditions#vor"}],"funder":[{"DOI":"10.13039\/501100002347","name":"Bundesministerium f\u00fcr Bildung und Forschung","doi-asserted-by":"publisher","award":["01IH08003A"],"award-info":[{"award-number":["01IH08003A"]}],"id":[{"id":"10.13039\/501100002347","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Concurrency and Computation"],"published-print":{"date-parts":[[2016,5]]},"abstract":"<jats:title>Summary<\/jats:title><jats:p>Memory\u2010bound algorithms show complex performance and energy consumption behavior on multicore processors. We choose the lattice Boltzmann method on an Intel Sandy Bridge cluster as a prototype scenario to investigate if and how single\u2010chip performance and power characteristics can be generalized to the highly parallel case. First, we perform an analysis of a sparse\u2010lattice lattice Boltzmann method implementation for complex geometries. Using a single\u2010core performance model, we predict the intra\u2010chip saturation characteristics and the optimal operating point in terms of energy\u2010to\u2010solution as a function of implementation details, clock frequency, vectorization, and number of active cores per chip. We show that high single\u2010core performance and a correct choice of the number of active cores per chip are the essential optimizations for the lowest energy\u2010to\u2010solution at minimal performance degradation. Then we extrapolate to the Message Passing Interface (MPI)\u2010parallel level and quantify the energy\u2010saving potential of various optimizations and execution modes, where we find these guidelines to be even more important, especially when communication overhead is non\u2010negligible. In our setup, we could achieve energy savings of 35% in this case, compared with a naive approach. We also demonstrate that a simple non\u2010reflective reduction of the clock speed leaves most of the energy\u2010saving potential unused. Copyright \u00a9 2015 John Wiley &amp; Sons, Ltd.<\/jats:p>","DOI":"10.1002\/cpe.3489","type":"journal-article","created":{"date-parts":[[2015,5,18]],"date-time":"2015-05-18T20:14:12Z","timestamp":1431980052000},"page":"2295-2315","source":"Crossref","is-referenced-by-count":26,"title":["Chip\u2010level and multi\u2010node analysis of energy\u2010optimized lattice Boltzmann CFD simulations"],"prefix":"10.1002","volume":"28","author":[{"given":"Markus","family":"Wittmann","sequence":"first","affiliation":[{"name":"Erlangen Regional Computing Center (RRZE)  Martensstr. 1 91058 Erlangen Germany"}]},{"given":"Georg","family":"Hager","sequence":"additional","affiliation":[{"name":"Erlangen Regional Computing Center (RRZE)  Martensstr. 1 91058 Erlangen Germany"}]},{"given":"Thomas","family":"Zeiser","sequence":"additional","affiliation":[{"name":"Erlangen Regional Computing Center (RRZE)  Martensstr. 1 91058 Erlangen Germany"}]},{"given":"Jan","family":"Treibig","sequence":"additional","affiliation":[{"name":"Erlangen Regional Computing Center (RRZE)  Martensstr. 1 91058 Erlangen Germany"}]},{"given":"Gerhard","family":"Wellein","sequence":"additional","affiliation":[{"name":"Erlangen Regional Computing Center (RRZE)  Martensstr. 1 91058 Erlangen Germany"}]}],"member":"311","published-online":{"date-parts":[[2015,5,18]]},"reference":[{"key":"e_1_2_8_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2008.01.013"},{"key":"e_1_2_8_3_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cpc.2003.12.003"},{"key":"e_1_2_8_4_1","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.72.016706"},{"key":"e_1_2_8_5_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.compfluid.2005.02.008"},{"key":"e_1_2_8_6_1","doi-asserted-by":"crossref","unstructured":"BernaschiM SucciS FytaM KaxirasE MelchionnaS SircarJK.MUPHY: a parallel high performance MUlti PHYsics\/Scale code.IEEE International Symposium on Parallel and Distributed Processing IPDPS 2008 (IPDPS) 2008;1\u20138. DOI:10.1109\/IPDPS.2008.4536464.","DOI":"10.1109\/IPDPS.2008.4536464"},{"key":"e_1_2_8_7_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.camwa.2007.08.001"},{"key":"e_1_2_8_8_1","doi-asserted-by":"crossref","unstructured":"BaileyP MyreJ WalshSDC LiljaDJ SaarMO.Accelerating lattice Boltzmann fluid flow simulations using graphics processors.IEEE International Conference on Parallel Processing 2009 (ICPP'09) 2009;550\u2013557. DOI:10.1109\/ICPP.2009.38.","DOI":"10.1109\/ICPP.2009.38"},{"key":"e_1_2_8_9_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.compfluid.2009.09.011"},{"key":"e_1_2_8_10_1","unstructured":"ZudropJ KlimachH HasertM MasilamaniK RollerS.A fully distributed CFD framework for massively parallel systems.Cray Users Group Conference 2011 Stuttgart Germany. (Available from:https:\/\/cug.org\/proceedings\/attendee_program_cug2012\/includes\/files\/pap136.pdf) [Accessed on 26 March 2015] 2012."},{"key":"e_1_2_8_11_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.camwa.2012.05.002"},{"key":"e_1_2_8_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1498765.1498785"},{"key":"e_1_2_8_13_1","doi-asserted-by":"crossref","unstructured":"PetersA MelchionnaS KaxirasE L\u00e4ttJ SircarJK BernaschiM BissonM SucciS.Multiscale simulation of cardiovascular flows on the IBM Bluegene\/P: full heart\u2010circulation system at red\u2010blood cell resolution.Proceedings of the ACM\/IEEE International Conference for High Performance Computing Networking and Storage SC 2010 New Orleans LA USA IEEE November 13\u201319 2010;1\u201310. DOI:10.1109\/SC.2010.33.","DOI":"10.1109\/SC.2010.33"},{"key":"e_1_2_8_14_1","unstructured":"CarterJ SoeM OlikerL TsudaY VahalaG VahalaL MacnabA.Magnetohydrodynamic turbulence simulations on the earth simulator using the lattice Boltzmann method.Proceedings of the ACM\/IEEE International Conference for High Performance Computing Networking and Storage (SC05) Seattle WA November 12\u201318 2005."},{"key":"e_1_2_8_15_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2009.04.002"},{"key":"e_1_2_8_16_1","unstructured":"KellerV.Energy\u2010to\u2010solution: a today's metric for tomorrow's concerns.Talk at the Symposium on Future Generations of Processors and Systems (FGPS'175) Mons Belgium November 9 2012. (Available from:http:\/\/www.ig.fpms.ac.be\/sites\/default\/files\/FGPS175-Keller.pdf) [Accessed on 26 March 2015]."},{"key":"e_1_2_8_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-40047-6_1"},{"key":"e_1_2_8_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00450-010-0119-z"},{"key":"e_1_2_8_19_1","doi-asserted-by":"crossref","unstructured":"HsuC\u2010H KuehnJA PooleSW.Towards efficient supercomputing: searching for the right efficiency metric.Proceedings of the 3rd ACM\/SPEC International Conference on Performance Engineering ICPE '12 ACM New York NY USA 2012;157\u2013162.","DOI":"10.1145\/2188286.2188309"},{"key":"e_1_2_8_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2012.95"},{"key":"e_1_2_8_21_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3180"},{"key":"e_1_2_8_22_1","unstructured":"Sch\u00f6neR HackenbergD MolkaD.Memory performance at reduced CPU clock speeds: an analysis of current x86\u201064 processors.Proceedings of the 2012 USENIX Conference on Power\u2010Aware Computing and Systems HotPower'12 USENIX Association Berkeley CA USA 2012. (Available from:https:\/\/www.usenix.org\/system\/files\/conference\/hotpower12\/ hotpower12-final5.pdf) [Accessed on 26 March 2015]."},{"key":"e_1_2_8_23_1","doi-asserted-by":"crossref","unstructured":"ChoiJW BedardD FowlerR VuducR.A roofline model of energy.2013 IEEE 27th International Symposium on Parallel Distributed Processing (IPDPS) 2013;661\u2013672. DOI:10.1109\/IPDPS.2013.77.","DOI":"10.1109\/IPDPS.2013.77"},{"key":"e_1_2_8_24_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626409000389"},{"issue":"2","key":"e_1_2_8_25_1","first-page":"427","article-title":"Two\u2010relaxation\u2010time lattice Boltzmann scheme: about parametrization, velocity, pressure and mixed boundary conditions","volume":"3","author":"Ginzburg I","year":"2008","journal-title":"Communications and Computer of Physics"},{"key":"e_1_2_8_26_1","unstructured":"SuperMUC petascale system. (Available from:http:\/\/www.lrz.de\/services\/compute\/supermuc) [Accessed on 26 March 2015]."},{"key":"e_1_2_8_27_1","first-page":"615","volume-title":"Proceedings of the Workshop Memory Issues on Multi\u2010 and Manycore Platforms at PPAM 2009, the 8th International Conference on Parallel Processing and Applied Mathematics","author":"Treibig J","year":"2010"},{"key":"e_1_2_8_28_1","volume-title":"Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers","author":"Sch\u00f6nauer Willi","year":"2000"},{"key":"e_1_2_8_29_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626410000296"},{"key":"e_1_2_8_30_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342012442424"},{"key":"e_1_2_8_31_1","unstructured":"Intel.Intel Architecture Code Analyzer June2012. (Available from:http:\/\/software.intel.com\/en-us\/articles\/intel-architecture-code-analyzer\/) [Accessed on 26 March 2015]."},{"key":"e_1_2_8_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/1353534.1346317"},{"key":"e_1_2_8_33_1","unstructured":"Intel Corp.Intel 64 and IA\u201032 Architectures Software Developer's Manual 2013. (Available from:http:\/\/download.intel.com\/products\/processor\/manual\/325384.pdf) [Accessed on 26 March 2015]."},{"key":"e_1_2_8_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2012.12"},{"key":"e_1_2_8_35_1","unstructured":"HuberH.LRZ Private Communication."},{"key":"e_1_2_8_36_1","unstructured":"Intel Corp.Intel MPI benchmarks. (Available from:http:\/\/software.intel.com\/en-us\/articles\/intel-mpi-benchmarks) [Accessed on 26 March 2015]."}],"container-title":["Concurrency and Computation: Practice and Experience"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/api.wiley.com\/onlinelibrary\/tdm\/v1\/articles\/10.1002%2Fcpe.3489","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/cpe.3489","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,2]],"date-time":"2023-09-02T14:16:40Z","timestamp":1693664200000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/cpe.3489"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,5,18]]},"references-count":35,"journal-issue":{"issue":"7","published-print":{"date-parts":[[2016,5]]}},"alternative-id":["10.1002\/cpe.3489"],"URL":"https:\/\/doi.org\/10.1002\/cpe.3489","archive":["Portico"],"relation":{},"ISSN":["1532-0626","1532-0634"],"issn-type":[{"value":"1532-0626","type":"print"},{"value":"1532-0634","type":"electronic"}],"subject":[],"published":{"date-parts":[[2015,5,18]]}}}