Abstract
Next-generation large single-chip multicores will process massive data with varying degrees of locality. Harnessing on-chip data locality to optimize the utilization of on-chip cache and network resources is of fundamental importance. We propose a locality-aware selective data replication protocol for the last-level cache (LLC). The goal is to lower memory access latency and energy by replicating only cache lines with high reuse in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. The approach relies on a low-overhead yet highly accurate in-hardware runtime classifier that operates at cache line granularity and allows replication only of cache lines with high reuse. Furthermore, the classifier captures the LLC pressure at the existing replica locations and adapts its replication decisions accordingly. On a set of parallel benchmarks, the proposed protocol reduces overall energy by 14.7, 10.7, 10.5, and 16.7 % and completion time by 2.5, 6.5, 4.5, and 9.5 % when compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA, and Static-NUCA LLC management schemes, respectively. An efficient classifier implementation is evaluated with an overhead of 5.44 KB, which translates to only 1.58 % on top of the Static-NUCA baseline’s cache-related per-core storage.
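The idea of replicating only high-reuse lines can be illustrated with a small software model. The following is a minimal sketch under stated assumptions, not the hardware classifier evaluated in the article: the class name `ReuseClassifier`, the parameter `replication_threshold`, its default value of 3, and the returned strings are all hypothetical, and the adaptation to LLC pressure at replica locations is not modeled.

```python
from collections import defaultdict


class ReuseClassifier:
    """Toy per-(core, cache line) reuse tracker.

    Illustrative only: names and the threshold value are assumptions,
    not taken from the article.
    """

    def __init__(self, replication_threshold=3):
        # Hypothetical threshold: number of accesses served by the home
        # LLC slice before a line is considered "high reuse" for a core.
        self.replication_threshold = replication_threshold
        self.reuse = defaultdict(int)   # (core, line) -> home access count
        self.replicas = set()           # (core, line) pairs with a local replica

    def access(self, core, line):
        """Classify one LLC access by `core` to cache line `line`."""
        key = (core, line)
        if key in self.replicas:
            # The line was replicated earlier in this core's LLC slice.
            return "local_replica_hit"
        self.reuse[key] += 1
        if self.reuse[key] >= self.replication_threshold:
            # Reuse is high enough: serve from home and create a replica.
            self.replicas.add(key)
            return "home_access_and_replicate"
        return "home_access"

    def invalidate(self, line):
        """Drop all replicas and reuse history for `line` (e.g. on a write)."""
        for key in [k for k in self.reuse if k[1] == line]:
            del self.reuse[key]
            self.replicas.discard(key)


clf = ReuseClassifier(replication_threshold=3)
outcomes = [clf.access(core=0, line=0x40) for _ in range(4)]
print(outcomes)
```

In this model a line accessed only once or twice by a core is always served from its home slice and never consumes local LLC capacity, which is the intuition behind keeping the off-chip miss rate low while still shortening the access path for frequently reused lines.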














Hijaz, F., Shi, Q., Kurian, G. et al. Locality-aware data replication in the last-level cache for large scale multicores. J Supercomput 72, 718–752 (2016). https://doi.org/10.1007/s11227-015-1608-4
DOI: https://doi.org/10.1007/s11227-015-1608-4