Locality-aware data replication in the last-level cache for large scale multicores

Hijaz, Farrukh; Shi, Qingchuan; Kurian, George; Devadas, Srinivas; Khan, Omer

doi:10.1007/s11227-015-1608-4

Published: 04 February 2016

Volume 72, pages 718–752, (2016)

The Journal of Supercomputing

Abstract

Next-generation large single-chip multicores will process massive data with varying degrees of locality. Harnessing on-chip data locality to optimize the utilization of on-chip cache and network resources is of fundamental importance. We propose a locality-aware selective data replication protocol for the last-level cache (LLC). The goal is to lower memory access latency and energy by replicating only cache lines with high reuse in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. The approach relies on a low-overhead yet highly accurate in-hardware runtime classifier that operates at cache-line granularity and only allows replication of cache lines with high reuse. Furthermore, the classifier captures the LLC pressure at the existing replica locations and adapts its replication decision accordingly. On a set of parallel benchmarks, the proposed protocol reduces overall energy by 14.7%, 10.7%, 10.5%, and 16.7% and completion time by 2.5%, 6.5%, 4.5%, and 9.5% when compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA, and Static-NUCA LLC management schemes, respectively. An efficient classifier implementation is evaluated with an overhead of 5.44 KB, which translates to only 1.58% on top of the Static-NUCA baseline's cache-related per-core storage.
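
The idea of replicating only high-reuse cache lines can be illustrated with a small software model. The sketch below is not the classifier described in the paper: the class name `ReuseClassifier`, the per-(line, core) reuse counter, and the threshold value of 3 are all assumptions made for illustration, and the LLC-pressure adaptation mentioned in the abstract is left out.

```python
from collections import defaultdict


class ReuseClassifier:
    """Toy reuse tracker (hypothetical; not the paper's hardware design)."""

    def __init__(self, replication_threshold=3):
        # Assumed threshold: accesses at the home slice before a replica is allowed.
        self.replication_threshold = replication_threshold
        # (line_addr, core_id) -> number of accesses served by the home LLC slice.
        self.reuse = defaultdict(int)
        # (line_addr, core_id) pairs that currently hold a replica in the local slice.
        self.replicas = set()

    def access(self, line_addr, core_id):
        """Classify one LLC access and return where it was served."""
        key = (line_addr, core_id)
        if key in self.replicas:
            return "local_replica_hit"
        self.reuse[key] += 1
        if self.reuse[key] >= self.replication_threshold:
            # Reuse is high enough: serve from home and install a local replica.
            self.replicas.add(key)
            return "home_access_then_replicate"
        return "home_access"

    def evict_replica(self, line_addr, core_id):
        """Drop the replica and restart reuse tracking for this core and line."""
        key = (line_addr, core_id)
        self.replicas.discard(key)
        self.reuse[key] = 0


classifier = ReuseClassifier(replication_threshold=3)
outcomes = [classifier.access(line_addr=0x40, core_id=0) for _ in range(4)]
print(outcomes)
```

In this model a line that a core touches only once or twice is always served by its home slice and never occupies space in the requester's slice, while a line that keeps being reused is eventually served locally. Lowering the assumed threshold replicates more aggressively; raising it preserves more LLC capacity.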

Author information

George Kurian
Present address: Google, Mountain View, CA, USA

Authors and Affiliations

University of Connecticut, Storrs, CT, USA
Farrukh Hijaz, Qingchuan Shi & Omer Khan
Massachusetts Institute of Technology, Cambridge, MA, USA
George Kurian & Srinivas Devadas

Corresponding author

Correspondence to Omer Khan.

About this article

Cite this article

Hijaz, F., Shi, Q., Kurian, G. et al. Locality-aware data replication in the last-level cache for large scale multicores. J Supercomput 72, 718–752 (2016). https://doi.org/10.1007/s11227-015-1608-4

Published: 04 February 2016
Issue Date: February 2016
DOI: https://doi.org/10.1007/s11227-015-1608-4
