Lec3 - Cache and Memory System

This document discusses CPU caches and memory systems. It provides a brief history of CPU caches from the 80386 with no cache to modern CPUs with multiple levels of large caches. It describes the cache hierarchy and bandwidth in modern systems. It discusses concepts like cache lines, prefetching, and cache coherence. It also covers issues that can impact performance like capacity misses, false sharing, and instruction cache effects. Strategies are presented for addressing these issues like blocked algorithms, data layout, and function ordering.


CS 295: Modern Systems

Cache And Memory System

Sang-Woo Jun
Spring, 2019
Watch Video!
CPU Caches and Why You Care – Scott Meyers
o His books are great!
Some History

 80386 (1985) :
Last Intel desktop CPU with no on-chip cache
(Optional on-board cache chip though!)

 80486 (1989) : 4 KB on-chip cache

Coffee Lake (2017):
o 64 KiB L1 per core
o 256 KiB L2 per core
o Up to 2 MiB L3 per core (shared)

What is the Y-axis of the accompanying graph? Most likely a normalized reciprocal of latency.

Source: ExtremeTech, "How L1 and L2 CPU Caches Work, and Why They're an Essential Part of Modern Chips," 2018
Memory System Architecture
[Diagram: two packages (sockets), each containing multiple cores. Each core has its own L1 I$, L1 D$, and L2 $; the cores within a package share an L3 $. Each package connects to its local DRAM, and the packages communicate over QPI/UPI.]
Memory System Bandwidth Snapshot
Cache bandwidth estimate: 64 bytes/cycle ≈ 200 GB/s per core
DRAM (DDR4-2666): 128 GB/s
QPI/UPI (Ultra Path Interconnect): 20.8 GB/s, unidirectional
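As a sanity check on the cache figure (assuming a core clock around 3 GHz, like the i5-7400 used in the examples later): 64 bytes/cycle × 3 × 10⁹ cycles/s ≈ 192 GB/s, which is roughly the 200 GB/s per-core estimate above.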

Memory/PCIe controller used to be on a separate “North bridge” chip, now integrated on-die
All sorts of things are now on-die! Even network controllers! (Specialization!)
Cache Architecture Details
Numbers from modern Xeon processors (Broadwell – Kaby Lake)
Cache Level    Size               Latency (cycles)
L1             64 KiB             < 5
L2             256 KiB            < 20
L3             ~2 MiB per core    < 50
DRAM           100s of GB         > 100*

L1 cache is typically divided into separate instruction and data caches
o Each with 32 KiB

L1 cache accesses can be hidden in the pipeline
o All others take a performance hit

* DRAM subsystems are complicated entities themselves, and latency/bandwidth of the same module varies by situation…
Cache Lines Recap
Caches are managed in Cache Line granularity
o Typically 64 Bytes for modern CPUs
o 64 Bytes == 16 4-byte integers
Reading/Writing happens in cache line granularity
o Read one byte not in cache -> Read all 64 bytes from memory
o Write one byte -> Eventually write all 64 bytes to memory
o Inefficient cache access patterns really hurt performance! (see the sketch below)
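A minimal C++ sketch of that point (not from the slides; assumes 64-byte lines and 4-byte ints): for an array far larger than the caches, both loops below pull the same cache lines from DRAM, even though the second uses only one int out of every sixteen, so both end up limited by the same memory traffic.

#include <vector>

// Sums every element: all 16 ints in each 64-byte line are used.
long long sum_all(const std::vector<int>& v) {
    long long s = 0;
    for (std::size_t i = 0; i < v.size(); i++) s += v[i];
    return s;
}

// Sums one int out of every 16: only 1/16 of each line is used,
// but the full 64-byte line is still transferred from memory.
long long sum_strided(const std::vector<int>& v) {
    long long s = 0;
    for (std::size_t i = 0; i < v.size(); i += 16) s += v[i];
    return s;
}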
Cache Line Effects Example
Multiplying two 2048 x 2048 matrices
o 16 MiB, doesn’t fit in any cache
Machine: Intel i5-7400 @ 3.00GHz
Time to transpose B is also counted
A × B (as stored)          vs.          A × Bᵀ (B transposed first)
63.19 seconds                           10.39 seconds
(6x performance!)
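The slides do not show the code, but the two variants being timed presumably look something like this C++ sketch (row-major float matrices, N = 2048; names are illustrative). In the naive version the inner loop walks down a column of B, touching a new cache line on every step; after transposing B, both inner-loop operands are read sequentially.

const int N = 2048;

void multiply_naive(const float* A, const float* B, float* C) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0;
            for (int k = 0; k < N; k++)
                sum += A[i*N + k] * B[k*N + j];   // column walk through B: cache-hostile
            C[i*N + j] = sum;
        }
}

void multiply_transposed(const float* A, const float* BT, float* C) {
    // BT[j*N + k] == B[k*N + j]; the transpose cost is included in the 10.39 s figure
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0;
            for (int k = 0; k < N; k++)
                sum += A[i*N + k] * BT[j*N + k];  // both operands read sequentially
            C[i*N + j] = sum;
        }
}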
Cache Prefetching
CPU speculatively prefetches cache lines
o While CPU is working on the loaded 64 bytes, 64 more bytes are being loaded
Hardware prefetcher is usually not very complex/smart
o Sequential prefetching (N lines forward or backwards)
o Strided prefetching
Programmer-provided prefetch hints
o __builtin_prefetch(address, r/w, temporal_locality); for GCC (see the sketch below)
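The builtin and its signature are real GCC/Clang; the loop and the prefetch distance of ~8 cache lines below are only an illustrative sketch of how such a hint might be used.

#include <cstddef>

// Scale an array, asking the hardware to start fetching data about
// 8 cache lines (128 floats) ahead of where we are currently working.
// __builtin_prefetch(address, rw, locality): rw 0 = read, 1 = write;
// locality 0 (no reuse expected) .. 3 (high reuse).
void scale(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; i++) {
        if (i + 128 < n)
            __builtin_prefetch(&data[i + 128], 1, 3);  // will be read and written soon
        data[i] *= factor;
    }
}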
Cache Coherence Recap
We won’t go into architectural details
Simply put:
o When a core writes a cache line
o All other copies of that cache line need to be invalidated
Emphasis on cache line
Issue #1:
Capacity Considerations: Matrix Multiply
Performance is best when working set fits into cache
o But as shown, even 2048 x 2048 doesn’t fit in cache
o -> 2048 * 2048 * 2048 elements read from memory for matrix B
Solution: Divide and conquer! – Blocked matrix multiply
o For block size 32 × 32 -> 2048 * 2048 * (2048/32) reads
[Diagram: A × Bᵀ = C, with A split into blocks (A1, …), Bᵀ into blocks (B1, B2, B3, …), and C into sub-matrices (C1, …)]

C1 sub-matrix = A1×B1 + A1×B2 + A1×B3 + …
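A minimal sketch of the blocked (tiled) transposed multiply described above, assuming the 32 × 32 block size from the slide, N divisible by the block size, and C zero-initialized before the call; names are illustrative, not the course's actual code.

const int BLOCK = 32;  // 32 x 32 floats = 4 KiB per tile, comfortably inside L1

// C += A * B, where BT is B already transposed (BT[j*N + k] == B[k*N + j]).
void multiply_blocked(const float* A, const float* BT, float* C, int N) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                // One BLOCK x BLOCK tile of A and of BT stays resident in cache
                // while each of its elements is reused BLOCK times.
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int j = jj; j < jj + BLOCK; j++) {
                        float sum = C[i*N + j];
                        for (int k = kk; k < kk + BLOCK; k++)
                            sum += A[i*N + k] * BT[j*N + k];
                        C[i*N + j] = sum;
                    }
}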


Blocked Matrix Multiply Evaluations
Benchmark                  Elapsed (s)   Normalized Performance   Bottleneck
Naïve                      63.19         1                        -
Transposed                 10.39         6.08                     -
Blocked Transposed         7.35          8.60                     Computation
AVX Transposed             2.20          28.72                    Memory
Blocked AVX Transposed     1.50          42.13                    Computation
8-Thread AVX Transposed    1.09          57.97                    Memory (not scaling!)
AVX Transposed reads from DRAM at 14.55 GB/s
o 2048³ * 4 (bytes) / 2.20 (s) = 14.55 GB/s
o 1× DDR4-2400 channel on this machine -> 18.75 GB/s peak
o Pretty close! Considering DRAM is also used for other things (OS, etc.)
The multithreaded version gets 32 GB/s of effective bandwidth
o Cache effects with small chunks
Issue #2:
False Sharing
Different memory locations, written to by different cores, mapped to the same cache line
o Core 1 performing "results[0]++;"
o Core 2 performing "results[1]++;"
Remember cache coherence:
o Every time a cache line is written to, all other copies need to be invalidated!
o The cache line holding "results" is ping-ponged between cores on every write
o Bad when it happens on-chip, terrible over the processor interconnect (QPI/UPI)
Remember the case in the Scott Meyers video
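A sketch of the scenario above and one common fix (illustrative code, not from the slides): packed counters such as "long results[2]" share one 64-byte line, so the two writing cores invalidate each other on every increment; giving each counter its own cache line removes the ping-pong.

#include <thread>
#include <vector>

// Each counter gets its own 64-byte cache line, so the two writer
// threads no longer invalidate each other's copies.
struct PaddedCounter {
    alignas(64) long value = 0;
};

void count_even_odd(const std::vector<int>& data) {
    PaddedCounter results[2];   // with plain "long results[2]" these would false-share
    std::thread t0([&] { for (int x : data) if (x % 2 == 0) results[0].value++; });
    std::thread t1([&] { for (int x : data) if (x % 2 != 0) results[1].value++; });
    t0.join();
    t1.join();
}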
Results From Scott Meyers Video Again
Issue #3
Instruction Cache Effects
Instruction cache misses can affect performance
o "Linux was routing packets at ~30Mbps [wired], and wireless at ~20. Windows CE was crawling at barely 12Mbps wired and 6Mbps wireless. […] After we changed the routing algorithm to be more cache-local, we started doing 35MBps [wired], and 25MBps wireless – 20% better than Linux."
– Sergey Solyanik, Microsoft
o [By organizing function calls in a cache-friendly way, we] achieved a 34% reduction
in instruction cache misses and a 5% improvement in overall performance.
-- Mircea Livadariu and Amir Kleen, Freescale
Improving Instruction Cache Locality #1
Unfortunately, there is not much we can do explicitly… but we can:
Be careful with loop unrolling and inlining
o They reduce branching overhead, but also reduce the effective I$ size
o When gcc's –O3 performs slower than –O2, this is usually what's happening
o Inlining is typically good for very small* functions
Move conditionals to the front as much as possible
o Long stretches of branch-free code are a good fit for the instruction cache/prefetcher (see the sketch below)
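As an illustration of the branch-placement point, here is a hypothetical C++ sketch (not from the slides): __builtin_expect is a real GCC/Clang builtin that tells the compiler which outcome is the common case, so it can lay out the hot path as one straight, prefetcher-friendly run of instructions.

#include <cstddef>

// Checksum a buffer; the error check almost never fires, and saying so
// lets the compiler move the error path out of the fall-through sequence,
// keeping the hot loop compact and I-cache friendly.
long checksum(const int* packet, std::size_t len) {
    if (__builtin_expect(packet == nullptr || len == 0, 0))
        return -1;                        // cold path, placed out of line
    long sum = 0;
    for (std::size_t i = 0; i < len; i++)
        sum += packet[i];                 // hot path: straight-line, branch-predictable
    return sum;
}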
Improving Instruction Cache Locality #2
Organize function calls to create temporal locality

[Figure: three call-ordering diagrams – "Sequential algorithm," "Ordering changed for cache locality," "Balance to reduce memory footprint"]
Livadariu et al., "Optimizing for instruction caches," EETimes
https://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/
