High Performance Scientific Computing: S. Gopalakrishnan
Lecture 4
Memory Issues
Memory hierarchy

Typical hierarchy, from faster and costlier (per byte) at the top to slower and cheaper at the bottom: registers, cache, main memory, disk.
Memory Latency Problem

Cache and virtual memory bridge adjacent levels of the hierarchy: data moves between CPU and cache in words (~8 B), between cache and main memory in blocks (~32 B), and between main memory and disk in pages (~4 KB).

Processor-DRAM Memory Performance Gap: the motivation for the memory hierarchy.

[Figure: performance (log scale) vs. time, 1980-2000. µProc performance grows ~60%/yr (2X every 1.5 years), while DRAM performance grows only ~5%/yr (2X every 15 years), so the processor-memory performance gap widens steadily.]

! Notice that the data width is changing (grows 50% / year)
• Why?
! Bandwidth: transfer rate between the various levels
• CPU-Cache: 24 GBps
• Cache-Main: 0.5-6.4 GBps
• Main-Disk: 187 MBps (Serial ATA/1500)

Source: ECE232 (Memory Hierarchy), UMass-Amherst; adapted from Computer Organization and Design, Patterson & Hennessy, UCB; Kundu, Koren, UMass.
Virtual Memory and Paging

[Figure: each process's virtual memory (a per-process address space) is mapped page-by-page onto physical memory (RAM); another process's memory occupies other physical pages, and non-resident pages live on disk.]

Technology at each level of the hierarchy: registers, SRAM (L1 and L2 caches), DRAM (main memory), disk.
Introduction to Parallel Programming
Shared-Memory Processing

Each processor can access the entire data space.
– Pros
• Easier to program
• Amenable to automatic parallelism
• Can be used to run large-memory serial programs
– Cons
• Expensive
• Difficult to implement at the hardware level
• Processor count limited by contention/coherency (currently around 512)
• Watch out for the "NU" part of "NUMA"
Distributed-Memory Machines

! Each node in the computer has a locally addressable memory space
! The computers are connected together via some high-speed network
– Infiniband, Myrinet, Giganet, etc.
• Pros
– Really large machines
– Size limited only by gross physical considerations:
• Room size
• Cable lengths (10's of meters)
• Power/cooling capacity
• Money!
– Cheaper to build and run
• Cons
– Harder to program
– Data locality
MPPs (Massively Parallel Processors)

Distributed memory at the largest scale, often with shared memory at lower levels of the hierarchy.
• IBM BlueGene/L (LLNL)
– 131,072 700 MHz processors
– 256 MB of RAM per processor
– Balanced compute speed with interconnect
• Red Storm (Sandia National Labs)
– 12,960 dual-core 2.4 GHz Opterons
– 4 GB of RAM per processor
– Proprietary SeaStar interconnect
Comparison of CPU vs. GPU Architecture

The two reflect fundamentally different design philosophies.

[Figure: the CPU devotes much of its die to control logic and cache alongside a few ALUs; the GPU devotes most of its die to many ALUs with minimal control and cache. Each is backed by its own DRAM.]