ECE 4100/6100 Advanced Computer Architecture: Lecture 13 Multithreading and Multicore Processors
Multi-Tasking Paradigm
• Virtual memory makes multi-tasking easy
• Context switch could be expensive
[Figure: FU1-FU4 occupancy over execution time for a conventional single-threaded superscalar; Thread 1 runs for its time quantum while many issue slots go unused]
Multi-threading Paradigm
[Figure: FU1-FU4 occupancy over execution time under multithreading; instructions from Threads 1-5 fill issue slots that would otherwise go unused]
Conventional Multithreading
• Zero-overhead context switch
• Duplicated contexts for threads
[Figure: a register file holding duplicated per-thread contexts (0:r0-0:r7, 1:r0-1:r7, 2:r0-2:r7, 3:r0-3:r7); a context pointer (CtxtPtr) selects the active thread's registers]
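A minimal sketch of the idea in C, assuming the 4-context, 8-register layout in the figure; the names (regfile, ctxt_ptr, read_reg) are illustrative, not any real machine's interface.

    #include <stdint.h>

    #define NUM_CTXTS     4   /* hardware thread contexts            */
    #define REGS_PER_CTXT 8   /* architectural registers per context */

    /* One flat register file holds all contexts back to back. */
    static uint64_t regfile[NUM_CTXTS * REGS_PER_CTXT];
    static unsigned ctxt_ptr;  /* CtxtPtr: the active context */

    /* Read architectural register r of the active thread. */
    static uint64_t read_reg(unsigned r) {
        return regfile[ctxt_ptr * REGS_PER_CTXT + r];
    }

    /* Switching threads only rewrites ctxt_ptr; no registers are
       saved or restored, hence the zero-overhead context switch. */
    static void context_switch(unsigned next_ctxt) {
        ctxt_ptr = next_ctxt;
    }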
Cycle Interleaving MT
• Per-cycle, per-thread instruction fetching (see the sketch after this list)
• Examples: HEP, Horizon, Tera MTA, MIT M-machine
• Interesting questions to consider
– Does it need a sophisticated branch predictor?
– Or does it need any speculative execution at all?
• Can we get rid of branch prediction?
• Can we get rid of predication?
– Does it need any out-of-order execution capability?
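A minimal sketch of per-cycle interleaving in C, assuming a barrel-style fetch stage; the thread count echoes the Tera MTA below, but the structures and names are illustrative, not any specific machine's design.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_THREADS 128

    typedef struct {
        uint64_t pc;
        bool     ready;   /* false while stalled on memory, etc. */
    } ThreadCtx;

    static ThreadCtx threads[NUM_THREADS];

    /* Each cycle, fetch one instruction from the next ready thread in
       round-robin order. With enough ready threads, a thread's branch
       resolves before it fetches again, so no branch prediction (or
       speculation at all) is required. */
    static int select_thread_this_cycle(int last) {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int t = (last + i) % NUM_THREADS;
            if (threads[t].ready)
                return t;          /* fetch from threads[t].pc */
        }
        return -1;                 /* no ready thread: pipeline bubble */
    }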
Tera Multi-Threaded Architecture
• Cycle-by-cycle interleaving
• The MTA can context-switch every cycle (3ns)
• As many as 128 distinct threads (hiding 384ns of latency)
• 3-wide VLIW instruction format (M + ALU + ALU/Br)
• Each instruction carries a 3-bit dependence lookahead field
– Indicates how many subsequent instructions are independent of it
– Allows up to 7 future VLIW instructions to execute before a switch
Loop:  nop      r1=r2+r3   r5=r6+4     lookahead=1  ; only the next instruction is independent (r1, r5 used two ahead)
       nop      r8=r9-r10  r11=r12-r13 lookahead=2  ; the next two instructions are independent
       [r5]=r1  r4=r4-1    bnz Loop    lookahead=0  ; nothing may issue past the branch
Block Interleaving MT
• Context switch on a specific event (dynamic pipelining)
– Explicit switching: implement a switch instruction
– Implicit switching: triggered when a specific instruction class is fetched
• Static switching (switch upon fetch)
– Switch-on-memory-instructions: Rhamma processor
– Switch-on-branch or switch-on-hard-to-predict-branch
– The trigger can be an implicit or explicit instruction
• Dynamic switching
– Switch-on-cache-miss (switch in a later pipeline stage): MIT Sparcle (MIT Alewife's node), Rhamma processor
– Switch-on-use (a lazy version of switch-on-cache-miss); see the sketch after this list
• Wait until the loaded value is actually used
• A valid bit is needed for each register: cleared when the load issues, set when the data returns
– Switch-on-signal (e.g., interrupt)
– Predicated switch instructions based on conditions
• No need to support a large number of threads
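A minimal sketch of switch-on-use in C, using the per-register valid bit described above; the structure and function names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_REGS 32

    typedef struct {
        uint64_t regs[NUM_REGS];
        bool     valid[NUM_REGS];  /* cleared on load issue, set on fill */
    } HwThread;

    /* Issue a load: clear the destination's valid bit and keep going;
       the lazy strategy does not switch on the miss itself. */
    static void issue_load(HwThread *t, int rd) {
        t->valid[rd] = false;
        /* ... send the request to the memory system ... */
    }

    /* Cache fill returns: write the data and set the valid bit. */
    static void load_fill(HwThread *t, int rd, uint64_t data) {
        t->regs[rd] = data;
        t->valid[rd] = true;
    }

    /* A register read switches threads only if the value is still
       outstanding, i.e., switch at the last minute, on use. */
    static bool read_or_switch(HwThread *t, int rs, uint64_t *val) {
        if (!t->valid[rs])
            return false;          /* caller context-switches */
        *val = t->regs[rs];
        return true;
    }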
NVidia Fermi GPGPU Architecture
Nvidia’s Streaming Multiprocessor (SM)
• SIMD execution model
• Issue one instruction from each warp to 16 CUDA cores
• One warp = 32 parallel threads (see the sketch below)
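A minimal sketch of this issue pattern in C: a 32-thread warp executes one instruction across 16 cores over two cycles. The names and loop structure are illustrative, not Fermi's actual scheduler.

    #define WARP_SIZE 32   /* parallel threads per warp        */
    #define NUM_CORES 16   /* CUDA cores fed by one issue slot */

    typedef struct { int pc; } Warp;

    /* One issue: the same instruction runs for all 32 threads,
       16 lanes at a time, so the warp occupies the cores for
       WARP_SIZE / NUM_CORES = 2 cycles. */
    static void issue_warp(Warp *w) {
        for (int cycle = 0; cycle < WARP_SIZE / NUM_CORES; cycle++) {
            for (int lane = 0; lane < NUM_CORES; lane++) {
                int tid = cycle * NUM_CORES + lane;
                (void)tid;   /* execute instruction at w->pc for thread tid */
            }
        }
        w->pc++;   /* all 32 threads advance in lockstep (SIMD) */
    }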
[Figure: SMT superscalar pipeline; per-thread PCs and rename tables feed shared fetch and decode stages, RS & ROB, and a shared physical register file, with function units ALU1, ALU2, FAdd, FMult (4 cycles), unpipelined FDiv (16 cycles), and Load/Store (variable latency), backed by an I-cache and D-cache]
Instruction Fetching Policy
• FIFO or round-robin: simple, but may be too naive
• Adaptive fetching policies (see the sketch after this list)
– BRCOUNT (reduce wrong-path issuing)
• Count the number of branch instructions in the decode/rename/IQ stages
• Give top priority to the thread with the lowest BRCOUNT
– MISSCOUNT (reduce IQ clog)
• Count the number of outstanding D-cache misses
• Give top priority to the thread with the lowest MISSCOUNT
– ICOUNT (reduce IQ clog)
• Count the number of instructions in the decode/rename/IQ stages
• Give top priority to the thread with the lowest ICOUNT
– IQPOSN (reduce IQ clog)
• Give lowest priority to threads whose instructions sit closest to the head of the INT or FP instruction queues
– Threads with the oldest instructions are the most prone to clogging the IQ
• No counters needed
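A minimal sketch of ICOUNT-style selection in C, assuming a per-thread count of instructions in the decode/rename/IQ stages; names are illustrative.

    #define NUM_THREADS 4

    /* icount[t] = instructions thread t has in decode/rename/IQ. */
    static int icount[NUM_THREADS];

    /* Each cycle, fetch from the thread with the fewest in-flight
       front-end instructions; a thread clogging the IQ automatically
       loses fetch priority until its instructions drain. */
    static int icount_pick(void) {
        int best = 0;
        for (int t = 1; t < NUM_THREADS; t++)
            if (icount[t] < icount[best])
                best = t;
        return best;
    }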
Resource Sharing
• Can be tricky when threads compete for resources
• Static partitioning
– Less complexity
– Can penalize threads (e.g., effective instruction window size)
– Used by the P4's Hyper-Threading
• Dynamic sharing
– More complex
– What is fair? How do we quantify fairness?
Alpha 21464 (EV8) Processor Architecture
• Chip characteristics
– ~1.2V Vdd
– ~250 million transistors
– ~1100 signal pins in flip-chip packaging
SMT Pipeline
[Figure: SMT pipeline; the PC and Icache feed fetch, a register map handles renaming, and the register file (Regs) is read before execution and written back after the Dcache access]
1000 Sun’s
Surface
Hot plate
Pentium III ® processor “Surpassed hot-plate power
10 Pentium II ® processor density in 0.5µm; Not too long
Pentium Pro ® processor to reach nuclear reactor,”
Former Intel Fellow Fred
i386 Pentium ® processor Pollack.
i486
1
1.5µ 1µ 0 .7µ 0.5 µ 0.35µ 0.25 µ 0.18µ 0.13µ 0.1µ 0.07µ
Latest Power Density Trend
Multi-core Processor Gala
Intel’s Multicore Roadmap
[Figure: Intel multicore roadmap across mobile, desktop, and enterprise lines; single-core parts (SC 512KB, 1MB, 1/2MB) give way to dual-core (DC 2MB, 2/4MB shared, 3MB/6MB shared at 45nm, 4MB, 16MB) and quad-core (QC 4MB, 8/16MB) parts, reaching 8 cores with 12MB shared cache at 45nm]
Intel TeraFlops Research Prototype
• 2KB Data Memory
• 3KB Instruction Memory
• No coherence support
• 2 FMACs
Intel Single-chip Cloud Computer (SCC)
Scalable many-core architecture
• Dual-core (P54C x86) tiles
• 24 tiles (48 cores)
• 4 DDR3 controllers
• 2D mesh network-on-chip (NoC)
Georgia Tech 64-Core 3D-MAPS Many-Core Chip
Is a Multi-core really better off?
[Photo: IBM's Deep Blue]
Major Challenges for Multi-Core Designs
• Communication
– Memory hierarchy
– Data allocation (you have a large shared L2/L3 now)
– Interconnection network
• AMD HyperTransport
• Intel QPI
– Scalability
– Bus bandwidth: how do we get there?
• Power-performance: win or lose? (see the sketch after this list)
– Borkar's multicore arguments
• A 15% per-core performance drop can buy roughly 50% power savings
• A giant single core wastes power when the task is small
– What about leakage?
• Process variation and yield
• Programming model
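A back-of-the-envelope version of Borkar's argument in C, assuming dynamic power scales as C·V²·f and that voltage scales down with frequency; the cube-law numbers below are illustrative, and the slide's ~50% figure depends on how far voltage actually drops.

    #include <stdio.h>

    int main(void) {
        double s = 0.85;               /* 15% per-core frequency (and V) drop */
        double core_power = s * s * s; /* V^2 * f gives ~0.61 of original     */
        printf("one scaled core:  %.0f%% perf, %.0f%% power\n",
               100 * s, 100 * core_power);
        printf("two scaled cores: %.0f%% peak throughput, %.0f%% power\n",
               100 * 2 * s, 100 * 2 * core_power);
        return 0;   /* ~170% throughput for ~123% of the power */
    }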
Intel Core 2 Duo
• Homogeneous cores
• Bus-based on-chip interconnect
• Shared on-die cache memory
• Traditional I/O
Core 2 Duo Microarchitecture
Why Sharing on-die L2?
Intel Core 2 Duo (Merom)
Core™ µArch – Wide Dynamic Execution
Core™ µArch – Macro Fusion
Smart Memory Access
Intel Quad-Core Processor (Kentsfield, Clovertown)
Source: Intel
AMD Quad-Core Processor (Barcelona)
[Figure callout: on a different power plane from the cores]
Source: AMD
Intel Penryn Dual-Core (First 45nm µprocessor)
Source: Intel
Intel Arrandale Processor
• 32nm
• Unified 3MB L3
• Power sharing (Turbo Boost) between cores and graphics via DFS
AMD 12-Core “Magny-Cours” Opteron
• 45nm
• 4 memory channels
Sun UltraSparc T1
• Eight cores, each 4-way threaded
• Fine-grained multithreading
– Thread-selection logic takes out threads that encounter long-latency events
– Otherwise round-robin, cycle-by-cycle
– 4 threads in a group share a processing pipeline (Sparc pipe)
• 1.2 GHz (90nm)
• In-order, 8 instructions per cycle across the chip (single issue from each core)
• Caches
– 16KB 4-way L1-I (32B lines)
– 8KB 4-way L1-D (16B lines)
– Blocking caches (a reason for MT)
– 4-banked 12-way 3MB L2, shared by all cores, plus 4 memory controllers
– Data moves between the L2 and the cores over an integrated crossbar switch for high throughput (200GB/s)
Sun UltraSparc T1
• Thread-select logic marks a thread inactive based on
– Instruction type
• A predecode bit in the I-cache flags long-latency instructions
– Misses
– Traps
– Resource conflicts
Sun UltraSparc T2
• A fatter version of the T1
• 1.4GHz (65nm)
• 8 threads per core, 8 cores on-die
• 1 FPU per core (vs. 1 FPU per die in T1), 16 integer EUs (8 in T1)
• L2 increased to an 8-banked 16-way 4MB shared cache
• 8-stage integer pipeline (vs. 6 for T1)
• 16 instructions per cycle
• One PCI Express port (x8, 1.0)
• Two 10 Gigabit Ethernet ports with packet classification and filtering
• Eight encryption engines
• Four dual-channel FBDIMM memory controllers
• 711 signal I/Os, 1831 pins total
STI Cell Broadband Engine
• Heterogeneous!
• 9 cores, 10 threads
• One 64-bit PowerPC core
• Eight SPEs
– In-order, dual-issue
– 128-bit SIMD
– 128 x 128-bit register file
– 256KB local store (fast local SRAM)
– Globally coherent DMA (128B/cycle)
– 128+ concurrent transactions to memory per core
• High bandwidth
– EIB (96B/cycle)
Cell Chip Block Diagram
[Figure: Cell chip block diagram, including the synergistic memory flow controller]
BACKUP
Non-Uniform Cache Architecture
• Proposed by UT-Austin at ASPLOS 2002
• Facts
– Large shared on-die L2
– Wire delay dominates on-die cache access time
Multi-banked L2 cache
[Figure: 2MB L2 at 130nm with 128KB banks; 11-cycle bank access]
Multi-banked L2 cache
[Figure: 16MB L2 at 50nm with 64KB banks and 47-cycle access; each bank comprises a data array, tag array, predecoder, wordline drivers and decoders, and sense amplifiers, connected by address and data buses]
• Uses a private per-bank channel
• Each bank has its own distinct access latency
• Data placement is statically determined by address (see the sketch below)
• Average access latency = 34.2 cycles
• Wire overhead = 20.9%, which is an issue
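A minimal sketch of static-NUCA bank selection in C: address bits pick the bank, and each bank has its own fixed latency. The mapping and latency table are illustrative.

    #include <stdint.h>

    #define NUM_BANKS 256   /* e.g., 16MB of 64KB banks */

    /* Per-bank latency grows with wire distance from the controller;
       the values would be filled in from the floorplan. */
    static int bank_latency[NUM_BANKS];

    /* Static NUCA: the bank is a fixed function of the address,
       so a given line always lives in the same bank. */
    static int bank_of(uint64_t addr) {
        return (int)((addr >> 6) & (NUM_BANKS - 1));  /* 64B lines */
    }

    static int access_latency(uint64_t addr) {
        return bank_latency[bank_of(addr)];  /* distinct per bank */
    }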
Static NUCA-2
[Figure: Static NUCA-2 organization; banks (tag array, predecoder, wordline driver and decoder) connect through per-bank switches to the data bus]