ECE 4100/6100 Advanced Computer Architecture: Lecture 13 Multithreading and Multicore Processors
Multi-Tasking Paradigm
• Virtual memory makes multi-tasking easy
• Context switch could be expensive
[Figure: FU1-FU4 occupancy over execution time for a conventional single-threaded superscalar; Thread 1 runs for its time quantum while many issue slots go unused]
Multi-threading Paradigm
[Figure: FU1-FU4 occupancy over execution time under multithreading; instructions from Threads 1-5 fill issue slots that would otherwise go unused]
Conventional Multithreading
• Zero-overhead context switch
• Duplicated contexts for threads
[Figure: a register file holding duplicated per-thread contexts (0:r0-0:r7, 1:r0-1:r7, 2:r0-2:r7, 3:r0-3:r7); a context pointer (CtxtPtr) selects the active thread's registers]
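A minimal sketch of the idea in C, assuming the 4-context, 8-register layout in the figure; the names (regfile, ctxt_ptr, read_reg) are illustrative, not any real machine's interface.

    #include <stdint.h>

    #define NUM_CTXTS     4   /* hardware thread contexts            */
    #define REGS_PER_CTXT 8   /* architectural registers per context */

    /* One flat register file holds all contexts back to back. */
    static uint64_t regfile[NUM_CTXTS * REGS_PER_CTXT];
    static unsigned ctxt_ptr;  /* CtxtPtr: the active context */

    /* Read architectural register r of the active thread. */
    static uint64_t read_reg(unsigned r) {
        return regfile[ctxt_ptr * REGS_PER_CTXT + r];
    }

    /* Switching threads only rewrites ctxt_ptr; no registers are
       saved or restored, hence the zero-overhead context switch. */
    static void context_switch(unsigned next_ctxt) {
        ctxt_ptr = next_ctxt;
    }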
Cycle Interleaving MT
• Per-cycle, per-thread instruction fetching (see the sketch after this list)
• Examples: HEP, Horizon, Tera MTA, MIT M-machine
• Interesting questions to consider
– Does it need a sophisticated branch predictor?
– Or does it need any speculative execution at all?
• Can we get rid of branch prediction?
• Can we get rid of predication?
– Does it need any out-of-order execution capability?
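A minimal sketch of per-cycle interleaving in C, assuming a barrel-style fetch stage; the thread count echoes the Tera MTA below, but the structures and names are illustrative, not any specific machine's design.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_THREADS 128

    typedef struct {
        uint64_t pc;
        bool     ready;   /* false while stalled on memory, etc. */
    } ThreadCtx;

    static ThreadCtx threads[NUM_THREADS];

    /* Each cycle, fetch one instruction from the next ready thread in
       round-robin order. With enough ready threads, a thread's branch
       resolves before it fetches again, so no branch prediction (or
       speculation at all) is required. */
    static int select_thread_this_cycle(int last) {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int t = (last + i) % NUM_THREADS;
            if (threads[t].ready)
                return t;          /* fetch from threads[t].pc */
        }
        return -1;                 /* no ready thread: pipeline bubble */
    }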
Tera Multi-Threaded Architecture
• Cycle-by-cycle interleaving
• The MTA can context-switch every cycle (3ns)
• As many as 128 distinct threads (hiding 384ns of latency)
• 3-wide VLIW instruction format (M + ALU + ALU/Br)
• Each instruction carries a 3-bit dependence lookahead field
– Indicates how many subsequent instructions are independent of it
– Allows up to 7 future VLIW instructions to execute before a switch
Loop:  nop      r1=r2+r3   r5=r6+4     lookahead=1  ; only the next instruction is independent (r1, r5 used two ahead)
       nop      r8=r9-r10  r11=r12-r13 lookahead=2  ; the next two instructions are independent
       [r5]=r1  r4=r4-1    bnz Loop    lookahead=0  ; nothing may issue past the branch
Block Interleaving MT
• Context switch on a specific event (dynamic pipelining)
– Explicit switching: implement a switch instruction
– Implicit switching: triggered when a specific instruction class is fetched
• Static switching (switch upon fetch)
– Switch-on-memory-instructions: Rhamma processor
– Switch-on-branch or switch-on-hard-to-predict-branch
– The trigger can be an implicit or explicit instruction
• Dynamic switching
– Switch-on-cache-miss (switch in a later pipeline stage): MIT Sparcle (MIT Alewife's node), Rhamma processor
– Switch-on-use (a lazy version of switch-on-cache-miss); see the sketch after this list
• Wait until the loaded value is actually used
• A valid bit is needed for each register: cleared when the load issues, set when the data returns
– Switch-on-signal (e.g., interrupt)
– Predicated switch instructions based on conditions
• No need to support a large number of threads
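A minimal sketch of switch-on-use in C, using the per-register valid bit described above; the structure and function names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_REGS 32

    typedef struct {
        uint64_t regs[NUM_REGS];
        bool     valid[NUM_REGS];  /* cleared on load issue, set on fill */
    } HwThread;

    /* Issue a load: clear the destination's valid bit and keep going;
       the lazy strategy does not switch on the miss itself. */
    static void issue_load(HwThread *t, int rd) {
        t->valid[rd] = false;
        /* ... send the request to the memory system ... */
    }

    /* Cache fill returns: write the data and set the valid bit. */
    static void load_fill(HwThread *t, int rd, uint64_t data) {
        t->regs[rd] = data;
        t->valid[rd] = true;
    }

    /* A register read switches threads only if the value is still
       outstanding, i.e., switch at the last minute, on use. */
    static bool read_or_switch(HwThread *t, int rs, uint64_t *val) {
        if (!t->valid[rs])
            return false;          /* caller context-switches */
        *val = t->regs[rs];
        return true;
    }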
NVidia Fermi GPGPU Architecture
Nvidia’s Streaming Multiprocessor (SM)
• SIMD execution model
• Issue one instruction from each warp to 16 CUDA cores
• One warp = 32 parallel threads (see the sketch below)
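A minimal sketch of this issue pattern in C: a 32-thread warp executes one instruction across 16 cores over two cycles. The names and loop structure are illustrative, not Fermi's actual scheduler.

    #define WARP_SIZE 32   /* parallel threads per warp        */
    #define NUM_CORES 16   /* CUDA cores fed by one issue slot */

    typedef struct { int pc; } Warp;

    /* One issue: the same instruction runs for all 32 threads,
       16 lanes at a time, so the warp occupies the cores for
       WARP_SIZE / NUM_CORES = 2 cycles. */
    static void issue_warp(Warp *w) {
        for (int cycle = 0; cycle < WARP_SIZE / NUM_CORES; cycle++) {
            for (int lane = 0; lane < NUM_CORES; lane++) {
                int tid = cycle * NUM_CORES + lane;
                (void)tid;   /* execute instruction at w->pc for thread tid */
            }
        }
        w->pc++;   /* all 32 threads advance in lockstep (SIMD) */
    }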
[Figure: SMT superscalar pipeline; per-thread PCs and rename tables feed shared fetch and decode stages, RS & ROB, and a shared physical register file, with function units ALU1, ALU2, FAdd, FMult (4 cycles), unpipelined FDiv (16 cycles), and Load/Store (variable latency), backed by an I-cache and D-cache]
Instruction Fetching Policy
• FIFO or round-robin: simple, but may be too naive
• Adaptive fetching policies (see the sketch after this list)
– BRCOUNT (reduce wrong-path issuing)
• Count the number of branch instructions in the decode/rename/IQ stages
• Give top priority to the thread with the lowest BRCOUNT
– MISSCOUNT (reduce IQ clog)
• Count the number of outstanding D-cache misses
• Give top priority to the thread with the lowest MISSCOUNT
– ICOUNT (reduce IQ clog)
• Count the number of instructions in the decode/rename/IQ stages
• Give top priority to the thread with the lowest ICOUNT
– IQPOSN (reduce IQ clog)
• Give lowest priority to threads whose instructions sit closest to the head of the INT or FP instruction queues
– Threads with the oldest instructions are the most prone to clogging the IQ
• No counters needed
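A minimal sketch of ICOUNT-style selection in C, assuming a per-thread count of instructions in the decode/rename/IQ stages; names are illustrative.

    #define NUM_THREADS 4

    /* icount[t] = instructions thread t has in decode/rename/IQ. */
    static int icount[NUM_THREADS];

    /* Each cycle, fetch from the thread with the fewest in-flight
       front-end instructions; a thread clogging the IQ automatically
       loses fetch priority until its instructions drain. */
    static int icount_pick(void) {
        int best = 0;
        for (int t = 1; t < NUM_THREADS; t++)
            if (icount[t] < icount[best])
                best = t;
        return best;
    }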
Resource Sharing
• Can be tricky when threads compete for resources
• Static partitioning
– Less complexity
– Can penalize threads (e.g., effective instruction window size)
– Used by the P4's Hyper-Threading
• Dynamic sharing
– More complex
– What is fair? How do we quantify fairness?
Alpha 21464 (EV8) Processor Architecture
• Chip characteristics
– ~1.2V Vdd
– ~250 million transistors
– ~1100 signal pins in flip-chip packaging
SMT Pipeline
[Figure: SMT pipeline; the PC and Icache feed fetch, a register map handles renaming, and the register file (Regs) is read before execution and written back after the Dcache access]
1000 Sun’s
Surface
Hot plate
Pentium III ® processor “Surpassed hot-plate power
10 Pentium II ® processor density in 0.5µm; Not too long
Pentium Pro ® processor to reach nuclear reactor,”
Former Intel Fellow Fred
i386 Pentium ® processor Pollack.
i486
1
1.5µ 1µ 0 .7µ 0.5 µ 0.35µ 0.25 µ 0.18µ 0.13µ 0.1µ 0.07µ
Latest Power Density Trend
Multi-core Processor Gala
Intel’s Multicore Roadmap
[Figure: Intel multicore roadmap across mobile, desktop, and enterprise lines; single-core parts (SC 512KB, 1MB, 1/2MB) give way to dual-core (DC 2MB, 2/4MB shared, 3MB/6MB shared at 45nm, 4MB, 16MB) and quad-core (QC 4MB, 8/16MB) parts, reaching 8 cores with 12MB shared cache at 45nm]
Intel TeraFlops Research Prototype
• 2KB Data Memory
• 3KB Instruction Memory
• No coherence support
• 2 FMACs
Intel Single-chip Cloud Computer (SCC)
Scalable many-core architecture
• Dual-core (P54C x86) tiles
• 24 tiles (48 cores)
• 4 DDR3 controllers
• 2D mesh network-on-chip (NoC)
Georgia Tech 64-Core 3D-MAPS Many-Core Chip
Is a Multi-core really better off?
[Photo: IBM's Deep Blue]
Major Challenges for Multi-Core Designs
• Communication
– Memory hierarchy
– Data allocation (you have a large shared L2/L3 now)
– Interconnection network
• AMD HyperTransport
• Intel QPI
– Scalability
– Bus bandwidth: how do we get there?
• Power-performance: win or lose? (see the sketch after this list)
– Borkar's multicore arguments
• A 15% per-core performance drop can buy roughly 50% power savings
• A giant single core wastes power when the task is small
– What about leakage?
• Process variation and yield
• Programming model
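A back-of-the-envelope version of Borkar's argument in C, assuming dynamic power scales as C·V²·f and that voltage scales down with frequency; the cube-law numbers below are illustrative, and the slide's ~50% figure depends on how far voltage actually drops.

    #include <stdio.h>

    int main(void) {
        double s = 0.85;               /* 15% per-core frequency (and V) drop */
        double core_power = s * s * s; /* V^2 * f gives ~0.61 of original     */
        printf("one scaled core:  %.0f%% perf, %.0f%% power\n",
               100 * s, 100 * core_power);
        printf("two scaled cores: %.0f%% peak throughput, %.0f%% power\n",
               100 * 2 * s, 100 * 2 * core_power);
        return 0;   /* ~170% throughput for ~123% of the power */
    }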
Intel Core 2 Duo
• Homogeneous cores
• Bus-based on-chip interconnect
• Shared on-die cache memory
• Traditional I/O
Core 2 Duo Microarchitecture
Why Sharing on-die L2?
Intel Core 2 Duo (Merom)
Core™ µArch – Wide Dynamic Execution
Core™ µArch – Macro Fusion
Smart Memory Access
Intel Quad-Core Processor (Kentsfield, Clovertown)
Source: Intel
AMD Quad-Core Processor (Barcelona)
[Figure callout: on a different power plane from the cores]
Source: AMD
Intel Penryn Dual-Core (First 45nm µprocessor)
Source: Intel
Intel Arrandale Processor
• 32nm
• Unified 3MB L3
• Power sharing (Turbo Boost) between cores and graphics via DFS
AMD 12-Core “Magny-Cours” Opteron
• 45nm
• 4 memory channels
Sun UltraSparc T1
• Eight cores, each 4-way threaded
• Fine-grained multithreading
– Thread-selection logic takes out threads that encounter long-latency events
– Otherwise round-robin, cycle-by-cycle
– 4 threads in a group share a processing pipeline (Sparc pipe)
• 1.2 GHz (90nm)
• In-order, 8 instructions per cycle across the chip (single issue from each core)
• Caches
– 16KB 4-way L1-I (32B lines)
– 8KB 4-way L1-D (16B lines)
– Blocking caches (a reason for MT)
– 4-banked 12-way 3MB L2, shared by all cores, plus 4 memory controllers
– Data moves between the L2 and the cores over an integrated crossbar switch for high throughput (200GB/s)
Sun UltraSparc T1
• Thread-select logic marks a thread inactive based on
– Instruction type
• A predecode bit in the I-cache flags long-latency instructions
– Misses
– Traps
– Resource conflicts
Sun UltraSparc T2
• A fatter version of the T1
• 1.4GHz (65nm)
• 8 threads per core, 8 cores on-die
• 1 FPU per core (vs. 1 FPU per die in T1), 16 integer EUs (8 in T1)
• L2 increased to an 8-banked 16-way 4MB shared cache
• 8-stage integer pipeline (vs. 6 for T1)
• 16 instructions per cycle
• One PCI Express port (x8, 1.0)
• Two 10 Gigabit Ethernet ports with packet classification and filtering
• Eight encryption engines
• Four dual-channel FBDIMM memory controllers
• 711 signal I/Os, 1831 pins total
STI Cell Broadband Engine
• Heterogeneous!
• 9 cores, 10 threads
• One 64-bit PowerPC core
• Eight SPEs
– In-order, dual-issue
– 128-bit SIMD
– 128 x 128-bit register file
– 256KB local store (fast local SRAM)
– Globally coherent DMA (128B/cycle)
– 128+ concurrent transactions to memory per core
• High bandwidth
– EIB (96B/cycle)
Cell Chip Block Diagram
[Figure: Cell chip block diagram, including the synergistic memory flow controller]
BACKUP
Non-Uniform Cache Architecture
• Proposed by UT-Austin at ASPLOS 2002
• Facts
– Large shared on-die L2
– Wire delay dominates on-die cache access time
Multi-banked L2 cache
[Figure: 2MB L2 at 130nm with 128KB banks; 11-cycle bank access]
Multi-banked L2 cache
[Figure: 16MB L2 at 50nm with 64KB banks and 47-cycle access; each bank comprises a data array, tag array, predecoder, wordline drivers and decoders, and sense amplifiers, connected by address and data buses]
• Uses a private per-bank channel
• Each bank has its own distinct access latency
• Data placement is statically determined by address (see the sketch below)
• Average access latency = 34.2 cycles
• Wire overhead = 20.9%, which is an issue
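A minimal sketch of static-NUCA bank selection in C: address bits pick the bank, and each bank has its own fixed latency. The mapping and latency table are illustrative.

    #include <stdint.h>

    #define NUM_BANKS 256   /* e.g., 16MB of 64KB banks */

    /* Per-bank latency grows with wire distance from the controller;
       the values would be filled in from the floorplan. */
    static int bank_latency[NUM_BANKS];

    /* Static NUCA: the bank is a fixed function of the address,
       so a given line always lives in the same bank. */
    static int bank_of(uint64_t addr) {
        return (int)((addr >> 6) & (NUM_BANKS - 1));  /* 64B lines */
    }

    static int access_latency(uint64_t addr) {
        return bank_latency[bank_of(addr)];  /* distinct per bank */
    }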
Static NUCA-2
[Figure: Static NUCA-2 organization; banks (tag array, predecoder, wordline driver and decoder) connect through per-bank switches to the data bus]