Design and Management of 3D CMPs Using Network-in-Memory
Ashok Ayyamani
What is this paper about?
Architecture for chip multiprocessors (CMPs)
Large shared L2 caches
Non-uniform access times
Placement – 3D caches and CPUs
Reduce the hit latency
Improve IPC
Interconnection of CPU and cache nodes
Router + bus
2
NUCA (Non-Uniform Cache Architecture)
Minimize hit time for large-capacity, highly-associative caches
Each bank has its own distinct address and latency
Faster access to closer banks
Variants
Static (data placement depends on the address)
Dynamic (frequently accessed data is moved closer to the CPU)
3
Network-in-Memory
Why
Large caches increase hit times
Divide them into banks
Self-contained
Addressed individually
Banks must be interconnected efficiently
Bus
Networks-on-Chip
4
Interconnection with bus
With an increasing number of nodes, resource contention becomes an issue.
So performance degrades as we add nodes.
Not scalable!
Transactional by nature
Solution
Networks-on-Chip
5
Networks-on-Chip
On-chip network
Scalable
Example – mesh
Each node has a “link” to a dedicated router
Each router has a “link” to its 4 neighbors
“link” – two unidirectional links, each as wide as a flit (in this context)
Flit – the unit of transfer into which packets are broken for transmission
Bus or router?
Hybrid?
We will come back to this later. Nothing is perfect.
6
Networks-on-Chip
7
3-D Design
Problem with large Networks-on-Chip
Many routers increase communication delay, even with state-of-the-art routers.
The objective is to reduce hop count, which is not always possible.
Solution –
Stack the layers in 3D, so that more banks are accessible within fewer hops than in 2D (see the sketch below).
8
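To make the “more banks within fewer hops” claim concrete, here is a minimal sketch (not from the paper) that counts how many mesh nodes lie within a given hop radius of a corner node in a flat 2D mesh versus a 3D stack of smaller layers. The 64-bank total, the 8x8 vs. four 4x4 layouts, and the assumption of one hop per link (including the vertical crossing) are illustrative assumptions only.

```python
# Count cache banks reachable within `max_hops` of a CPU at (0, 0[, 0]),
# assuming Manhattan-distance routing and one hop per link (illustrative only).

def reachable_2d(width, height, max_hops):
    return sum(1 for x in range(width) for y in range(height)
               if x + y <= max_hops)

def reachable_3d(width, height, layers, max_hops):
    # Treat a vertical crossing as one hop; the paper's dTDMA pillar
    # actually makes vertical hops cheaper still.
    return sum(1 for x in range(width) for y in range(height) for z in range(layers)
               if x + y + z <= max_hops)

if __name__ == "__main__":
    # Same total of 64 banks: one 8x8 layer vs. four stacked 4x4 layers.
    for hops in (2, 3, 4):
        flat = reachable_2d(8, 8, hops)
        stacked = reachable_3d(4, 4, 4, hops)
        print(f"within {hops} hops: 2D mesh reaches {flat} banks, "
              f"3D stack reaches {stacked}")
```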
Benefits – 3D
Higher packaging density
Higher performance
due to reduced average interconnect length
Lower interconnect power consumption
due to reduced total wiring length
9
3D Technologies
Wafer Bonding
Process active device layers separately
Interconnect them at the end
This paper uses this technology
Multi-Layer Buried Structures (MLBS)
The front-end process repeats on a single wafer to build multiple device layers
The back-end process builds the interconnects
10
Wafer Orientation
Face-to-face
Suitable for 2 layers
Going beyond 2 layers gets complicated
Larger and longer vias
Face-to-back
More layers
Reduced inter-layer via density
11
Wafer Orientation
12
3D - Issues
Via insulation
Inter-layer via pitch
State of the art is 0.2 × 0.2 µm using Silicon-on-Insulator
Via pads (end points) limit via density
Bottom line
Despite lower densities, vertical vias provide faster data transfers than 2D wire interconnects
13
Via pitch
From http://www.ltcc.de/en/whatis_des.php
A – Via pad
E – Via Pitch
Other labels – not relevant here
14
3D Network-in-Memory
Very small distance between layers
Router vs. bus
Routers are multi-hop, and with additional links (up and down) the blocking probability increases.
Solution – a single-hop communication medium: a bus!
Intra-layer communication – routers
Inter-layer communication – dTDMA bus
15
3D Network-in-Memory
16
dTDMA Bus
Dynamic TDMA
Dynamic allocation of time slots (see the sketch below)
Provides rapid communication between the layers of the chip
dTDMA bus interface
Transmitter and receiver are connected to the bus through tri-state drivers
The tri-state drivers are controlled by independently programmed feedback shift registers
17
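As a rough illustration of what “dynamic allocation of time slots” means, the sketch below grows and shrinks the TDMA frame so that only currently active transmitters get slots. The class name, grant bookkeeping, and round-robin slot ordering are simplified assumptions, not the paper's exact shift-register-based arbiter.

```python
# Minimal sketch of dynamic TDMA (dTDMA) slot allocation on the vertical bus:
# the frame length always equals the number of active transmitters, so idle
# layers consume no bus bandwidth. Simplified; not the paper's exact arbiter.

class DTDMABus:
    def __init__(self):
        self.active = []          # layers currently granted a slot, in slot order

    def request(self, layer):
        # A layer that wants to transmit is given a new slot; the frame grows.
        if layer not in self.active:
            self.active.append(layer)

    def release(self, layer):
        # When a layer stops transmitting, its slot is reclaimed; the frame shrinks.
        if layer in self.active:
            self.active.remove(layer)

    def owner(self, cycle):
        # Which layer drives the bus in this cycle (round-robin over active slots).
        if not self.active:
            return None
        return self.active[cycle % len(self.active)]

if __name__ == "__main__":
    bus = DTDMABus()
    bus.request("layer0")
    bus.request("layer2")
    print([bus.owner(c) for c in range(4)])   # alternates layer0 / layer2
    bus.release("layer0")
    print([bus.owner(c) for c in range(4)])   # layer2 gets every slot
```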
Arbitration
18
Arbitration
Each pillar needs an arbiter
The arbiter should be in the middle layer so that wire distances are as uniform as possible
The number of control wires increases with the number of layers
So keep the number of layers to a minimum
In the authors' experiments, the dTDMA bus was more efficient in area and power than conventional NoC routers (tables on the next slide)
19
Arbitration
20
Area and Power Overhead of the dTDMA Bus
The number of layers should be kept to a minimum
for the reasons mentioned on the last slide
Another reason – bus contention
21
Limitations
Area occupied by a pillar is wasted device area
So keep the number of inter-layer connections low
This translates into fewer pillars
With increasing via density, more vias become feasible
But again, density is limited by via pads (endpoints)
Router complexity goes up
More ports means a higher blocking probability
This paper uses normal routers (5 ports) plus hybrid routers (6 ports, at inter-layer pillars)
The extra port is due to the vertical link
22
NoC Router Architecture
Generally, a router has
Routing Unit (RT)
Virtual Channel Allocation Unit (VA)
Switch Allocation Unit (SA)
Crossbar (XBAR)
In a mesh topology, there are 5 physical channels per processing element (PE)
Virtual channels, which are FIFO buffers, hold flits from pending messages
23
NoC Router Architecture
The paper uses
3 VCs per physical channel (PC), each 1 message deep
Each message is 4 flits
Width (b) of the router links is 128 bits
4 flits/packet × 128 bits/flit = 512 bits/packet = 64 bytes/packet, so a 64-byte cache line fits in one packet (see the sketch below)
24
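The arithmetic above maps directly onto packetization. Below is a small illustrative sketch of splitting a 64-byte cache line into four 128-bit flits; the function name and framing are assumptions for illustration, not the simulator's actual code.

```python
# Illustrative packetization matching the slide's numbers:
# 64-byte cache line -> 4 flits of 128 bits (16 bytes) each.

FLIT_BYTES = 128 // 8          # link width b = 128 bits
FLITS_PER_PACKET = 4           # one packet carries one 64-byte cache line

def packetize(cache_line: bytes):
    assert len(cache_line) == FLIT_BYTES * FLITS_PER_PACKET  # 64 bytes
    return [cache_line[i * FLIT_BYTES:(i + 1) * FLIT_BYTES]
            for i in range(FLITS_PER_PACKET)]

if __name__ == "__main__":
    line = bytes(range(64))
    flits = packetize(line)
    print(len(flits), "flits of", len(flits[0]), "bytes each")  # 4 flits of 16 bytes
```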
NoC Router Architecture
The paper uses
A single-stage router
Generally, a router takes one cycle per stage
So, 4 cycles in a conventional router vs. 1 cycle in this paper
Aggressive? Maybe
Look-ahead routing and speculative channel allocation can reduce this
Low latency is very important
Routers connected to pillar nodes are different
They have an extra physical channel that corresponds to the FIFO buffers of the vertical link
The router just sees it as an additional physical channel
25
NoC Router Architecture
26
CPU Placement
Each CPU has a dedicated pillar for fast inter-layer access
CPUs can share pillars, but not in this paper
So we assume instant access to the pillar and to all cache banks along the pillar
Memory locality + vertical locality
27
CPU Placement
28
CPU Placement
Thermal issues
A major problem in 3D
CPUs consume most of the power
So it makes sense not to place them on top of each other in the stack
Congestion
CPUs generate most of the L2 traffic (the rest is due to data migration)
If we place them one over the other, we get more congestion, since they would share the same pillar
Solution – maximal offsetting (see the sketch below)
29
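A minimal sketch of the “maximal offsetting” idea, assuming the only constraint checked is that no two CPUs in vertically adjacent layers share the same (x, y) position; the paper's actual placement algorithm (next slides) also weighs pillar distance and thermal coupling, which this sketch does not model.

```python
# Sketch of checking "maximal offsetting": CPUs in adjacent layers should not
# sit directly above one another (to avoid stacked hotspots and shared-pillar
# congestion). Placement format: {layer: set of (x, y) CPU positions}.

def offset_violations(placement):
    """Return pairs of stacked CPUs in vertically adjacent layers."""
    violations = []
    for layer in sorted(placement):
        if layer + 1 in placement:
            stacked = placement[layer] & placement[layer + 1]
            violations += [(layer, pos) for pos in stacked]
    return violations

if __name__ == "__main__":
    bad  = {0: {(0, 0), (4, 4)}, 1: {(0, 0), (2, 6)}}   # (0, 0) is stacked
    good = {0: {(0, 0), (4, 4)}, 1: {(2, 6), (6, 2)}}
    print(offset_violations(bad))    # [(0, (0, 0))]
    print(offset_violations(good))   # []
```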
CPU Placement
30
CPU Placement
If via density is low
Fewer pillars than CPU cores
Sharing of pillars is inevitable
Intelligent placement
Not too far from pillars (for faster pillar access)
Minimal thermal effects
31
CPU Placement Algorithm
32
CPU Placement Algorithm
k = 1 in the experiments
k can be increased at the expense of performance
A lower ‘c’ is desirable
Less contention
Better network performance
Pillar locations are predetermined
Pillars should be as far apart as possible to spread out congested areas
Not at the edges, because that would limit the number of cache banks around the pillars
The placement pattern spans 4 layers, beyond which it repeats
Thermal effects diminish with inter-layer distance
Thermal Aware CPU Placement
34
Thermal Profile – Hotspots – HS3d
35
3D L2 Cache Management
Cache banks are divided into clusters
A cluster contains a set of cache banks
A separate tag array holds the tags for all cache lines in the cluster
All banks in a cluster are connected by an NoC
The tag array has a direct connection to the processor
Clusters without a local processor have a customized logic block for
Receiving cache requests
Searching the tag array
Forwarding the request to the target cache bank
36
Cache Management Policies
Cache Line Search
Cache Placement
Cache Replacement
Cache Line Migration
37
Cache Line Search
Two-step process (see the sketch below)
(1) The processor searches the local tag array in its cluster
It also sends the request to neighboring clusters (including the vertical neighbors through the pillars)
(2) If the line is not found in any of these places, the processor multicasts the request to the remaining clusters
If the tag match fails in all clusters, it is considered an L2 miss
If there is a match, the corresponding data is routed to the requesting processor through the NoC
38
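A hedged sketch of the two-step search described above. The cluster objects, the neighbour map, and the tag-array lookup are stand-ins, since the slide only gives the protocol at this level of detail.

```python
# Sketch of the two-step L2 search: (1) local cluster plus its horizontal and
# vertical (pillar) neighbours, (2) multicast to all remaining clusters.
# Cluster objects and the neighbour map are hypothetical stand-ins.

class Cluster:
    def __init__(self, lines):
        self.lines = set(lines)       # addresses whose tags live in this cluster
    def tag_array_hit(self, addr):
        return addr in self.lines

def search_l2(addr, home, neighbours, all_clusters):
    # Step 1: local cluster and its neighbours (including vertical ones).
    first_wave = [home] + neighbours[home]
    for cluster in first_wave:
        if cluster.tag_array_hit(addr):
            return cluster            # data routed back over the NoC

    # Step 2: multicast to every remaining cluster.
    for cluster in all_clusters:
        if cluster not in first_wave and cluster.tag_array_hit(addr):
            return cluster

    return None                       # no tag match anywhere -> L2 miss

if __name__ == "__main__":
    c0, c1, c2 = Cluster([]), Cluster([0x40]), Cluster([])
    print(search_l2(0x40, c0, {c0: [c1]}, [c0, c1, c2]) is c1)   # True
```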
Placement and Replacement
The lower-order bits of the cache tag indicate the cluster
The lower-order bits of the cache index indicate the bank
The remaining bits indicate the precise location within the bank (see the address-splitting sketch below)
A pseudo-LRU policy is used for replacement
39
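To make the bit-slicing concrete, here is a small sketch assuming a 64-byte line and illustrative field widths (2 tag bits for 4 clusters, 2 index bits for 4 banks per cluster, a toy 10-bit index). The real widths depend on the cache geometry, which the slide does not give.

```python
# Illustrative address decomposition for static placement:
#   lower-order tag bits   -> cluster
#   lower-order index bits -> bank
#   remaining index bits   -> set within the bank
# Field widths below are assumptions for a toy 4-cluster, 4-banks-per-cluster cache.

OFFSET_BITS  = 6    # 64-byte cache line
INDEX_BITS   = 10   # toy value: 1024 sets before bank slicing
BANK_BITS    = 2    # 4 banks per cluster  (low bits of the index)
CLUSTER_BITS = 2    # 4 clusters           (low bits of the tag)

def place(addr):
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag   =  addr >> (OFFSET_BITS + INDEX_BITS)
    cluster     = tag   & ((1 << CLUSTER_BITS) - 1)
    bank        = index & ((1 << BANK_BITS) - 1)
    set_in_bank = index >> BANK_BITS
    return cluster, bank, set_in_bank

if __name__ == "__main__":
    print(place(0x1234_5678))   # (cluster, bank, set) for one example address
```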
Cache Line Migration
Intra-Layer Data Migration
Data is migrated to a cluster close to the accessing CPU
Clusters that host processors are skipped
This is done to avoid disturbing the L2 access patterns of the local CPU in that cluster
Eventually, because of repeated accesses, the data migrates into the accessing processor's cluster
40
Intra Layer Data Migration
41
Cache Line Migration
Inter-Layer Data Migration
Data is migrated closer to the pillar near the accessing CPU
Assumption – clusters near the same pillar in different layers are considered local
So no inter-layer data migration is performed
This also helps reduce power
42
Inter Layer Data Migration
43
Cache Line Migration
Lazy Migration
Used to prevent false misses
False misses are misses caused by searching for data that is in the middle of migrating
False misses occur because a few “hot” blocks are repeatedly accessed by multiple processors
Solution – delay the migration by a few cycles (see the sketch below)
Cancel the migration when a different processor accesses the same block
44
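A minimal sketch of lazy migration, assuming a fixed delay counter per pending migration; the actual delay value and the cancellation bookkeeping in the paper may differ.

```python
# Sketch of lazy migration: a requested migration is queued for a few cycles
# instead of happening immediately, and is cancelled if another processor
# touches the same block in the meantime (avoiding "false misses" on hot blocks).
# The delay of 4 cycles is an illustrative assumption.

MIGRATION_DELAY = 4

class LazyMigrator:
    def __init__(self):
        self.pending = {}                     # block -> (requesting cpu, cycles left)

    def on_access(self, block, cpu):
        owner = self.pending.get(block)
        if owner and owner[0] != cpu:
            del self.pending[block]           # hot block: cancel the migration
        else:
            self.pending[block] = (cpu, MIGRATION_DELAY)

    def tick(self):
        done = []
        for block, (cpu, left) in list(self.pending.items()):
            if left == 1:
                done.append((block, cpu))     # migration actually performed now
                del self.pending[block]
            else:
                self.pending[block] = (cpu, left - 1)
        return done

if __name__ == "__main__":
    m = LazyMigrator()
    m.on_access("blkA", cpu=0)
    m.on_access("blkA", cpu=1)                # second CPU cancels the migration
    print([m.tick() for _ in range(MIGRATION_DELAY)])  # no migrations happen
```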
Experiment Methodology
Simics with a 3D NoC simulator
8-processor CMP
Solaris 9
In-order issue, SPARC ISA
Private L1 caches, large shared L2
CACTI 3.2
The dTDMA bus was integrated into the 2D NoC simulator as the vertical channel
L1 cache coherence traffic was taken into account
45
System Configuration
46
Benchmarks
Each application was run for 500 million cycles for L2 warm-up
Statistics were collected for the next 2 billion cycles
The table shows L2 cache accesses
47
Results
Legend
CMP-DNUCA – conventional design with perfect search
CMP-DNUCA-2D – CMP-DNUCA-3D with a single layer
CMP-DNUCA-3D – proposed architecture with data migration
CMP-SNUCA-3D – proposed architecture without data migration
48
Average L2 Hit Latency
49
Number of block Migrations
50
IPC
51
Average L2 Hit Latency under Different Cache Sizes
52
Effect of Number of Pillars
53
Impact of Number of Layers
54
Conclusion
The 3D NoC architecture reduces average L2 access latency
This improves IPC
3D is better than 2D
even without data migration
Processor placement in 3D needs to consider thermal issues carefully
The number of pillars should be chosen carefully
This in turn affects congestion and bandwidth
55
Strengths
Novel architecture
Solves access-time issues
Considers thermal issues
and tries to mitigate them through careful CPU placement
Hybrid Network-in-Memory
Router + bus
Adopts dTDMA for efficient channel usage
56
Weaknesses
The paper assumes one CPU per pillar
As the number of CPUs increases, this assumption may not hold
Via density does not increase, hence the number of pillars is fixed
So CPUs may have to share pillars
The paper does not discuss the effect of this pillar sharing on L2 latency
Assumes a single-stage router
May not always be practical/feasible
Thermal-aware CPU placement
What is the assumption about heat flow?
Uniform? Or not?
57
Things I did not understand
MLBS
58
Possible topics for the next paper
L2 performance degradation due to sharing of pillars
Face-to-face wafer bonding
Face-to-back bonding results in more wasted area
MLBS
Effect of router speed
This paper assumed a single cycle for all four router stages
59
Questions
Thank you
60