Summary: This paper proposes a 3D network-in-memory architecture for chip multiprocessors (CMPs) built around a dynamic time-division multiple-access (dTDMA) bus. The architecture stacks cache banks in 3D to reduce hit latency and improve performance. Caches are divided into clusters, each with its own tag array. Processors are placed on dedicated pillars for fast access to cache banks across layers. Intra-layer communication uses routers, while inter-layer communication uses the more efficient dTDMA bus. Thermal issues arising from 3D stacking are addressed through processor placement algorithms.


Design and Management of 3D CMPs using Network-in-Memory

Ashok Ayyamani
What is this paper about?
 Architecture for chip multiprocessors
 Large shared L2 caches
 Non-uniform access times
 Placement – 3D caches and CPUs
 Reduce the hit latency
 Improve IPC
 Interconnection of CPU and cache nodes
 Router + bus
 3D NoC-based non-uniform L2 cache architecture

NUCA
 Minimize hit time
 for large-capacity caches
 for highly-associative caches
 Each bank has its own distinct address and latency
 Faster access to closer banks
 NUCA variants
 Dynamic (frequently used data is placed closer to the CPU)
 Static (data placement depends on the address)

Network-in-Memory
 Why?
 Large caches increase hit times
 Divide them into banks
 Self-contained
 Individually addressable
 The banks must be interconnected efficiently
 Bus
 Networks-on-Chip

Interconnection with a bus
 As nodes are added, resource contention becomes an issue
 So performance degrades as the node count grows
 Not scalable!
 Transactional by nature
 Solution
 Networks-on-Chip

Networks-on-Chip
 An on-chip network
 Scalable
 Example – mesh
 Each node has a "link" to a dedicated router
 Each router has a "link" to its 4 neighbors
 "Link" – two unidirectional links, each as wide as the flit size (in this context)
 Flit – the unit of transfer into which packets are broken for transmission
 Bus or router?
 Hybrid?
 We will come back to this later; nothing is perfect.

Networks-on-Chip (figure)

3-D Design
 Problem with large Networks-on-Chip
 Many routers increase communication delay, even state-of-the-art ones
 The objective is to reduce the hop count, which is not always possible
 Solution
 Stack the banks in 3D, so that more banks are reachable within fewer hops than in 2D (see the sketch below)

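A back-of-the-envelope sketch of that claim, in Python. The 8×8 and 4×4×4 shapes are my own illustrative choice (not the paper's configuration), and a vertical hop is charged as one hop, which is pessimistic for a single-hop dTDMA pillar:

```python
# Illustrative sketch (not from the paper): average hop count over all node
# pairs in a 2D mesh vs. the same 64 nodes folded into four stacked layers.
# A vertical hop is counted as one hop.
from itertools import product

def avg_hops(dims):
    nodes = list(product(*(range(d) for d in dims)))
    total = pairs = 0
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            total += sum(abs(x - y) for x, y in zip(a, b))
            pairs += 1
    return total / pairs

print(avg_hops((8, 8)))     # 64 banks as an 8x8 2D mesh    -> ~5.33 hops
print(avg_hops((4, 4, 4)))  # same 64 banks as a 4x4x4 stack -> ~3.81 hops
```
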
Benefits – 3D
 Higher packaging density
 Higher performance
 due to reduced average interconnect length
 Lower interconnect power consumption
 due to reduced total wiring length

3D Technologies
 Wafer bonding
 Process the active device layers separately
 Interconnect them at the end
 This paper uses this technology
 Multi-Layer Buried Structures (MLBS)
 A front-end process (??) repeats on a single wafer to build multiple device layers
 A back-end process then builds the interconnects

Wafer Orientation
 Face-to-face
 Suitable for 2 layers
 More than 2 layers gets complicated
 Larger and longer vias
 Face-to-back
 More layers
 Reduced inter-layer via density

Wafer Orientation (figure)

3D – Issues
 Via insulation
 Inter-layer via pitch
 the state of the art is 0.2 × 0.2 µm² using Silicon-on-Insulator
 Via pads (endpoints) limit via density
 Bottom line
 Despite lower densities, vertical vias provide faster data transfers than 2D wire interconnects

Via pitch (figure)

From http://www.ltcc.de/en/whatis_des.php

A – via pad
E – via pitch
(other labels are not relevant here)

3D Network-in-Memory
 Very small distance between layers
 Router vs. bus
 Routers are multi-hop, and with more links (up and down) the blocking probability increases
 Solution – a single-hop communication medium: the bus!
 Intra-layer communication – routers
 Inter-layer communication – dTDMA bus

3D Network-in-Memory (figure)

dTDMA Bus
 Dynamic TDMA
 Dynamic allocation of time slots (see the sketch below)
 Provides rapid communication between the layers of the chip
 dTDMA bus interface
 Transmitter and receiver are connected to the bus through tri-state drivers
 The tri-state drivers are controlled by independently programmed feedback shift registers

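A hedged sketch of the dynamic slot-allocation idea. This is my reading of dTDMA, not code from the paper: the arbiter resizes the frame each round so that only the layers currently requesting the bus get slots:

```python
# Hedged sketch of dynamic TDMA: the frame shrinks or grows with the number
# of active requesters, so each transmitter gets one slot per frame and idle
# layers consume no bus bandwidth. Function and variable names are mine.
def dtdma_schedule(requests):
    """requests: iterable of layer ids asking for the bus this round."""
    active = sorted(set(requests))
    # Slot i of the frame is granted to active[i]; the frame is exactly
    # len(active) slots long, so no slot is ever wasted on an idle layer.
    return {slot: layer for slot, layer in enumerate(active)}

print(dtdma_schedule([2, 0]))     # {0: 0, 1: 2}        -> 2-slot frame
print(dtdma_schedule([3, 1, 0]))  # {0: 0, 1: 1, 2: 3}  -> 3-slot frame
```
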
Arbitration (figure)

Arbitration
 Each pillar needs an arbiter
 The arbiter should sit in the middle, so that wire distances are as uniform as possible
 The number of control wires increases with the number of layers
 So keep the number of layers to a minimum
 In the authors' experiments, the dTDMA bus was more efficient in both area and power than conventional NoC routers (tables on the next slide)

Arbitration (tables)

Area and Power Overhead of the dTDMA Bus
 The number of layers should be kept to a minimum, for the reasons on the previous slide
 Another reason – bus contention

Limitations
 The area occupied by a pillar is wasted device area
 Keep the number of inter-layer connections low
 This translates into fewer pillars
 With increasing via density, more vias become feasible
 But again, density is limited by the via pads (endpoints)
 Router complexity goes up
 More ports → increased blocking probability
 This paper uses normal routers (5 ports) plus hybrid routers (6 ports, for inter-layer traffic)
 The extra port is for the vertical link

NoC Router Architecture
 Generally, a router has
 a routing unit (RT)
 a virtual-channel allocation unit (VA)
 a switch allocation unit (SA)
 a crossbar (XBAR)
 In a mesh topology, there are 5 physical channels per processing element (PE)
 Virtual channels (FIFO buffers) hold flits from pending messages

NoC Router Architecture
 The paper uses
 3 VCs per physical channel, each 1 message deep
 Each message is 4 flits
 Width (b) of the router links is 128 bits
 4 flits/packet × 128 bits/flit = 512 bits/packet = 64 bytes/packet, so a 64-byte cache line fits in one packet (checked below)

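A trivial sanity check of that sizing, with the constants copied from the slide:

```python
# Link/packet sizing from the slide: a packet of four 128-bit flits carries
# exactly one 64-byte L2 cache line.
FLIT_BITS = 128
FLITS_PER_PACKET = 4

packet_bytes = FLIT_BITS * FLITS_PER_PACKET // 8
assert packet_bytes == 64
print(packet_bytes, "bytes per packet")
```
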
NoC Router Architecture
 The paper uses a single-stage router
 Generally, each stage takes one cycle
 So: 4 cycles in a generic router vs. 1 cycle in this paper
 Aggressive? Maybe
 Look-ahead routing and speculative channel allocation can reduce the cycle count
 Low latency is very important (rough latency model below)
 The routers connected to pillar nodes are different
 They have an extra physical channel, backed by FIFO buffers, for the vertical link
 The router simply sees it as one more physical channel

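To see why the single-stage router matters, here is a rough zero-load latency model. The formula (hops × per-hop router cycles + serialization of the trailing flits) is a textbook approximation I am assuming, not the paper's model, and the 5-hop path is arbitrary:

```python
# Assumed zero-load latency model: each hop costs the full router pipeline,
# plus the trailing flits of the packet drain one per cycle at the end.
def zero_load_latency(hops, router_cycles, flits_per_packet):
    return hops * router_cycles + (flits_per_packet - 1)

print(zero_load_latency(5, 4, 4))  # generic 4-stage router: 23 cycles
print(zero_load_latency(5, 1, 4))  # this paper's 1-stage router: 8 cycles
```
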
NoC Router Architecture (figure)

CPU Placement
 Each CPU has a dedicated pillar for fast inter-layer access
 CPUs could share pillars, but not in this paper
 So we assume instant access to the pillar and to all cache banks along it
 Memory locality + vertical locality

CPU Placement (figure)

CPU Placement
 Thermal issues
 A major problem in 3D
 CPUs consume most of the power
 So it makes sense not to place them on top of each other in the stack
 Congestion
 CPUs generate most of the L2 traffic (the rest is due to data migration)
 If we place them one above the other, they share the same pillar and congestion rises
 Maximal offsetting (see the sketch below)

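A hedged sketch of what maximal offsetting could look like. The paper gives its actual algorithm only as a figure (slide 32), so everything here, including the greedy farthest-site rule, is my own illustration:

```python
# Greedy illustration of maximal offsetting: never put a CPU directly above
# another, and place each new CPU at the grid site whose minimum Manhattan
# distance to all already-placed CPUs (on any layer) is largest.
def offset_placement(layers, grid, cpus_per_layer):
    placed = []  # list of (layer, x, y)
    for layer in range(layers):
        for _ in range(cpus_per_layer):
            used_xy = {(x, y) for _, x, y in placed}
            candidates = [(x, y) for x in range(grid) for y in range(grid)
                          if (x, y) not in used_xy]  # no vertical overlap
            best = max(candidates, key=lambda s: min(
                (abs(s[0] - x) + abs(s[1] - y) for _, x, y in placed),
                default=2 * grid))
            placed.append((layer, *best))
    return placed

print(offset_placement(layers=2, grid=4, cpus_per_layer=2))
```
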
CPU Placement (figure)

CPU Placement
 If via density is low
 Fewer pillars than CPU cores
 Sharing of pillars is inevitable
 Intelligent placement
 Not too far from the pillars (for fast pillar access)
 Minimal thermal effects

CPU Placement Algorithm (figure)

CPU Placement Algorithm
 k and c are parameters of the algorithm on the previous slide
 k = 1 in the experiments
 k can be increased at the expense of performance
 A lower c is desirable
 Less contention
 Better network performance
 The locations of the pillars are predetermined
 Pillars should be as far apart as possible, to reduce congested areas
 Not at the edges, because that would limit the number of cache banks around the pillars
 The placement pattern spans 4 layers, beyond which it repeats
 Thermal effects diminish with inter-layer distance

Thermal-Aware CPU Placement (figure)

Thermal Profile – Hotspots – HS3d (figure)

3D L2 Cache Management
 Cache banks are divided into clusters
 A cluster contains a set of cache banks
 A separate tag array covers all cache lines in the cluster
 All banks in a cluster are connected by an NoC
 The tag array has a direct connection to the processor array
 Clusters without a local processor have a customized logic block
 for receiving cache requests
 searching the tag array
 forwarding the request to the target cache bank

Cache Management Policies
 Cache Line Search
 Cache Placement
 Cache Replacement
 Cache Line Migration

Cache Line Search
 A two-step process (sketched below)
 (1) The processor searches the local tag array in its cluster
 It also sends the request to neighboring clusters (including the vertical neighbors, through the pillars)
 (2) If the line is not found in any of these places, the processor multicasts the request to the remaining clusters
 If the tag match fails in all clusters, it is counted as an L2 miss
 If there is a match, the corresponding data is routed to the requesting processor through the NoC

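Hedged pseudocode of the two-step search; the Cluster class and the function signature are minimal stand-ins of my own, not structures from the paper:

```python
# Two-step L2 search as described on the slide.
class Cluster:
    def __init__(self, lines):
        self.lines = lines              # addr -> data held in this cluster

    def tag_match(self, addr):
        return addr in self.lines

def l2_lookup(local, neighbours, remaining, addr):
    # Step 1: local cluster tags plus planar and vertical (pillar) neighbours.
    for cluster in [local] + neighbours:
        if cluster.tag_match(addr):
            return cluster.lines[addr]  # data routed back over the NoC
    # Step 2: multicast the request to every remaining cluster.
    for cluster in remaining:
        if cluster.tag_match(addr):
            return cluster.lines[addr]
    return None                         # no tag match anywhere: L2 miss

far = Cluster({0x40: "line"})
print(l2_lookup(Cluster({}), [Cluster({})], [far], 0x40))  # -> "line"
```
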
Placement and Replacement
 The low-order bits of the cache tag indicate the cluster
 The low-order bits of the cache index indicate the bank
 The remaining bits indicate the precise location within the bank
 A pseudo-LRU policy is used for replacement

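A minimal sketch of this static decode, assuming illustrative field widths (4 clusters, 16 banks per cluster); the paper does not specify these numbers here:

```python
# Address decode per the slide: low tag bits -> cluster, low index bits ->
# bank, remaining index bits -> location within the bank. Widths are assumed.
CLUSTER_BITS = 2   # 4 clusters (assumption)
BANK_BITS = 4      # 16 banks per cluster (assumption)

def locate(tag, index):
    cluster = tag & ((1 << CLUSTER_BITS) - 1)
    bank = index & ((1 << BANK_BITS) - 1)
    set_in_bank = index >> BANK_BITS
    return cluster, bank, set_in_bank

print(locate(tag=0b1011, index=0b110101))  # -> (3, 5, 3)
```
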
Cache Line Migration
 Intra-layer data migration (see the sketch after the next figure)
 Data is migrated to a cluster close to the accessing CPU
 Clusters that contain processors are skipped
 This prevents the migration from disturbing the L2 access patterns of the local CPU in that cluster
 Through repeated accesses, the data eventually migrates into the accessing processor's cluster

Intra-Layer Data Migration (figure)

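A hedged sketch of one migration step as just described. Modeling the clusters as a one-dimensional path, and the helper has_local_cpu, are my simplifications:

```python
# One intra-layer migration step: move the block one cluster toward the
# accessing CPU, skipping clusters that host a processor (assumed 1-D path).
def migrate_step(block_cluster, cpu_cluster, has_local_cpu):
    if block_cluster == cpu_cluster:
        return block_cluster             # already local: nothing to do
    step = 1 if cpu_cluster > block_cluster else -1
    nxt = block_cluster + step
    while nxt != cpu_cluster and has_local_cpu(nxt):
        nxt += step                      # skip processor-owning clusters
    return nxt

# Example: block in cluster 0, CPU in cluster 3, cluster 1 hosts a processor.
print(migrate_step(0, 3, has_local_cpu={1}.__contains__))  # -> 2
```
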
Cache Line Migration
 Inter-layer data migration
 Data is migrated closer to the pillar near the accessing CPU
 Assumption – clusters near the same pillar on different layers are considered local
 So no inter-layer data migration is needed
 This also helps reduce power

Inter-Layer Data Migration (figure)

Cache Line Migration
 Lazy migration (sketched below)
 Prevents false misses
 A false miss is a miss caused by a search for data that is in migration
 False misses occur when a few "hot" blocks are repeatedly accessed by multiple processors
 Solution – delay the migration by a few cycles
 Cancel the migration when a different processor accesses the same block

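A hedged sketch of lazy migration. The class structure is mine, and the concrete delay of 8 cycles is an assumption (the slide only says "a few"):

```python
# Lazy migration: a move is committed only after DELAY cycles with no access
# from a different CPU; an access by another CPU cancels it, so hot shared
# blocks are never caught mid-flight (which would cause false misses).
DELAY = 8  # "a few cycles" in the paper; 8 is an assumption

class LazyMigrator:
    def __init__(self):
        self.pending = {}  # block -> [requesting_cpu, cycles_waited]

    def on_access(self, block, cpu):
        entry = self.pending.get(block)
        if entry is None:
            self.pending[block] = [cpu, 0]   # tentatively schedule a move
        elif entry[0] != cpu:
            del self.pending[block]          # another CPU touched it: cancel

    def tick(self, migrate):
        for block in list(self.pending):
            self.pending[block][1] += 1
            cpu, waited = self.pending[block]
            if waited >= DELAY:
                del self.pending[block]
                migrate(block, toward=cpu)   # commit the delayed migration

# Usage: call on_access() per L2 access and tick() once per cycle, e.g.
# lm.tick(migrate=lambda b, toward: print("move", b, "toward CPU", toward))
```
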
Experiment Methodology
 Simics with a 3D NoC simulator
 8-processor CMP
 Solaris 9
 In-order issue, SPARC ISA
 Private L1 caches, large shared L2
 CACTI 3.2
 The dTDMA bus was integrated into the 2D NoC simulator as the vertical channel
 L1 cache-coherence traffic was taken into account

System Configuration (table)

Benchmarks
 Each application ran for 500 million cycles to warm up the L2
 Statistics were collected over the next 2 billion cycles
 The table shows L2 cache accesses

Results
 Legend
 CMP-DNUCA – conventional, with perfect search
 CMP-DNUCA-2D – CMP-DNUCA-3D restricted to a single layer
 CMP-DNUCA-3D – the proposed architecture, with data migration
 CMP-SNUCA-3D – the proposed architecture, without data migration

Average L2 Hit Latency (figure)

Number of Block Migrations (figure)

IPC (figure)

Average L2 Hit Latency under Different Cache Sizes (figure)

Effect of Number of Pillars (figure)

Impact of Number of Layers (figure)

Conclusion
 The 3D NoC architecture reduces average L2 access latency
 This improves IPC
 3D is better than 2D
 even without data migration
 Processor placement in 3D must consider thermal issues carefully
 The number of pillars must be chosen carefully
 It in turn affects congestion and bandwidth

Strengths
 Novel architecture
 Solves access-time issues
 Considers thermal issues
 and tries to mitigate them through careful CPU placement
 Hybrid network-in-memory
 Router + bus
 Adopts dTDMA for efficient channel usage

Weaknesses
 The paper assumes one CPU per pillar
 As the number of CPUs grows, this assumption may not hold
 Via density does not increase, so the number of pillars is fixed
 So CPUs may have to share pillars
 The paper does not discuss the effect of this pillar sharing on L2 latency
 Assumes a single-stage router
 May not always be practical or feasible
 Thermal-aware CPU placement
 What is the assumption about heat flow?
 Uniform, or not?

Things I did not understand
 MLBS

Next Paper Could Be
 L2 performance degradation due to pillar sharing
 Face-to-face wafer bonding
 Back-to-face bonding results in more wasted area
 MLBS
 Effect of router speed
 This paper assumed a single cycle for all four stages

Questions? ☺

Thank you
