l22 Vector
Vector Computers
Joel Emer
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Supercomputers
Definition of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an I/O-bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray
Supercomputer Applications
Vector Supercomputers
Epitomized by Cray-1, 1976:
• Scalar Unit
– Load/Store Architecture
• Vector Extension
– Vector Registers
– Vector Instructions
• Implementation
– Hardwired Control
– Highly Pipelined Functional Units
– Interleaved Memory System
– No Data Caches
– No Virtual Memory
Joel Emer
November 30, 2005
[Figure: Cray-1 block diagram]
• Registers
– 8 vector registers (V0-V7), 64 elements each, plus vector mask and vector length registers
– 8 scalar registers (S0-S7) and 64 T registers
– 8 address registers (A0-A7) and 64 B registers
• Functional units
– FP Add, FP Mul, FP Recip
– Int Add, Int Logic, Int Shift, Pop Cnt
– Addr Add, Addr Mul
• Memory
– Single port, 16 banks of 64-bit words, 8-bit SECDED
– 80 MW/sec data load/store
– 320 MW/sec instruction buffer refill
• Instruction fetch
– 4 instruction buffers (64-bit x 16)
– NIP, CIP, and LIP registers
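[Figure: vector programming model]
• Scalar registers (starting at r0) and vector registers (starting at v0), each vector register holding elements [0] through [VLRMAX-1]
• Vector Length Register (VLR)
• Vector arithmetic instructions: ADDV v3, v1, v2 adds v1 and v2 elementwise into v3 over elements [0] through [VLR-1]
• Vector memory access: base address in r1, stride in r2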
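The programming model above (vector registers of up to VLRMAX elements, a vector length register, ADDV, and base/stride memory access) can be sketched in C as follows. This is only an illustrative software model, not a description of any real ISA: the names vreg_t, lv, and addv are hypothetical, VLRMAX is assumed to be 64, and base and stride are counted in elements rather than bytes.

```c
#include <stddef.h>

#define VLRMAX 64  /* assumed maximum vector length */

/* Hypothetical model of one vector register. */
typedef struct {
    double elem[VLRMAX];
} vreg_t;

/* Strided vector load: v[i] = mem[base + i*stride] for i in [0, vlr).
   base and stride are in units of elements (an assumption). */
void lv(vreg_t *v, const double *mem, size_t base, size_t stride, size_t vlr)
{
    for (size_t i = 0; i < vlr && i < VLRMAX; i++)
        v->elem[i] = mem[base + i * stride];
}

/* ADDV v3, v1, v2: elementwise add over elements [0, vlr). */
void addv(vreg_t *v3, const vreg_t *v1, const vreg_t *v2, size_t vlr)
{
    for (size_t i = 0; i < vlr && i < VLRMAX; i++)
        v3->elem[i] = v1->elem[i] + v2->elem[i];
}
```

Elements at index VLR and above are left untouched, which is how the vector length register lets one instruction handle vectors shorter than the register.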
6.823, L22-9
[Figure: vector arithmetic execution, v3 <- v1 * v2]
ADDV C,A,B
[Figure: execution of ADDV C,A,B using one functional unit, where operand pairs A[3],B[3] through A[6],B[6] enter one at a time, versus four functional units, where pairs A[12..15], A[16..19], A[20..23], A[24..27] with the matching B elements enter four at a time]
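[Figure: vector memory system; an address generator combines base and stride to produce the addresses for vector register loads and stores, which are spread across 16 memory banks (0 through F)]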
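To see why the stride matters with 16 interleaved banks, the sketch below counts how many distinct banks a strided vector access touches. It assumes simple low-order interleaving (bank = word address mod 16), which is an assumption for illustration; bank_of and distinct_banks are hypothetical helper names.

```c
#include <stddef.h>

#define NUM_BANKS 16  /* banks 0 through F */

/* Assumed low-order interleaving: consecutive words map to consecutive banks. */
unsigned bank_of(unsigned long word_addr)
{
    return (unsigned)(word_addr % NUM_BANKS);
}

/* Number of distinct banks touched by n accesses at base, base+stride, ... */
unsigned distinct_banks(unsigned long base, unsigned long stride, size_t n)
{
    unsigned seen[NUM_BANKS] = {0};
    unsigned count = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned b = bank_of(base + i * stride);
        if (!seen[b]) {
            seen[b] = 1;
            count++;
        }
    }
    return count;
}
```

Under this model a unit stride spreads a 64-element access over all 16 banks, a stride of 2 uses only 8 of them, and a stride of 16 sends every access to the same bank.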
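[Figure: vector unit structure; the vector registers are partitioned into four lanes, each connected to the memory subsystem]
• Lane 0 holds elements 0, 4, 8, …
• Lane 1 holds elements 1, 5, 9, …
• Lane 2 holds elements 2, 6, 10, …
• Lane 3 holds elements 3, 7, 11, …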
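The element-to-lane mapping in the figure (elements 0, 4, 8, … in one lane, 1, 5, 9, … in the next, and so on) can be written down directly. The sketch below assumes four lanes as drawn; lane_of, steps_needed, and addv_lanes are hypothetical names, and the inner loop stands in for work that hardware lanes would do in parallel.

```c
#include <stddef.h>

#define NUM_LANES 4  /* as drawn in the figure */

/* Element i lives in lane i mod NUM_LANES. */
size_t lane_of(size_t i)
{
    return i % NUM_LANES;
}

/* Number of element groups needed to cover a vector of length vl. */
size_t steps_needed(size_t vl)
{
    return (vl + NUM_LANES - 1) / NUM_LANES;
}

/* Elementwise add; each outer iteration models one group of NUM_LANES
   elements, each inner iteration models one lane's share of that group. */
void addv_lanes(double *c, const double *a, const double *b, size_t vl)
{
    for (size_t base = 0; base < vl; base += NUM_LANES)
        for (size_t lane = 0; lane < NUM_LANES && base + lane < vl; lane++)
            c[base + lane] = a[base + lane] + b[base + lane];
}
```

With this interleaving, a 64-element vector needs 16 groups on four lanes instead of 64 steps on one.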
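[Figure: vector chaining]
LV v1
MULV v3,v1,v2
ADDV v5, v3, v4
• The load unit writes v1 from memory; its results chain into the multiplier
• The multiplier writes v3; its results chain into the adder, which writes v5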
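[Figure: timelines of Load, Mul, and Add for a dependent instruction sequence, drawn once without chaining and once with chaining]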
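A rough cycle count makes the benefit of chaining for a dependent sequence such as LV, MULV, ADDV concrete. The model below is a deliberate simplification and an assumption, not a description of any particular machine: each instruction has a pipeline startup latency and then produces one element per cycle, each runs on its own unit, and dead time is ignored. The function names and any latency values passed in are hypothetical.

```c
#include <stddef.h>

/* Without chaining: each dependent instruction starts only after the
   previous one has produced its last element. */
unsigned long cycles_unchained(const unsigned *lat, size_t n, unsigned vl)
{
    unsigned long total = 0;
    for (size_t k = 0; k < n; k++)
        total += (unsigned long)lat[k] + vl;
    return total;
}

/* With chaining: each dependent instruction starts as soon as the first
   element of the previous result is available. */
unsigned long cycles_chained(const unsigned *lat, size_t n, unsigned vl)
{
    if (n == 0)
        return 0;
    unsigned long total = vl;
    for (size_t k = 0; k < n; k++)
        total += lat[k];
    return total;
}
```

With made-up latencies of 12, 7, and 6 cycles and 64-element vectors, this model gives 217 cycles unchained and 89 cycles chained.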
[Figure: pipeline diagram (R X X X W per element) of two back-to-back vector instructions; dead time separates the last element of the first instruction from the first element of the second vector instruction]
No dead time
64 cycles active
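[Figure: scalar sequential code issues load, load, add, store for iteration 1, then again for iteration 2; vectorized code issues each of load, load, add, store once as a vector instruction covering all iterations]
Vectorization is a massive compile-time reordering of operation sequencing
⇒ requires extensive loop dependence analysis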
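The reordering can be illustrated with an elementwise add loop, chosen here only as an example. The scalar version interleaves load, load, add, store per iteration; the strip-mined version performs all loads, then all adds, then all stores for each strip of up to MVL elements. MVL = 64 and the function names are assumptions, and the plain inner loops stand in for vector instructions.

```c
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length */

/* Scalar order: load, load, add, store within each iteration. */
void add_scalar(double *c, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Vector order: per strip, all loads, then all adds, then all stores. */
void add_stripmined(double *c, const double *a, const double *b, size_t n)
{
    double va[MVL], vb[MVL], vc[MVL];
    for (size_t start = 0; start < n; start += MVL) {
        size_t vl = (n - start < MVL) ? n - start : MVL;
        for (size_t i = 0; i < vl; i++) va[i] = a[start + i];    /* vector load  */
        for (size_t i = 0; i < vl; i++) vb[i] = b[start + i];    /* vector load  */
        for (size_t i = 0; i < vl; i++) vc[i] = va[i] + vb[i];   /* vector add   */
        for (size_t i = 0; i < vl; i++) c[start + i] = vc[i];    /* vector store */
    }
}
```

The two versions agree only when no iteration depends on a value stored by an earlier one (for instance, when c does not overlap a or b at an offset), which is the property the dependence analysis has to establish.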
Vector Scatter/Gather
Scatter example:
for (i=0; i<N; i++)
A[B[i]]++;
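One way to picture how the loop above maps onto indexed vector memory operations is gather, add, scatter per strip. The C below is a sketch under assumptions: MVL = 64, the function names are hypothetical, and the inner loops stand in for indexed vector load, vector add, and indexed vector store.

```c
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length */

/* Reference scalar loop from the slide. */
void increment_scalar(int *A, const size_t *B, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[B[i]]++;
}

/* Gather A[B[i]] into a vector, add 1, scatter back. */
void increment_gather_scatter(int *A, const size_t *B, size_t n)
{
    int v[MVL];
    for (size_t start = 0; start < n; start += MVL) {
        size_t vl = (n - start < MVL) ? n - start : MVL;
        for (size_t i = 0; i < vl; i++) v[i] = A[B[start + i]];   /* gather  */
        for (size_t i = 0; i < vl; i++) v[i] += 1;                /* add     */
        for (size_t i = 0; i < vl; i++) A[B[start + i]] = v[i];   /* scatter */
    }
}
```

The two functions match when the indices within a strip are distinct. If B repeats an index inside one strip, the gather reads the old value for every copy, so the element is incremented once rather than once per occurrence.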
[Figure: masked vector instruction execution; mask bits M[0]=0, M[1]=1, M[2]=0 control which result elements C[i] are written]
Compress/Expand Operations
• Compress packs non-masked elements from one vector register
contiguously at start of destination vector register
– population count of mask vector gives packed vector length
• Expand performs inverse operation
[Figure: Compress and Expand]
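The compress and expand operations described above can be sketched in C. This assumes that "non-masked" means the mask entry is nonzero and that the mask is stored one byte per element; vcompress and vexpand are hypothetical names. The return value of vcompress equals the population count of the mask, which is the packed vector length.

```c
#include <stddef.h>

/* Pack the elements of src whose mask entry is nonzero contiguously at the
   start of dst. Returns the packed length (population count of the mask). */
size_t vcompress(double *dst, const double *src,
                 const unsigned char *mask, size_t vl)
{
    size_t k = 0;
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[k++] = src[i];
    return k;
}

/* Inverse: take packed elements from the start of src and place them at
   the positions of dst whose mask entry is nonzero. Other positions of dst
   are left unchanged. Returns the number of elements consumed. */
size_t vexpand(double *dst, const double *src,
               const unsigned char *mask, size_t vl)
{
    size_t k = 0;
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[i] = src[k++];
    return k;
}
```

Compressing and then expanding with the same mask restores the selected elements to their original positions.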
Vector Reductions
• CMOS Technology
– 500 MHz CPU, fits on single chip
– SDRAM main memory (up to 64GB)
• Scalar unit
– 4-way superscalar with out-of-order and speculative execution
– 64KB I-cache and 64KB data cache
• Vector unit
– 8 foreground VRegs + 64 background VRegs (256x64-bit elements/VReg)
– 1 multiply unit, 1 divide unit, 1 add/shift unit, 1 logical unit, 1 mask unit
– 8 lanes (8 GFLOPS peak, 16 FLOPS/cycle)
– 1 load & store unit (32x8 byte accesses/cycle)
– 32 GB/s memory bandwidth per processor
• SMP structure
– 8 CPUs connected to memory through crossbar
– 256 GB/s shared memory bandwidth (4096 interleaved banks)
Image removed due to copyright restrictions. Image available in Kitagawa, K., S. Tagaya, Y. Hagihara, and Y. Kanoh. "A hardware overview of SX-6 and SX-7 supercomputer." NEC Research & Development Journal 44, no. 1 (Jan 2003): 2-7.
Multimedia Extensions