Computer Organization & Architecture

This document discusses using data-level parallelism through vectorization to improve the performance of operations that are repeated on multiple data elements. It provides examples of extending an instruction set architecture with vector storage and operations such as load vector, add two vectors, and store vector. Implementing vector instructions can improve performance by executing the same operation on multiple data elements with a single instruction. Popular vector instruction sets include Intel SSE and AMD 3DNow!. Programmers can take advantage of vectors by writing intrinsic functions, using vector libraries, or relying on compiler auto-vectorization.


CIS 501: Computer Architecture (Martin/Roth)
Unit 12: Vectors

Best Way to Compute This Fast?

• Sometimes you want to perform the same operations on many data items:

  for (I = 0; I < 1024; I++)
      Z[I] = A*X[I] + Y[I];

• Surprise example: SAXPY (single-precision A times X plus Y)

  0: ldf X(r1),f1     // I is in r1
     mulf f0,f1,f2    // A is in f0
     ldf Y(r1),f3
     addf f2,f3,f4
     stf f4,Z(r1)
     addi r1,4,r1
     blti r1,4096,0

• One approach: superscalar (instruction-level parallelism)
  • Loop unrolling with static scheduling – or – dynamic scheduling
  • Problem: wide-issue superscalar scaling issues
    • N² bypassing, N² dependence check, wide fetch
    • More register file & memory traffic (ports)
• Can we do better?
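The superscalar approach mentioned above depends on exposing independent work to the scheduler. A minimal C sketch of 4-way unrolled SAXPY (the function name is illustrative, not from the slides):

```c
#include <stddef.h>

/* SAXPY unrolled 4x: the four multiply-adds per iteration are
   independent, so a wide-issue superscalar core can schedule them
   in parallel. Assumes n is a multiple of 4 for brevity. */
void saxpy_unrolled(float a, const float *x, const float *y,
                    float *z, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        z[i]     = a * x[i]     + y[i];
        z[i + 1] = a * x[i + 1] + y[i + 1];
        z[i + 2] = a * x[i + 2] + y[i + 2];
        z[i + 3] = a * x[i + 3] + y[i + 3];
    }
}
```

Note that unrolling only exposes the parallelism; the hardware still fetches, checks, and bypasses each instruction individually, which is exactly the N² scaling problem the slide points out.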

Better Alternative: Data-Level Parallelism

• Data-level parallelism (DLP)
  • Single operation repeated on multiple data elements
  • SIMD (Single-Instruction, Multiple-Data)
  • Less general than ILP: parallel insns are all the same operation
  • Exploit with vectors
• Old idea: Cray-1 supercomputer from late 1970s
  • Eight 64-entry x 64-bit floating point "vector registers"
    • 4096 bits (0.5KB) in each register! 4KB for vector register file
  • Special vector instructions to perform vector operations
    • Load vector, store vector (wide memory operations)
    • Vector+Vector addition, subtraction, multiply, etc.
    • Vector+Constant addition, subtraction, multiply, etc.
    • In Cray-1, each instruction specifies 64 operations!

Example Vector ISA Extensions

• Extend ISA with floating point (FP) vector storage …
  • Vector register: fixed-size array of 32- or 64-bit FP elements
  • Vector length: for example 4, 8, 16, 64, …
• … and example operations for vector length of 4
  • Load vector: ldf.v X(r1),v1
      ldf X+0(r1),v1[0]
      ldf X+1(r1),v1[1]
      ldf X+2(r1),v1[2]
      ldf X+3(r1),v1[3]
  • Add two vectors: addf.vv v1,v2,v3
      addf v1[i],v2[i],v3[i]  (where i is 0,1,2,3)
  • Add vector to scalar: addf.vs v1,f2,v3
      addf v1[i],f2,v3[i]  (where i is 0,1,2,3)
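The element-wise expansions above can be modeled directly in C. This sketch (function names are hypothetical) treats a vector register as a 4-element array, mirroring the addf.vv and addf.vs semantics:

```c
#define VLEN 4  /* vector length from the example ISA */

/* addf.vv v1,v2,v3 : element-wise add of two vector registers */
void addf_vv(const float v1[VLEN], const float v2[VLEN], float v3[VLEN]) {
    for (int i = 0; i < VLEN; i++)
        v3[i] = v1[i] + v2[i];
}

/* addf.vs v1,f2,v3 : add scalar f2 to every element of v1 */
void addf_vs(const float v1[VLEN], float f2, float v3[VLEN]) {
    for (int i = 0; i < VLEN; i++)
        v3[i] = v1[i] + f2;
}
```

In hardware the loop disappears: one vector instruction encodes all VLEN element operations, which is the source of the fetch and decode savings discussed next.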
Example Use of Vectors – 4-wide

  Scalar                  Vector
  ldf X(r1),f1            ldf.v X(r1),v1
  mulf f0,f1,f2           mulf.vs v1,f0,v2
  ldf Y(r1),f3            ldf.v Y(r1),v3
  addf f2,f3,f4           addf.vv v2,v3,v4
  stf f4,Z(r1)            stf.v v4,Z(r1)
  addi r1,4,r1            addi r1,16,r1
  blti r1,4096,0          blti r1,4096,0
  7x1024 instructions     7x256 instructions

• Operations (4x fewer instructions)
  • Load vector: ldf.v X(r1),v1
  • Multiply vector by scalar: mulf.vs v1,f2,v3
  • Add two vectors: addf.vv v1,v2,v3
  • Store vector: stf.v v1,X(r1)
• Performance?
  • If CPI is one, 4x speedup
  • But vector instructions don't always have single-cycle throughput
  • Execution width (implementation) vs vector width (ISA)

Vector Datapath & Implementation

• Vector insns are just like normal insns… only "wider"
  • Single instruction fetch (no extra N² checks)
  • Wide register read & write (not multiple ports)
  • Wide execute: replicate floating point unit (same as superscalar)
  • Wide bypass (avoid N² bypass problem)
  • Wide cache read & write (single cache tag check)
• Execution width (implementation) vs vector width (ISA)
  • Example: Pentium 4 and "Core 1" execute vector ops at half width
  • "Core 2" executes them at full width
• Because they are just instructions…
  • …superscalar execution of vector instructions is common
  • Multiple n-wide vector instructions per cycle

Intel's SSE2/SSE3/SSE4…

• Intel SSE2 (Streaming SIMD Extensions 2) – 2001
  • 16 128-bit floating point registers (xmm0–xmm15)
  • Each can be treated as 2x64b FP or 4x32b FP ("packed FP")
  • Or 2x64b or 4x32b or 8x16b or 16x8b ints ("packed integer")
  • Or 1x64b or 1x32b FP (just normal scalar floating point)
  • Original SSE: only 8 registers, no packed integer support
• Other vector extensions
  • AMD 3DNow!: 64b (2x32b)
  • PowerPC AltiVec/VMX: 128b (2x64b or 4x32b)
• Looking forward for x86
  • Intel's "Sandy Bridge" will bring 256-bit vectors to x86
  • Intel's "Larrabee" graphics chip will bring 512-bit vectors to x86

Other Vector Instructions

• These target specific domains: e.g., image processing, crypto
  • Vector reduction (sum all elements of a vector)
  • Geometry processing: 4x4 translation/rotation matrices
  • Saturating (non-overflowing) subword add/sub: image processing
  • Byte asymmetric operations: blending and composition in graphics
  • Byte shuffle/permute: crypto
  • Population (bit) count: crypto
  • Max/min/argmax/argmin: video codec
  • Absolute differences: video codec
  • Multiply-accumulate: digital-signal processing
• More advanced (but in Intel's Larrabee)
  • Scatter/gather loads: indirect load (or store) through a vector of pointers
  • Vector mask: predication (conditional execution) of specific elements
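The saturating subword add listed above is directly available as an SSE2 packed-integer intrinsic. A sketch, assuming an x86 compiler where SSE2 is present (the wrapper function name is illustrative):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Saturating add of 16 unsigned bytes: sums that exceed 255 clamp
   at 255 instead of wrapping around -- the behavior you want when,
   e.g., brightening an image. One instruction covers 16 pixels. */
void add_sat_u8(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vr = _mm_adds_epu8(va, vb);   /* clamps, never overflows */
    _mm_storeu_si128((__m128i *)out, vr);
}
```

With ordinary wrapping addition, 200 + 100 on a byte would give 44; the saturating form gives 255, which is why image-processing kernels use it.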
Using Vectors in Your Code

• Write in assembly
  • Ugh
• Use "intrinsic" functions and data types
  • For example: _mm_mul_ps() and the "__m128" datatype
• Use a library someone else wrote
  • Let them do the hard work
  • Matrix and linear algebra packages
• Let the compiler do it (automatic vectorization)
  • GCC's "-ftree-vectorize" option
  • Doesn't yet work well for C/C++ code (old, very hard problem)
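One reason auto-vectorization is hard in C is pointer aliasing: the compiler must prove that writing z[i] cannot change a later x[i] or y[i]. A sketch of how restrict qualifiers help GCC's vectorizer (enabled by -ftree-vectorize, or at -O3):

```c
#include <stddef.h>

/* Written to be auto-vectorizable: restrict promises the compiler
   that z never aliases x or y, so GCC (-O3, or -O2 with
   -ftree-vectorize) is free to emit SIMD instructions for the
   loop body instead of one scalar multiply-add per element. */
void saxpy_auto(float a, const float *restrict x,
                const float *restrict y, float *restrict z, size_t n) {
    for (size_t i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}
```

The result is the same as the hand-written intrinsics version, but portable: the compiler picks the vector width the target machine supports.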
