Computer Organization & Architecture

This document discusses using data-level parallelism through vectorization to improve the performance of operations that are repeated on multiple data elements. It provides examples of extending an instruction set architecture with vector storage and operations such as load vector, add two vectors, and store vector. Implementing vector instructions can improve performance by executing the same operation on multiple data elements with a single instruction. Popular vector instruction sets include Intel SSE and AMD 3DNow!. Programmers can take advantage of vectors by writing intrinsic functions, using vector libraries, or relying on compiler auto-vectorization.


CIS 501: Computer Architecture (Martin/Roth)
Unit 12: Vectors

Best Way to Compute This Fast?

• Sometimes you want to perform the same operations on many data items:

  for (I = 0; I < 1024; I++)
      Z[I] = A*X[I] + Y[I];

• Surprise example: SAXPY (single-precision A times X plus Y)

  0: ldf X(r1),f1     // I is in r1
     mulf f0,f1,f2    // A is in f0
     ldf Y(r1),f3
     addf f2,f3,f4
     stf f4,Z(r1)
     addi r1,4,r1
     blti r1,4096,0

• One approach: superscalar (instruction-level parallelism)
  • Loop unrolling with static scheduling – or – dynamic scheduling
  • Problem: wide-issue superscalar scaling issues
    • N² bypassing, N² dependence check, wide fetch
    • More register file & memory traffic (ports)
• Can we do better?
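The superscalar approach mentioned above depends on exposing independent work to the scheduler. A minimal C sketch of 4-way unrolled SAXPY (the function name is illustrative, not from the slides):

```c
#include <stddef.h>

/* SAXPY unrolled 4x: the four multiply-adds per iteration are
   independent, so a wide-issue superscalar core can schedule them
   in parallel. Assumes n is a multiple of 4 for brevity. */
void saxpy_unrolled(float a, const float *x, const float *y,
                    float *z, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        z[i]     = a * x[i]     + y[i];
        z[i + 1] = a * x[i + 1] + y[i + 1];
        z[i + 2] = a * x[i + 2] + y[i + 2];
        z[i + 3] = a * x[i + 3] + y[i + 3];
    }
}
```

Note that unrolling only exposes the parallelism; the hardware still fetches, checks, and bypasses each instruction individually, which is exactly the N² scaling problem the slide points out.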

Better Alternative: Data-Level Parallelism

• Data-level parallelism (DLP)
  • Single operation repeated on multiple data elements
  • SIMD (Single-Instruction, Multiple-Data)
  • Less general than ILP: parallel insns are all the same operation
  • Exploit with vectors
• Old idea: Cray-1 supercomputer from late 1970s
  • Eight 64-entry x 64-bit floating point "vector registers"
    • 4096 bits (0.5KB) in each register! 4KB for vector register file
  • Special vector instructions to perform vector operations
    • Load vector, store vector (wide memory operations)
    • Vector+Vector addition, subtraction, multiply, etc.
    • Vector+Constant addition, subtraction, multiply, etc.
    • In Cray-1, each instruction specifies 64 operations!

Example Vector ISA Extensions

• Extend ISA with floating point (FP) vector storage …
  • Vector register: fixed-size array of 32- or 64-bit FP elements
  • Vector length: for example 4, 8, 16, 64, …
• … and example operations for vector length of 4
  • Load vector: ldf.v X(r1),v1
      ldf X+0(r1),v1[0]
      ldf X+1(r1),v1[1]
      ldf X+2(r1),v1[2]
      ldf X+3(r1),v1[3]
  • Add two vectors: addf.vv v1,v2,v3
      addf v1[i],v2[i],v3[i]  (where i is 0,1,2,3)
  • Add vector to scalar: addf.vs v1,f2,v3
      addf v1[i],f2,v3[i]  (where i is 0,1,2,3)
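The element-wise expansions above can be modeled directly in C. This sketch (function names are hypothetical) treats a vector register as a 4-element array, mirroring the addf.vv and addf.vs semantics:

```c
#define VLEN 4  /* vector length from the example ISA */

/* addf.vv v1,v2,v3 : element-wise add of two vector registers */
void addf_vv(const float v1[VLEN], const float v2[VLEN], float v3[VLEN]) {
    for (int i = 0; i < VLEN; i++)
        v3[i] = v1[i] + v2[i];
}

/* addf.vs v1,f2,v3 : add scalar f2 to every element of v1 */
void addf_vs(const float v1[VLEN], float f2, float v3[VLEN]) {
    for (int i = 0; i < VLEN; i++)
        v3[i] = v1[i] + f2;
}
```

In hardware the loop disappears: one vector instruction encodes all VLEN element operations, which is the source of the fetch and decode savings discussed next.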
Example Use of Vectors – 4-wide

  Scalar                  Vector
  ldf X(r1),f1            ldf.v X(r1),v1
  mulf f0,f1,f2           mulf.vs v1,f0,v2
  ldf Y(r1),f3            ldf.v Y(r1),v3
  addf f2,f3,f4           addf.vv v2,v3,v4
  stf f4,Z(r1)            stf.v v4,Z(r1)
  addi r1,4,r1            addi r1,16,r1
  blti r1,4096,0          blti r1,4096,0
  7x1024 instructions     7x256 instructions

• Operations (4x fewer instructions)
  • Load vector: ldf.v X(r1),v1
  • Multiply vector by scalar: mulf.vs v1,f2,v3
  • Add two vectors: addf.vv v1,v2,v3
  • Store vector: stf.v v1,X(r1)
• Performance?
  • If CPI is one, 4x speedup
  • But vector instructions don't always have single-cycle throughput
  • Execution width (implementation) vs vector width (ISA)

Vector Datapath & Implementation

• Vector insns are just like normal insns… only "wider"
  • Single instruction fetch (no extra N² checks)
  • Wide register read & write (not multiple ports)
  • Wide execute: replicate floating point unit (same as superscalar)
  • Wide bypass (avoid N² bypass problem)
  • Wide cache read & write (single cache tag check)
• Execution width (implementation) vs vector width (ISA)
  • Example: Pentium 4 and "Core 1" execute vector ops at half width
  • "Core 2" executes them at full width
• Because they are just instructions…
  • …superscalar execution of vector instructions is common
  • Multiple n-wide vector instructions per cycle

Intel's SSE2/SSE3/SSE4…

• Intel SSE2 (Streaming SIMD Extensions 2) – 2001
  • 16 128-bit floating point registers (xmm0–xmm15)
  • Each can be treated as 2x64b FP or 4x32b FP ("packed FP")
  • Or 2x64b or 4x32b or 8x16b or 16x8b ints ("packed integer")
  • Or 1x64b or 1x32b FP (just normal scalar floating point)
  • Original SSE: only 8 registers, no packed integer support
• Other vector extensions
  • AMD 3DNow!: 64b (2x32b)
  • PowerPC AltiVec/VMX: 128b (2x64b or 4x32b)
• Looking forward for x86
  • Intel's "Sandy Bridge" will bring 256-bit vectors to x86
  • Intel's "Larrabee" graphics chip will bring 512-bit vectors to x86

Other Vector Instructions

• These target specific domains: e.g., image processing, crypto
  • Vector reduction (sum all elements of a vector)
  • Geometry processing: 4x4 translation/rotation matrices
  • Saturating (non-overflowing) subword add/sub: image processing
  • Byte asymmetric operations: blending and composition in graphics
  • Byte shuffle/permute: crypto
  • Population (bit) count: crypto
  • Max/min/argmax/argmin: video codec
  • Absolute differences: video codec
  • Multiply-accumulate: digital-signal processing
• More advanced (but in Intel's Larrabee)
  • Scatter/gather loads: indirect load (or store) through a vector of pointers
  • Vector mask: predication (conditional execution) of specific elements
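The saturating subword add listed above is directly available as an SSE2 packed-integer intrinsic. A sketch, assuming an x86 compiler where SSE2 is present (the wrapper function name is illustrative):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Saturating add of 16 unsigned bytes: sums that exceed 255 clamp
   at 255 instead of wrapping around -- the behavior you want when,
   e.g., brightening an image. One instruction covers 16 pixels. */
void add_sat_u8(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vr = _mm_adds_epu8(va, vb);   /* clamps, never overflows */
    _mm_storeu_si128((__m128i *)out, vr);
}
```

With ordinary wrapping addition, 200 + 100 on a byte would give 44; the saturating form gives 255, which is why image-processing kernels use it.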
Using Vectors in Your Code

• Write in assembly
  • Ugh
• Use "intrinsic" functions and data types
  • For example: _mm_mul_ps() and the "__m128" datatype
• Use a library someone else wrote
  • Let them do the hard work
  • Matrix and linear algebra packages
• Let the compiler do it (automatic vectorization)
  • GCC's "-ftree-vectorize" option
  • Doesn't yet work well for C/C++ code (old, very hard problem)
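One reason auto-vectorization is hard in C is pointer aliasing: the compiler must prove that writing z[i] cannot change a later x[i] or y[i]. A sketch of how restrict qualifiers help GCC's vectorizer (enabled by -ftree-vectorize, or at -O3):

```c
#include <stddef.h>

/* Written to be auto-vectorizable: restrict promises the compiler
   that z never aliases x or y, so GCC (-O3, or -O2 with
   -ftree-vectorize) is free to emit SIMD instructions for the
   loop body instead of one scalar multiply-add per element. */
void saxpy_auto(float a, const float *restrict x,
                const float *restrict y, float *restrict z, size_t n) {
    for (size_t i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}
```

The result is the same as the hand-written intrinsics version, but portable: the compiler picks the vector width the target machine supports.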
