
ELEC-H-473 – Microprocessor architecture

Th06: Data parallelism – SIMD


Dragomir Milojevic
Université libre de Bruxelles
2023
Previously
• Instruction Level Parallelism (ILP) introduced to allow parallel
execution of different instruction phases to improve throughput
• Execution hazards reduce instruction throughput; different HW
techniques were proposed to avoid the pipeline stalls that cause
performance loss, but we need to address these in SW too!
• Memory hierarchy with chain of smaller, but faster SRAM
memories has been added to compensate for speed difference
between cores & central (DRAM) memory; memory management
techniques introduced to improve data & instructions throughput
• Super-scalar architectures with multiple ALUs for multiple Ex
steps in the same clock cycle; plus Load/Store units for
overlapped data transfers & computations → parallelism
• But data could be processed even more in parallel ...

ELEC-H-473 Th06 2/65


Today
1. Data parallelism: motivation

2. Parallel processing: old classification, still up-to-date

3. SIMD extensions in Intel CPUs

4. Identifying the CPU

5. Memory alignment

6. SIMD programming for Intel CPUs

7. Automated compiler vectorization


1. Data parallelism: motivation

ELEC-H-473 Th06 4/65


On ALU and data types
• ALUs are designed to be efficient on computations for most
commonly used data types for various applications
• Common data types are integers and floating point of different
bit-widths, could be anything between 8 and 128 bits
• ALU complexity (silicon area) depends on the operand width, so
the CPU architects will trade-off: ALU perf. vs. area (i.e. cost)
• If we need simple computations most of the time, 16-bit ALU(s)
may be enough; this is what you have in simple CPUs, micro-
controllers etc.; note that low-power (mobile) CPUs are already 64-bit
• If we use a CPU with simple ALUs & the application occasionally
requires bigger operands, the ALU could support them at the
expense of performance: the computation on more complex
operands is decomposed into a sequence of smaller operations, so
these operations could take a few cycles to complete
ELEC-H-473 Th06 5/65
GPP ALUs
• General Purpose CPUs have ALUs designed to accept the largest
data type possible & can therefore handle any smaller data type
• This is possible thanks to better IC integration (CMOS scaling)
• The native operand size has kept doubling over the years: from 4 bits
in the first CPUs to 64 bits today, and more (we will see...)
• Let’s look into the inverse problem:
. What happens with execution efficiency for smaller data types?
• Take an example of a 32-bit ALU doing operations on two
unsigned integers encoded with 8 bits:
Operand (8-bit)    In the 32-bit ALU
Min : 0x00           0x000000B5
Max : 0xFF         + 0x00000011

The 6 leading hex digits (24 bits out of 32) are not used!
ELEC-H-473 Th06 6/65
How inefficient is this?
• Consider the following program & assume a more realistic 64-bit ALU:
 
1 unsigned char A[1000000], B[1000000], C[1000000];
2 int i;
3 for (i = 0 ; i < 1000000; i++) {
4 C[i] = A[i] + B[i];
5 }
 

• ALU usage efficiency will be only 12.5%


• What will happen if we assume 128-bit ALU and same data type?
• Computations over smaller data types using ALUs designed to
work for wide data types are highly inefficient ...
• And it is not only about performance, power efficiency is poor too:
. What will happen in code above if A, B, C were signed integers?

ELEC-H-473 Th06 7/65


Solution: pack more data to use all 64 bits
• We could do the following using a super-scalar CPU with 4 ALUs: 
1 unsigned char A[1000000], B[1000000], C[1000000]; // single-byte elements
2 int i;
3 for (i = 0 ; i < 1000000; i += 4) {
4 C[i+0] = A[i+0] + B[i+0]; // 0xB5 + 0x11 for i=0
5 C[i+1] = A[i+1] + B[i+1]; // 0xA0 + 0xAB for i=0
6 C[i+2] = A[i+2] + B[i+2]; // 0xBB + 0xFF for i=0
7 C[i+3] = A[i+3] + B[i+3]; // 0x10 + 0x30 for i=0
8 }
 
and compute 4 sums in parallel in each iteration! These are
independent data, so this is doable – remember loop unrolling
• Even better: let’s do the computation with a single 64-bit ALU!
• How? Pack the data, use the same ALU & stop propagating the carry
between lanes; the first iteration above can then be done in one Ex
cycle (see the sketch after the table below)
Bits 31-24 23-16 15-08 07-00
Iter. i+3 i+2 i+1 i+0
Op.1 0x10 0xBB 0xA0 0xB5
Op.2 0x30 0xFF 0xAB 0x11
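A minimal illustration (not from the slides, in plain C, with a hypothetical helper name): the four unsigned bytes of each operand are packed into one 32-bit word and added lane by lane, the masks stopping a carry out of bit 7 of one lane from spilling into the next lane.

#include <stdint.h>

/* SWAR-style packed byte add: each of the four byte lanes of a and b is
   added independently (modulo 256); no carry propagates between lanes. */
uint32_t paddb_swar(uint32_t a, uint32_t b) {
    uint32_t low  = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu); /* add low 7 bits per lane   */
    uint32_t high = (a ^ b) & 0x80808080u;                  /* recompute each lane's MSB */
    return low ^ high;
}

For the first iteration above, paddb_swar(0x10BBA0B5, 0x30FFAB11) returns 0x40BA4BC6, i.e. the four byte sums 0x40, 0xBA, 0x4B, 0xC6 computed in a single pass.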
ELEC-H-473 Th06 8/65
This solution has benefits ...
• Better computing performance
. Sets of data are vectors: we speak of vector processing
. One could design dedicated vector ALUs; or, if we already have a
wide-operand ALU, smaller data types AND the need (or ability) to do
vector operations, we can modify that ALU so that it works as a vector
processor
. This is also known as sub-word parallelism: we modify the ALU to
work on vectors of smaller operands; one Ex operation is still done in
one clock cycle, but over the whole vector of values!
• Better memory access
. Instead of reading individual vector elements we read sets of data;
not only do we compute faster, we also transfer data more efficiently
from memory to the Register File (& the other way around)
. Bus transfers between central memory & cache, and between caches &
CPU, are wider than 8 bits (64 bits between DIMM and CPU
packages); atomic transfers should use this full width
. But memory accesses have some issues that we will cover in Section 5

ELEC-H-473 Th06 9/65


2. Parallel processing: old classification, still
up-to-date

ELEC-H-473 Th06 10/65


Sequential processing

• In our simplified computer architecture model we assumed that


the ALU exhibits bit-level parallelism; that is all bits in operands
will be processed in parallel, concurrently
• Thus, any elementary arithmetic/logic operation will be executed
in parallel for all bits in the arguments, and would take let’s say 1
Clk cycle to complete (this may vary, but let’s keep it simple)
• To keep the complexity of the ALU low, the number of operands
is limited to 2 or 3 arguments, rarely more
• Complex operations and/or computations on more operands, are
done using a sequence of simple computations
• We speak of sequential execution: instructions are executed on a
limited set of data, one after the other in time (Turing machine)

ELEC-H-473 Th06 11/65


Parallel processing
• As opposed to sequential processing, if we do some computations
concurrently we can speak of parallel processing
• Because instruction execution is a lengthy process, it is
decomposed in steps (F, D, Ex, W) and they are executed in a
pipeline → Instruction-Level Parallelism (ILP); once pipeline is
full, 4 steps are executed in parallel for different instructions
• This is yet another form of parallel execution (other than the ALU
operating in parallel on the operand bits)
• We also added multiple execution units to increase the parallelism
and allow data independent instructions to be computed in
parallel – super-scalar
• There are TOO many parallel things going on, so let’s try to put
some order in it ...

ELEC-H-473 Th06 12/65


Options for parallel computations
• Sequential vs. parallel computation: how to classify computer
architectures?
• Look at the Harvard architecture & the number of occurrences of
instruction & data memories (duplicated)
• How many instructions & data are handled at each execution cycle
of our architecture: could be single or multiple
• So, instructions & data could each be single or multiple → there are
4 combinations in all
• Michael Flynn did this in 1972, to classify different computing
systems; his classification (or taxonomy) is still in use

[Figure: Harvard architecture – CPU (ALU, RF, CTRL) connected to
separate Data Memory, Instruction Memory and IO]

ELEC-H-473 Th06 13/65


Flynn’s classification – four combinations
• Single Instruction Single Data (SISD)
At each clock cycle only one instruction executes over one pair
of operands; this is what we already have in our model
• Single Instruction Multiple Data (SIMD)
At each clock cycle one instruction executes over sets of
operands; this is vector processing, sub-word parallelism we
spoke about in previous section
• Multiple Instruction Single Data (MISD)
At each clock cycle multiple instruction work on same data
• Multiple Instruction Multiple Data (MIMD)
At each clock cycle multiple instruction can execute over sets
of operands

• Any computing system today will fall into 1 of these 4 categories

ELEC-H-473 Th06 14/65


Flynn’s classification – practical systems today
• SISD – most common, versatile since they can compute anything
that is computable (sequence of computation, c.f. Turing);
trade-off execution time for system complexity; they depend
heavily on system speed, and thus CMOS, to deliver performance
• SIMD – very common in the ’70s and became popular again in past
decades: array processors, Graphical Processing Units (GPUs), extensions
to traditional CPUs; omnipresent today in high-performance, but
also low-power CPUs
• MISD – specific machines, not that many examples; good for
fault tolerance (Space Shuttle) or systolic arrays (used for Deep
Neural Networks, though not really pure MISD)
• MIMD – Any multi, many core/CPU system from desktop
computer to computing cluster to internet based computation,
data centers, cloud computing, super-computing etc.
ELEC-H-473 Th06 15/65
SIMD computers
• Used in ‘60, ‘70 and ‘80 to build super-computers with dedicated CPUs
built only for these machines & not as off the shelf components!
. CRAY Research Co. – ULB had one, you can see it from Av. Buyl
. Thinking Machines – Connection Machines CM1 to CM5

• During the ’90s, super-computers started to use the general purpose
CPUs found in normal desktop computers to cut cost, while adding
special HW features (e.g. Silicon Graphics started using Intel CPUs)
• From then on SIMD moved into General Purpose computing

ELEC-H-473 Th06 16/65


Why did SIMD move into CPU architectures?
• Multimedia data: audio, image, video ... and combinations
• All these use Digital Signal Processing (DSP) algorithms and
these algorithms need high-performance CPUs; examples:
. Speech compression, filters & recognition algorithms
. Video display and capture routines
. Rendering routines & 3D graphics (geometry)
. Image and video processing algorithms
. Spatial (3D) audio
. Physical modeling (graphics, CAD)
. Encryption algorithms, complex arithmetics
• Most DSP algorithms (true for 1, 2 or 3D) have similar properties:

Huge arrays of small data types on which we should do


the same thing (i.e. apply same instructions)
Looks like a perfect candidate for SIMD!
ELEC-H-473 Th06 17/65
SIMD is found everywhere!
• HPC general purpose CPUs (Intel, AMD, Apple M1, etc.) &
dedicated processors (GPUs from Nvidia, gaming CPUs etc.)
• Mobile processors: ARM’s NEON (supported by Apple silicon);
MIPS MDMX (MaDMaX) and MIPS-3D
• Example: Cell Processor in Playstation
. Synergistic Processing Elements
(SPE): 8 fully functional, but simplified
cores, next to one PowerPC Processing
Element (PPE) general purpose core
. Each SPE has 7 execution units
including SIMD floating point unit;
SPEs are dual issue with max. 2
instructions issued in parallel in each
cycle; no branch prediction
. No cache, but 256KB of local memory (local store) per SPE
ELEC-H-473 Th06 18/65
Is your application suited for SIMD?
Intel proposes the following chart to see if SIMD is for you:
• Do we need to speed up our code? If yes, identify the SW bottlenecks ...
• Do we use FP? (left branch of the chart)
. Do we really need FP computations for precision? If not, convert to int
. What is the FP size? Do we REALLY need FP of that size?
• Can the SW be vectorized? (right branch)
. Could a single instruction be applied to different data sets at the same
time? If yes: bingo → the program could be implemented in SIMD;
note that this question is not that simple to answer, see Section 7

[Figure: Intel's decision chart for adopting SIMD extensions – identify
hot spots, check whether the code benefits from SIMD, integer vs.
floating-point, range or precision, re-arrange & align data structures,
convert the code to use SIMD technologies, follow the coding guidelines,
use memory optimizations & prefetch, schedule instructions]
ELEC-H-473 Th06 19/65


SIMD usage overview
• Disadvantages
. Not all algorithms can benefit from data parallelism; once written,
program is most likely to be architecture dependent, so less portable
. Register files & SIMD HW support in cores cost area; but looking at
everything else (e.g. memory hierarchy) the overhead is acceptable
. If the program is referencing data in non-contiguous manner, this is
a BIG problem (sometimes it can be solved, but not always)
. No guarantees of automated use – code vectorization is possible,
but not guaranteed, see Section 7
. Programming can be difficult; even simple programs could take
anything from little to significant development time (nobody likes
that); time to market can be accelerated using pre-built libraries
• Advantages
. Computational speed → 10× improvements are not uncommon!

ELEC-H-473 Th06 20/65


3. SIMD extensions in Intel CPUs

ELEC-H-473 Th06 21/65


Overview
• Intel has introduced SIMD computational paradigm in their
general purpose CPUs for quite some time now; different SIMD
extensions to standard instruction set have been proposed:
. MMX – introduced in 1997 with Pentium MMX CPU
. SSE – from 1999 to 2007 in NetBurst & Prescott CPUs
. AVX – 1st and 2nd generation in 2016 until now
• These SIMD extensions differ in:
. Operand size – increased from initial 64-bits to 256, and now 512
bits; this is 8× in 20 years!
. Instruction set – more instructions added to simplify SW; more
HW, since more gates per cm2 of an IC
. HW resources – each extension introduces more memory & more
dedicated computational resources
• All Intel SIMD extensions are backward compatible: CPU with
AVX2 will run MMX, but it doesn’t work the other way around
. What happens if you execute instruction that doesn’t exist in ISA?
ELEC-H-473 Th06 22/65
MMX – MultiMedia eXtension
• Also said to stand for Multiple Math eXtension or Matrix Math eXtension
• Idea: reuse the FP ALU & Register File for SIMD; the RF is the same as
the regular one; just register name aliasing to save resources
• The 8 FP registers are 80 bits wide; 64 of those bits store packed data:
. Assume double precision IEEE floating point format (64 bits)
. Or any of 2×32-bit, 4×16-bit, 8×8-bit integers
. Exclusive FP or SIMD operation, so no concurrent execution of FP
& SIMD instructions (normal, there is no dedicated HW for SIMD)

← 64 bits → Vector
X 1 × 8 Bytes (FP)
X X 2 × 4 Bytes
X X X X 4 × 2 Bytes
X X X X X X X X 8 × 1 Byte

ELEC-H-473 Th06 23/65


SIMD instructions & data types
• Imagine that we have a CPU that can work in SIMD mode with
data vectors of 64 bits (width of the SIMD ALU and registers)
• The ISA provides a mov instruction that loads a 64-bit word from
memory into one of the 8 mm registers; memory transfers are
data-type agnostic, they just manipulate words of bits
• ALU can operate in 8 or 16-bit modes (so different data-types)
and let’s assume add operation using some kind of SIMD
instruction padd
• Because we have two different data types and no way to
distinguish between single or 2-bytes operations, we need two
different instructions to perform addition:
. paddb – does the add operation on 1-byte elements
. paddw – does the add operation on 2-bytes elements
• Decoded instructions generate ctrl signals to configure SIMD HW
ELEC-H-473 Th06 24/65
SIMD instructions & data types: example
EDX and EBX point to 64-bit arrays (8 × 8-bit words)

mov mm0, [edx]   ; mm0 = 8 7 6 5 4 3 2 1          (bytes)
mov mm1, [ebx]   ; mm1 = 7 6 5 4 3 2 1 0
paddb mm1, mm0   ; mm1 = 15 13 11 9 7 5 3 1       (byte-wise sums)

mov mm0, [edx]   ; mm0 = 2055 1541 1027 513       (same bytes read as 4 × 16-bit words)
mov mm1, [ebx]   ; mm1 = 1798 1284  770 256
paddw mm1, mm0   ; mm1 = 3853 2825 1797 769       (word-wise sums)

Make sure you understand the result of the code above !

ELEC-H-473 Th06 25/65


SSE – Streaming SIMD Extension
• MMX evolution, introduce 128-bit wide registers: xmm0-xmm7
• Not just new names: they are not any more aliased FP registers
but a dedicated Register File (thanks CMOS scaling!)
• Extra floating point instructions added to ISA, so:
. Data movements: MOVAPS, MOVUPS, MOVLPS, MOVHPS,
MOVLHPS, MOVHLPS
◦ MOVAPS, MOVUPS – aligned & non-aligned memory accesses; we will
discuss this in more detail in Section 5
. Arithmetic: ADDPS, SUBPS, MULPS, DIVPS, RCPPS,
SQRTPS, MAXPS, MINPS, RSQRTPS
. Compare: CMPSS, COMISS, UCOMISS, CMPPS
. Type conversion: CVTSI2SS, CVTSS2SI,
CVTTSS2SI
. Bitwise logical operations: ANDPS, ORPS, XORPS, ANDNPS

ELEC-H-473 Th06 26/65


SSE versions
SSE extensions evolved in time:
• SSE2 – 2001, Pentium4
. New math instructions for double precision (64-bit) FP
. SIMD operations on any data type: from 8-bit integer to 64-bit float
entirely with xmm vector-register file, no need to use the legacy
MMX or FPU registers
• SSE3 – 2005, Prescott (variant of Pentium4)
. Added specific Digital Signal Processing & 3D graphics instructions
. More instructions that enable easy manipulation of the words inside
xmm registers (array elements)
. Better transfer for unaligned memory access

• SSE4 – 2006 in various flavors


. ... more general purpose instructions, not only multi-media

ELEC-H-473 Th06 27/65


AVX – Advanced Vector Extensions 1st generation
• Next major evolution of SSE (2011)
• Main features:
. The data path increases from 128 bits to 256 bits, with a Register File of
16 × 256-bit registers named ymm0 to ymm15
. Before, all instructions had 2 operands only, with the following format:
xmm0 ← xmm0 + xmm1
i.e., one source & the destination register must be the same; the
above erases the content of xmm0; if you want to keep the xmm0
value, you need to copy it elsewhere in the Register File; this register
pollution causes more frequent memory accesses & performance loss
. AVX introduces 3-operand instructions:
ymm2 ← ymm0 + ymm1
ymm0 & ymm1 remain untouched since the destination is now ymm2
• AVX requires OS support (it will not work with older OSs)
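A small sketch (not from the slides) of the 3-operand form through AVX intrinsics in C; the function name add8 and the requirement that a, b and c are 32-byte aligned arrays of 8 floats are assumptions made for the example.

#include <immintrin.h>

/* Adds 8 floats at once; _mm256_add_ps compiles to the 3-operand
   vaddps ymmD, ymmS1, ymmS2 and leaves both source registers untouched. */
void add8(const float *a, const float *b, float *c) {
    __m256 va = _mm256_load_ps(a);      /* aligned 256-bit load into a ymm register */
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  /* result written to a third register       */
    _mm256_store_ps(c, vc);             /* aligned 256-bit store                    */
}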

ELEC-H-473 Th06 28/65


AVX – 2nd generation
• AVX2 – yet another evolution (2013)
• AVX-512 (2015) then widens the operands to 512 bits for even bigger vectors
. And we will stop here!!!
• The Register File is now composed of 32 × 512-bit registers; this is double
the AVX register file in both the number of registers and the register
width; register names: zmm0 to zmm31
• Previous 128-bit registers (xmm0 to xmm31) and 256-bit registers
(ymm0 to ymm31) are sub-set of the above SIMD Register File
• Different instruction sets for different CPU families; this is new
since in the past entire SIMD instruction set of a given extension
family has been supported by all CPUs in the same generation
• Supports 4-operands operation

• This is a lot of options! How to program them becomes a challenge


ELEC-H-473 Th06 29/65
4. Identifying the CPU

ELEC-H-473 Th06 30/65


How to write portable SW for so many SIMD ISAs?
• There is no magic: architecture dependent optimizations, those
that use the most of a given CPU architecture & SIMD extension,
do mean that you need to write a piece of code for each CPU
variant you may encounter (i.e. SIMD extension)
• First, you need to identify your CPU exactly; for Intel CPUs you
can use assembly instruction cpuid to identify the exact
architecture you are running
• When called, this instruction fills the standard registers with codes
that are unique for a given CPU architecture & model
. Depending on the value loaded into EAX before the call, different
info is returned
. If EAX=0 you get the manufacturer ID – a 12-character ASCII string
stored in EBX, EDX, ECX
. If EAX=1 you get extended processor info in EAX and
feature bits in EBX, EDX, ECX
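A compact sketch (not from the slides) using the GCC/Clang helper header cpuid.h; on MSVC the equivalent would be the __cpuid intrinsic.

#include <cpuid.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    char vendor[13] = {0};
    if (__get_cpuid(0, &eax, &ebx, &ecx, &edx)) {   /* EAX=0: manufacturer ID          */
        memcpy(vendor + 0, &ebx, 4);                /* the 12 ASCII chars come back    */
        memcpy(vendor + 4, &edx, 4);                /* in the order EBX, EDX, ECX      */
        memcpy(vendor + 8, &ecx, 4);
        printf("Vendor: %s\n", vendor);             /* e.g. "GenuineIntel"             */
    }
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))     /* EAX=1: processor info & features */
        printf("Leaf 1 EAX = 0x%08x\n", eax);
    return 0;
}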
ELEC-H-473 Th06 31/65
Intel CPU families and their exact identification
• Assume EAX=1 before the cpuid call → the following info is found in
EAX, a 32-bit register (here we limit ourselves to the EAX content)

EAX after cpuid (EAX=1):
Bits 31–28  Reserved
Bits 27–20  Extended Family ID
Bits 19–16  Extended Model ID
Bits 15–14  Reserved
Bits 13–12  CPU type
Bits 11–8   Family ID
Bits 7–4    Model
Bits 3–0    MaskID (stepping)

• Exact CPU model is derived from Model, Extended Model ID and


Family ID fields (we have many options)
• Very low-level information since even mask set (MaskID) for
wafer processing is specified!
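A sketch (not from the slides) of how these fields can be extracted in C, assuming eax holds the value returned by cpuid leaf 1; the composition rule for the displayed family/model follows Intel's CPUID documentation.

/* eax = value returned in EAX by cpuid with EAX=1 (assumption) */
unsigned stepping   = (eax >> 0)  & 0xF;    /* MaskID / stepping      */
unsigned model      = (eax >> 4)  & 0xF;
unsigned family     = (eax >> 8)  & 0xF;
unsigned ext_model  = (eax >> 16) & 0xF;
unsigned ext_family = (eax >> 20) & 0xFF;

/* Displayed values combine the base and extended fields */
unsigned disp_family = (family == 0xF) ? family + ext_family : family;
unsigned disp_model  = (family == 0x6 || family == 0xF)
                     ? (ext_model << 4) | model : model;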

ELEC-H-473 Th06 32/65


Putting everything together
Using the manual processor dispatch method:
• First: __declspec(cpu_dispatch(cpuid,cpuid,...)) is used
to provide a list of targeted CPU architectures, along with an empty
function body (AKA function stub)
• Then __declspec(cpu_specific(cpuid)) is used to declare each
function version targeting a particular CPU architecture
• The Intel CPU type is detected at runtime using cpuid, and the
corresponding function version is executed
 
1 #include <stdio.h>
2 __declspec(cpu_dispatch(generic, future_cpu_16)) // list of functions
3 void dispatch_func() {}; // empty function stub
4 __declspec(cpu_specific(generic)) void dispatch_func() {
5 printf("Code for non-Intel processors and generic Intel\n");
6 }
7 __declspec(cpu_specific(future_cpu_16)) void dispatch_func() {
8 printf("Code for 2nd generation Intel Core processors goes here\n");
9 }
10 int main() {
11 dispatch_func();
12 printf("Return from dispatch_func\n"); return 0;
13 }
 
ELEC-H-473 Th06 33/65
5. Memory alignment

ELEC-H-473 Th06 34/65


On memory transfers – top-down
• Central memory (DRAM) is accessed using 64-bit words; because
of DRAM technology and DDR protocols, access to memory is
more complicated than just simple R/W/Clk signals; we need a
full-blown digital circuit to enable DRAM access → DRAM controller
• DRAM pin count is limited by cost, and BW is limited by the interface
speed; HBMs introduce larger interfaces with more data pins
thanks to IC/package-level integration
• On-chip SRAM cache interfaces may be larger, since implemented
at IC level (we can have more wires running at higher speed)
• For all these memories, accessing a single byte is far from
optimal; we want to perform bursts of data transfers; this
optimizes transfer time/energy per bit (byte)
• Processors are designed so that memory transfers from certain
addresses are more efficient (and we know memory transfers are
important for SW performance because of the latency involved)
ELEC-H-473 Th06 35/65
Memory alignment
• A memory address A is said to be aligned if A mod n = 0,
where n is the width of the accessed data in bytes
• When a memory address is misaligned, the value A mod n
gives the offset from the alignment boundary: the HW needs to
“extract” the data using this offset (a small check in C is sketched
at the end of this slide)
. Assume 1-byte addressing and 8-byte alignment; addresses
0x0, 0x8, 0x10 are aligned, but 0x3 is not
[Figure: byte addresses 0x0–0xF, with the aligned addresses marked
every n bytes]

• Intel CPUs (AVX-512) are designed so that memory movements


are optimal if performed on 64-byte boundary aligned addresses
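A small sketch in C (not from the slides) of the alignment test: a pointer p is n-byte aligned when its numerical value is a multiple of n.

#include <stdint.h>
#include <stddef.h>

/* Returns 1 if p sits on an n-byte boundary (n = 16, 32, 64 for the usual SIMD cases). */
static int is_aligned(const void *p, size_t n) {
    return ((uintptr_t)p % n) == 0;
}

/* Offset from the previous aligned address; 0 means aligned. */
static size_t misalignment(const void *p, size_t n) {
    return (uintptr_t)p % n;
}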

ELEC-H-473 Th06 36/65


How to align data?
• Consider allocation of a static variable: the use of appropriate
clause/attribute tells the compiler what to do with the address
 
1 __declspec(align(64)) float A[1000];
 
• For dynamic allocation you need to use dedicated malloc
functions; in case of Intel compilers _mm_malloc() and
_mm_free(); other compilers do provide similar functions
(program is not only architecture, but also compiler dependent)
 
1 char *buf;
2 buf = (char*) _mm_malloc(bufsizes[i], 64);
3 ...
4 _mm_free(buf);
 
• Allocating aligned memory is not enough, the compiler should also know
that the addresses are aligned: __assume_aligned(a,64); the compiler
can then use instructions for aligned memory accesses! (slide 26)
ELEC-H-473 Th06 37/65
Aligning composite data types
• Consider the following structure to be used in an array: 
1 struct myStruct {
2 short a, b; // 2x2 Bytes
3 int c, d; // 2x4 Bytes
4 unsigned char e[4]; // 4x1 Byte
5 };
 
• We know the data will be allocated contiguously; some data elements will
not be aligned, & their access will be slow; example with 8-byte alignment:
0 1 2 3 4 5 6 7 8 9 A B C D E F
a a b b c c c c d d d d e e e e
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑

• Solution – insert “empty bytes” so that each element in the structure
can be accessed at an aligned address → padding (manual or by the
compiler); we waste memory resources, but memory is sometimes cheap
• With 8-byte alignment, for example, a & b stay in the first 8-byte
block, c is padded out to start at offset 8, d at offset 16, and so on
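A short sketch (not from the slides) that makes the compiler's actual layout visible with offsetof; the struct mirrors the one above and the printed offsets show where padding was inserted.

#include <stdio.h>
#include <stddef.h>

struct myStruct {
    short a, b;           /* 2 x 2 Bytes */
    int c, d;             /* 2 x 4 Bytes */
    unsigned char e[4];   /* 4 x 1 Byte  */
};

int main(void) {
    printf("a@%zu b@%zu c@%zu d@%zu e@%zu, sizeof=%zu\n",
           offsetof(struct myStruct, a), offsetof(struct myStruct, b),
           offsetof(struct myStruct, c), offsetof(struct myStruct, d),
           offsetof(struct myStruct, e), sizeof(struct myStruct));
    return 0;   /* any padding shows up as gaps between the offsets */
}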
ELEC-H-473 Th06 38/65
SIMD and memory alignment
• In SIMD, efficient vector moves are strongly desirable because of the
operand sizes (AVX-512 deals with vectors of 512 bits)!
• Memory moves: aligned movaps & non-aligned movups
• Loop with aligned access
 
1 movaps xmm0, [b] ; load 16 bytes from b[]
2 addps xmm0, [c] ; add xmm0 with c[]
3 movaps [a] , xmm0 ; store xmm0 in the a[]
 
• Loop with non-aligned access
 
1 movups xmm0 , [b+2] ; load 16 bytes from b[]
2 movups xmm1 , [c+3] ; load 16 bytes from c[]
3 addps xmm0 , xmm1 ; xmm0 = xmm0 + xmm1
4 movups [a+1], xmm0 ; store xmm0 in the a[]
 
• Important note: using an aligned move (movaps) on a non-aligned
address will most likely cause your program to crash

ELEC-H-473 Th06 39/65


Importance of aligned access
• The table shows MMX (Pentium3) vs. SSE (Pentium4) speed-up of
aligned memory accesses over non-aligned accesses, for
different data types & array sizes 1
Array size char short integer float
MMX SSE MMX SSE MMX SSE MMX SSE
256 1.46 1.64 1.58 1.8 1.05 1.71 3.37 4.73
512 1.58 1.78 1.66 2.01 1.1 1.91 3.72 5.46
1024 1.66 2.02 1.74 2.24 1.08 2.04 4.00 3.96
32768 1.13 2.7 1.12 2.10 1.1 1.91 1.41 2.47
64000 1.28 2.71 1.23 2.20 1.16 1.04 1.45 1.13
1048576 1.15 1.83 1.11 1.99 1.1 1.25 1.36 1.53
4194304 1.15 1.94 1.11 2.00 1.08 1.15 1.38 1.24
16777216 1.14 2.24 1.12 1.89 1.09 1.15 1.37 1.23
Average 1.32 2.11 1.33 2.03 1.09 1.52 2.26 2.72

Up to 3× → this is significant!
1 Source: Performance Impact of Misaligned Accesses in SIMD Extensions
ELEC-H-473 Th06 40/65
Memory alignment & caches
• Assume 64-byte wide cache line and 16 byte address alignment; 4
data blocks form 1 cache line; 1 cell below is 4Bytes wide
• If data is accessed at a 64-byte boundary, the cache is well used because
the read starts at the boundary and will use all 64 bytes
• If not, then two cache line reads are needed to access
the data – a cache line split; this affects performance
[Figure: a cache line divided into 16-byte blocks; an access on the
boundary is aligned, while accesses split by 4, 8 or 12 bytes straddle
two cache lines → cache line split]

ELEC-H-473 Th06 41/65


6. SIMD programming for Intel CPUs

ELEC-H-473 Th06 42/65


How to use SIMD in Intel CPUs?
After all there are not that many options:
a. Assembly – do it the hard way! use assembly instructions and
program CPU directly; instructions could be CPU architecture
dependent: you may use more recent SIMD extensions with new
instruction set, but the code will not run on older CPUs
b. Intrinsics – assembly instructions are abstracted with C primitive
functions; close to assembly, but just a little bit more user friendly
& portable; just mentioned here ...
c. Intel Performance Libraries – already implemented functions,
supplied in libraries; you just need to interface them and compile
your code; generic functions, could be further optimized, but best
possible time-to-market option for SIMD
d. Automatic vectorization – some compilers can do this, but don’t
expect miracles (we will discuss this a little bit more)
ELEC-H-473 Th06 43/65
Trade-off for different SIMD programming options
SIMD programming options → performance vs. ease of use:

[Figure: programming options placed on a performance vs.
ease-of-programming/portability axis – assembly, intrinsics,
automatic vectorization, plain C/C++/Fortran]

As usual → no free lunch – assembly gives the best performance but the
worst ease of use; automated vectorization could reach good results but
poor ones too! (the other solutions sit in between)
ELEC-H-473 Th06 44/65
a. Assembly & in-line assembly
• You could use assembly & a compiler, but a more user-friendly solution is
in-line assembly → insert assembly at any point in C/C++ code
• Example – simple loop in C:
 
1 void add(float *a, float *b, float *c) {
2 for (int i = 0; i < 4; i++)
3 c[i] = a[i] + b[i];
4 }
 
becomes using in-line SIMD assembly:
 
1 void add(float *a, float *b, float *c) {
2 __asm {
3 mov eax, a
4 mov edx, b
5 mov ecx, c
6 movaps xmm0, xmmword ptr [eax] // note new move instructions
7 addps xmm0, xmmword ptr [edx] // note new processing instruc.
8 movaps xmmword ptr [ecx], xmm0 // Pointers must be aligned!!!
9 }
10 }
 
• This is what we are going to do ...
ELEC-H-473 Th06 45/65
b) Intrinsics
• Intrinsics – predefined C functions that map directly to assembly
instructions for easier use (no need for assembly, a compiler is enough)
• Performance could be as good as assembly, but easier to write
• Compatible across different compilers (if they support intrinsics)
 
1 void add(float *a, float *b, float *c) {
2 for (int i = 0; i < SIZE; i++) { // Loop in C
3 c[i] = a[i] + b[i]; }
4 }
 
 
1 #include <xmmintrin.h>
2 void add(float *a, float *b, float *c) {
3 __m128 t0, t1; // XMM 128 bit registers
4 t0 = _mm_load_ps(a); // load 128 bit data
5 t1 = _mm_load_ps(b);
6 t0 = _mm_add_ps(t0, t1); // SIMD add
7 _mm_store_ps(c, t0);
8 }
 
Note xmmintrin.h header file! Why do we need this?
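The header declares the __m128 type and the _mm_* intrinsics. A usage sketch (not from the slides): since _mm_load_ps/_mm_store_ps require 16-byte aligned pointers, the buffers are allocated with _mm_malloc.

float *a = (float*)_mm_malloc(4 * sizeof(float), 16);  /* 16-byte aligned buffers */
float *b = (float*)_mm_malloc(4 * sizeof(float), 16);
float *c = (float*)_mm_malloc(4 * sizeof(float), 16);
/* ... fill a[0..3] and b[0..3] ... */
add(a, b, c);                                          /* the intrinsics version above */
_mm_free(a); _mm_free(b); _mm_free(c);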
ELEC-H-473 Th06 46/65
Practical assembly & intrinsics
• We have seen that SIMD instruction sets become richer with
every new extension; old instructions are kept in the ISA for backward
SW compatibility
• If in the past you missed a function & you blew your brains out to
figure out the vector (SIMD) algorithm for it (personal
experience), in the next SIMD generation this function might
appear as a single instruction with appropriate HW support
• There is a chance that the above HW will further improve the SW
performance & not just ease up code writing
• Advice → you need to be up to date with most recent
developments: you need to read documentation
• Intel is very good at making documents, assuming that you make the
effort; there is a lot of documentation, with many, many pages and
loads of useful information; soft skill: how to browse
documentation quickly & find what you are looking for
ELEC-H-473 Th06 47/65
c) Intel Integrated Perf. Primitives – IPPs (2019)
• What it is?
A highly optimised SW library that provides a comprehensive set of
application domain specific plug-in functions to outperform any com-
piler specific optimisation; uses the Streaming SIMD Extensions (SSE),
Advanced Vector Extensions 2 (AVX2), and Advanced Vector Exten-
sions 512 (AVX-512) instruction sets; targets different Intel processors:
Atom, Core, & Xeon
• How to use?
With Intel Parallel Studio XE and System Studio (Intel IDEs)

• What do they cover?


Low-level building blocks for (1D) signal processing, (2D,
3D) image processing, computer vision and data processing (data
compression / decompression and cryptography) applications

ELEC-H-473 Th06 48/65


Example of IPPs usage
• Function ippsTriangle_16s generates 1D triangular signal with
specified frequency rFreq, phase pointed by pPhase, and magnitude
magn argument
• Function computes len samples of the triangle, and stores them in the
array pDst (pointer to a pre-defined data type)
• Triangle: asymmetric if asym is in the range (−π, π) and non-zero; symmetric if asym = 0
 
1 void func_triangle_direct() {
2 Ipp16s* pDst = ippsMalloc_16s(512); // IPP aligned allocation, predefined data types
3 int len = 512;
4 Ipp16s magn = 4095;
5 Ipp32f rFreq = 0.02;
6 Ipp32f asym = 0.0;
7 Ipp32f Phase = 0.0;
8 IppStatus status;
9
10 status = ippsTriangle_16s(pDst, len, magn, rFreq, asym, &Phase); // phase passed by pointer
11 if(ippStsNoErr != status)
12 printf("Intel(R) IPP Error: %s",ippGetStatusString(status));
13 ippsFree(pDst);
14 }
 
ELEC-H-473 Th06 49/65
d) Automated vectorization
• Some compilers provide support for automated SIMD code
generation out of standard C statements
• Note that this can be done only in specific cases, when the
C-code is written in such way that the compiler can effectively
vectorize the code
• We will see how sensitive SIMD is to the algorithm; the compiler can’t
change the algorithm, and SIMD is often about doing things in
another way
• Automated vectorization is often used in conjunction with
compiler directives, i.e. user defined hints to compiler to simplify
vectorization process (just think of memory alignment)
• Bottom line → don’t expect that something like:
compile -auto_vectorise input.c
works out of the box, even if some compilers may have the switch
ELEC-H-473 Th06 50/65
7. Automated compiler vectorization

ELEC-H-473 Th06 51/65


Background
• Almost all compilers provide some kind of -O switch to turn on
automated optimization with the aim of improving performance of the
produced executable
• Take the Intel C++ Compiler – ICC (similar for gcc or MS Visual C):
-O0 No optimization
-O1 Optimise for size
-O2 Optimise for speed (exclusive with the above)
-O3 Enable O2 + intensive loop optimisations
-msse3 Enables SSE3 code generation (also runs on non-Intel CPUs)
• Thus, optimizations may include SIMD code generation – ICC will look
for vectorization opportunities mostly on loops whenever you compile
with -O2 or higher switch; works for Intel & non-Intel CPUs, but
possibly much better for Intel micro-architecture, Intel knows ©
• Vectorization may be explicitly disabled with -no-vec to make direct
performance comparison of the executable
• Vectorization report may be generated with -vec-report:
MultArray.c(92): (col. 5) remark: LOOP WAS VECTORIZED
ELEC-H-473 Th06 52/65
1. Loops must be countable
• The loop count must be known at entry to the loop at
run-time; it does not have to be known at compile-time
• The counter could be a variable, but the variable must remain
constant for the duration of the loop
• This also implies that exit from the loop must not be
data-dependent
 
1 SIZE = z*y;
2
3 for (j = 0;j < SIZE; j++) {
4 b[j] = a[j] * x[j]; // this should be ok
5 }
6 for (j = 0;j < SIZE; j++) {
7 b[j] = a[j] * x[j];
8 SIZE = b[j]; // NOT OK: SIZE must not be modified inside the loop
9 }
 

ELEC-H-473 Th06 53/65


2. Single entry and single exit
• Implied by previous; code below can’t be vectorized due to the
second data-dependent exit (branch):
 
1 void no_vec(float a[], float b[], float c[]) {
2 int i = 0;
3 while (i < 100) {
4 // this should be ok
5 a[i] = b[i] * c[i]; ++i;
6 }
7 while (i < 100) {
8 // but this is NOT ok
9 // data-dependent exit condition:
10 if (a[i] < 0.0) break;
11
12 // Can you write a condition on single SIMD element?
13 ++i;
14 }
15 }
 

ELEC-H-473 Th06 54/65


3. Straight-line code
• Different iterations MUST have same control flow, i.e. they must
not branch; however, if statements may be vectorized if they can
be implemented as masked assignments: the calculation is
performed for all data elements, but the result is stored only
for those elements for which the mask evaluates to true
 
1 #include <math.h>
2 void quad(int length, float *a, float *b,
3 float *c, float *restrict x1, float *restrict x2) {
4 for (int i=0; i<length; i++) {
5 float s = b[i]*b[i] - 4*a[i]*c[i];
6 if ( s >= 0 ) {// skipped in SIMD, we compute everything
7 s = sqrt(s) ;
8 x2[i] = (-b[i]+s)/(2.*a[i]);
9 x1[i] = (-b[i]-s)/(2.*a[i]);
10 } else { // and this is then masked
11 x2[i] = 0.; // make sure you understand how
12 x1[i] = 0.; // this can be done
13 }
14 }
15 }
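
A sketch (not from the slides) of how such a masked assignment can be expressed with SSE intrinsics for four elements at a time; quad_x2_vec is a hypothetical helper that returns the x2 values, with 0.0 in the lanes where s < 0.

#include <xmmintrin.h>

/* Computes x2 for four elements at once, masked as described above. */
static __m128 quad_x2_vec(__m128 va, __m128 vb, __m128 vc) {
    __m128 s    = _mm_sub_ps(_mm_mul_ps(vb, vb),
                             _mm_mul_ps(_mm_set1_ps(4.0f), _mm_mul_ps(va, vc)));
    __m128 mask = _mm_cmpge_ps(s, _mm_setzero_ps());       /* all-ones where s >= 0     */
    __m128 root = _mm_div_ps(_mm_sub_ps(_mm_sqrt_ps(s), vb),
                             _mm_mul_ps(_mm_set1_ps(2.0f), va));
    return _mm_and_ps(mask, root);                          /* 0.0 where the mask is 0   */
}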
 
ELEC-H-473 Th06 55/65
4. Nested loops must be independent

• There must not be loop carried dependence – when code in a


loop iteration depends on the output of previous loop iteration
. This is a general problem in loop parallelization
 
1 // True even in a single loop, look at example below
2 // 1st iteration: a[2]=a[1];
3 // 2nd iteration: a[4]=a[2] -> uses a[2] modified previously
4 for (i = 1; i < N; i++)
5 a[2*i] = a[i];
6
7 // Both indexes will create the above situation
8 for (i = 0; i < N; i++)
9 for (j = 0; j < N; j++)
10 a[i+1][j-2] = a[i][j] + 1;
 

ELEC-H-473 Th06 56/65


5. No function calls
• It is not possible to call other functions from the loop; even a
printf will make a loop non vectorizable
 
1 for (i = 0; i < N; i++) {
2 c[i] = a[i] * b[i];
3 printf("%d",c[i]);
4 }
 
• In the above case you may be prompted with a message:
nonstandard loop is not a vectorization candidate
• Exceptions:
. Intrinsic math functions (they are already SIMD)
. In-line functions – these are “hard” copies of functions inserted
directly in the compiled code; for such functions there will be no
function calls and push/pop → so no function call overheads

ELEC-H-473 Th06 57/65


Obstacles to vectorization 1/4
1. Non-contiguous Memory Accesses – SIMD vectorization killer
. 64 or 128 bits can be loaded directly from memory in a single SSE
instruction only if they are adjacent; and this is efficient
. If not, we need multiple load instructions (slows down everything)
. Examples: non-unit step or indirect memory access; compiler rarely
vectorizes such loops, unless the amount of computational work is
large compared to the overhead from non-contiguous memory access
 
1 // arrays accessed with step 2
2 for (int i=0; i<SIZE; i+=2)
3 b[i] += a[i] * x[i];
4 // Inner loop accesses array a with inverted indexes
5 // This will also guarantee cache miss if SIZE is big
6 for (int j=0; j<SIZE; j++) {
7 for (int i=0; i<SIZE; i++)
8 b[i] += a[i][j] * x[j]; // i,j indexes inverted
9 }
10 // Indirect addressing of x using index array
11 for (int i=0; i<SIZE; i+=2)
12 b[i] += a[i] * x[index[i]]; // x[index[i]]
 
ELEC-H-473 Th06 58/65
Obstacles to vectorization 2/4
2. Data dependencies – vectorization will change execution order, so
it can do so only if it preserves results of computations!
Simplest case is when data elements that are written do not
appear in any other iteration; all the iterations of the original loop
are independent of each other, and can be executed in any order,
without changing the result
More generally we can have four options depending on Read or
Write order (this is known from pipeline hazards)
2.1 Read-after-Write – next step uses result from the previous one:
 
1 A[0]=0;
2 for (j=1; j<MAX; j++)
3 A[j]=A[j-1]+1;
4 // Loop moves into this direction -->
5 // A[1]=A[0]+1; A[2]=A[1]+1; A[3]=A[2]+1; A[4]=A[3]+1;
6 // ^_________________^
7 // ^_________________^
8 // ...
 

ELEC-H-473 Th06 59/65


Obstacles to vectorization 3/4
2.2 Write-after-Read – When a variable is read in one iteration and
written in a subsequent iteration, this is a write-after-read
dependency (AKA anti-dependency):
 
1 for (j=1; j<MAX; j++)
2 A[j-1]=A[j]+1;
3 // A[0]=A[1]+1; A[1]=A[2]+1; A[2]=A[3]+1; A[3]=A[4]+1;
4 // ^---> written after being used
 
. Above is not safe for general parallel execution, since write may occur
before the read, if we have OoO execution for example
. However vectorization is safe since no iteration with a higher value of j
can complete before an iteration with a lower value of j
. Following may not be safe: vectorization might cause some elements of A
to be overwritten by the 1st before being used in 2nd SIMD instruction
 
1 for (j=1; j<MAX; j++) {
2 A[j-1]=A[j]+1;
3 B[j]=A[j]*2; // this one is problematic
4 }
5 // A[0]=A[1]+1; A[1]=A[2]+1; A[2]=A[3]+1; A[3]=A[4]+1;
6 // B[1]=A[1]*2 ---^
7 // ---> A[1] may be written after being used 2nd SIMD instruction
 
ELEC-H-473 Th06 60/65
Obstacles to vectorization 4/4
2.3 Read-after-Read – not really a dependency and will not prevent
vectorization; if a variable is not written, it doesn’t matter how often
it is read
2.4 Write-after-Write – the same variable is written to in more than one
iteration; this is in general unsafe for parallel execution!
1 // This is ok because we can accumulate the sum
2 sum=0;
3 for (j=1; j<MAX; j++)
4 sum = sum + A[j]*B[j];
5 // Write the SIMD code yourself (after next session)
6
7 // Anti-example: typical case of vectorisation
8 for (i = 0; i < size; i++) {
9 c[i] = a[i] * b[i];
10 // The above will work only if c, a and b
11 // are non-overlapping pointers
12 // If c, a and b are possibly overlapping
13 // compiler will not be able to vectorise
14 // you need to supply a hint; see next slide
15 }
 
ELEC-H-473 Th06 61/65
Pragmas – or how to help compiler vectorize
• Compilers can’t make miracles ... the above examples show the
degree of complexity they have to deal with
• No compiler can solve all problems by itself: welcome to pragma
directives – user-specified hints to improve vectorization and
minimize compile time; why do these two relate?
• By inserting #pragma ivdep before the loop; compiler will know
that it can safely ignore any potential data dependencies
 
1 #pragma ivdep // we know what we are doing
2 for (i = 0; i < size; i++) {
3 c[i] = a[i] * b[i];
4 // The above will work !
5 // a, b, c non overlapping pointers!
6 // i.e. data is independent, compiler can safely ignore
7 }
 
• The compiler will still not ignore proven dependencies
• Using this pragma when there are in fact dependencies may lead
to incorrect results; can you suggest an example? (one sketch below)
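One possible answer, as a sketch (not from the slides): here c genuinely aliases a, so there IS a read-after-write dependence across iterations; with #pragma ivdep the vectorized loop reads stale values of a and produces a different result than the scalar loop.

void bad_ivdep(float *a, int size) {
    float *c = a + 1;              /* c[i] is the same storage as a[i+1]  */
#pragma ivdep                      /* wrong hint: the dependence is real  */
    for (int i = 0; i < size - 1; i++)
        c[i] = a[i] * 2.0f;        /* scalar: chained doubling; SIMD: not */
}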
ELEC-H-473 Th06 62/65
Examples of other hints
• #pragma loop count (n) – used to advise the compiler whether the
loop is worthwhile considering for SIMD optimization; note that
it is not efficient to vectorize small loops
• #pragma vector – forces compiler to vectorize the loop if it is
safe to do so, whether or not the compiler thinks that the
vectorization will improve performance
• #pragma vector align – asserts that data within the
following loop is aligned; generally to 16 byte boundary, for SSE
instruction sets
• #pragma novector – asks the compiler not to vectorize a loop;
you will use this when you know that it is not worth trying (and
you want to save compile time)
• #pragma vector nontemporal – gives a hint to the compiler
that data will not be reused, and therefore to use streaming stores
that bypass cache and accelerate memory accesses
ELEC-H-473 Th06 63/65
Streaming stores
• In vector processing the input data is intended for one-time usage: once a
part of the vector is processed, we do not need it any more
• As opposed to temporal data – data that will be used again (& the
reason why we have caches) – such one-time data is called non-temporal
• Keeping non-temporal data in the cache too would make no
sense (we speak of cache pollution)
• This motivates so called non-temporal streaming stores
• We have a set of Streaming Load/Save Buffers close to CPU (just like
L1); fast memory, accessed directly from the main memory &
bypassing cache hierarchy (we can do that since data is used once)
• If data access can be anticipated early enough, these buffers provide
continuous stream of data to CPU with improved bandwidth
• Streaming stores performed using non-temporal move instructions:
MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, MOVNTPD (for ref. only)
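A sketch (not from the slides) using the SSE intrinsic counterpart of MOVNTPS; dst is assumed 16-byte aligned and n a multiple of 4, and the final sfence makes the streaming stores globally visible before the function returns.

#include <xmmintrin.h>

/* Copy n floats without polluting the cache with the destination data. */
void copy_stream(float *dst, const float *src, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i);   /* normal (cached) load            */
        _mm_stream_ps(dst + i, v);         /* non-temporal store -> MOVNTPS   */
    }
    _mm_sfence();                          /* order the non-temporal stores   */
}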

ELEC-H-473 Th06 64/65


Some references
Some pointers to explore optimization aspects more in details:
• Intel 64 & IA-32 Architectures Optimization Reference Manual –
general optimization guidelines, good even if you program in
high-level language such as C; makes a link between SW & HW
(it is Intel specific, but some tricks applicable to AMD)
• Intel SSE4 Programming Reference – provides details about SIMD
programming and execution with SSE; you can assume SSE4
being widely supported
• Developer Guide for Intel Integrated Performance Primitives –
there is a collection of documents covering different application
domains from 1D to 2D and security; enumerates all functions,
their interfaces and how to use them

And of course there is much, much more ... you have to search

ELEC-H-473 Th06 65/65
