ELEC-H-473 Th06
5. Memory alignment
6 hex digits (24 bits out of 32) on the left are not used!
How inefficient is this?
• Consider the following program & assume a more realistic 64-bit ALU:
unsigned char A[1000000], B[1000000], C[1000000];
int i;
for (i = 0; i < 1000000; i++) {
    C[i] = A[i] + B[i];
}
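For comparison, a minimal sketch (not from the slides) of the same loop written with SSE2 intrinsics, which are introduced later in these slides; it assumes the arrays are 16-byte aligned (1 000 000 is a multiple of 16, so no scalar tail loop is needed):

#include <emmintrin.h>                       // SSE2 intrinsics

void add_bytes(unsigned char *A, unsigned char *B, unsigned char *C, int n) {
    for (int i = 0; i < n; i += 16) {        // 16 bytes per iteration
        __m128i a = _mm_load_si128((__m128i *)&A[i]);   // aligned 128-bit load
        __m128i b = _mm_load_si128((__m128i *)&B[i]);
        _mm_store_si128((__m128i *)&C[i], _mm_add_epi8(a, b));  // 16 byte-adds at once
    }
}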
Intel proposes the following chart to see if SIMD is for you:
• Do we need to speed up our code? If yes → identify hot spots in code
[Figure: a 64-bit register viewed as a vector – 1 × 8 bytes (FP), 2 × 4 bytes, 4 × 2 bytes, or 8 × 1 byte]
[Figure: packed byte addition – mov mm0, [edx] loads the bytes (8 7 6 5 4 3 2 1), mov mm1, [ebx] loads the bytes (7 6 5 4 3 2 1 0), and paddb mm1, mm0 adds all eight pairs at once, giving (15 13 11 9 7 5 3 1)]
[Figure: packed word addition – mov mm0, [edx] loads the 16-bit words (2055 1541 1027 513), mov mm1, [ebx] loads (1798 1284 770 256), and paddw mm1, mm0 gives (3853 2825 1797 769)]
[Figure: 32-bit EAX register, bit positions 31 down to 0]
Up to 3× performance impact → this is significant!
1 Source: Performance Impact of Misaligned Accesses in SIMD Extensions
Memory alignment & caches
• Assume a 64-byte wide cache line and 16-byte address alignment; 4 data blocks form 1 cache line; 1 cell below is 4 bytes wide
• If the data is accessed at a 64-byte boundary, the cache is well used, because the read operation starts at the boundary and can use all 64 bytes
• If not, then two cache-line reads are needed to access the data – a cache-line split; this will hurt performance
[Figure: accesses relative to a 16-byte boundary – 4 B aligned, split by 4 bytes, split by 8 bytes, split by 12 bytes]
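To avoid such splits the data can be allocated on an aligned address; a minimal sketch (not from the slides), using the _mm_malloc/_mm_free helpers that GCC, Clang and the Intel compiler expose through xmmintrin.h (MSVC declares them in malloc.h):

#include <xmmintrin.h>                        // _mm_malloc / _mm_free

float *alloc_floats(int n) {
    // the start address is a multiple of 64, so an access that starts at the
    // beginning of the buffer never straddles a 64-byte cache line
    return (float *)_mm_malloc(n * sizeof(float), 64);
}
// release with _mm_free(p), not free(p)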
[Figure (Intel): Hand-Coded Assembly and High-Level Compiler Performance Trade-offs – Assembly, Intrinsics, Automatic Vectorization and C/C++/Fortran plotted against Performance vs. Ease of Programming/Portability]
As usual → no free lunch: assembly gives the best performance but the worst ease of use; automated vectorization can reach good results, but poor ones too! (the other solutions sit in between)
The examples that follow illustrate the use of coding adjustments to enable the algorithm to benefit from the SSE. The same techniques may be used for single-precision floating-point, double-precision floating-point, and integer data under SSSE3, SSE3, ...
a) Assembly & in-line assembly
• You could use assembly & a compiler, but a more user-friendly solution is in-line assembly → insert assembly at any point in the C/C++ code
• Example – simple loop in C:
void add(float *a, float *b, float *c) {
    for (int i = 0; i < 4; i++)
        c[i] = a[i] + b[i];
}
which becomes, using in-line SIMD assembly:
void add(float *a, float *b, float *c) {
    __asm {
        mov eax, a
        mov edx, b
        mov ecx, c
        movaps xmm0, xmmword ptr [eax]   // note the new move instructions
        addps  xmm0, xmmword ptr [edx]   // note the new processing instructions
        movaps xmmword ptr [ecx], xmm0   // pointers must be aligned!!!
    }
}
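If the pointers cannot be guaranteed to be 16-byte aligned, a variant (a sketch, not from the slides, using the same MSVC __asm syntax) can use the unaligned moves instead, at some cost in performance:

void add_u(float *a, float *b, float *c) {
    __asm {
        mov eax, a
        mov edx, b
        mov ecx, c
        movups xmm0, xmmword ptr [eax]   // unaligned load
        movups xmm1, xmmword ptr [edx]
        addps  xmm0, xmm1
        movups xmmword ptr [ecx], xmm0   // unaligned store
    }
}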
• This is what we are going to do ...
b) Intrinsics
• Intrinsics – predefined C functions that map directly to assembly instructions for easier use (no need to write assembly, the compiler takes care of it)
• Performance can be as good as assembly, but they are easier to write
• Portable across different compilers (as long as they support the intrinsics)
void add(float *a, float *b, float *c) {
    for (int i = 0; i < SIZE; i++) {   // loop in C
        c[i] = a[i] + b[i];
    }
}
#include <xmmintrin.h>
void add(float *a, float *b, float *c) {
    __m128 t0, t1;            // XMM 128-bit registers
    t0 = _mm_load_ps(a);      // load 128 bits of data
    t1 = _mm_load_ps(b);
    t0 = _mm_add_ps(t0, t1);  // SIMD add
    _mm_store_ps(c, t0);
}
Note the xmmintrin.h header file! Why do we need it?
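Note that the intrinsic version above only processes the first 4 floats; a minimal sketch (not from the slides) of the full loop, assuming SIZE is a multiple of 4 and the pointers are 16-byte aligned:

#include <xmmintrin.h>

void add_all(float *a, float *b, float *c) {
    for (int i = 0; i < SIZE; i += 4) {              // 4 floats per iteration
        __m128 t0 = _mm_load_ps(&a[i]);
        __m128 t1 = _mm_load_ps(&b[i]);
        _mm_store_ps(&c[i], _mm_add_ps(t0, t1));     // c[i..i+3] = a[i..i+3] + b[i..i+3]
    }
}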
Practical assembly & intrinsics
• We have seen that the SIMD instruction sets become richer with every new extension; old instructions are kept in the ISA for backward SW compatibility
• If in the past you missed a function & racked your brains to figure out a vector (SIMD) algorithm for it (personal experience), in the next SIMD generation this function might appear as a single instruction with appropriate HW support
• There is a chance that such HW will further improve the SW performance & not just ease code writing
• Advice → you need to stay up to date with the most recent developments: you need to read the documentation
• Intel is very good at producing documents, assuming you make the effort; there is a lot of documentation, with many, many pages and loads of useful information; soft skill: how to browse documentation quickly & find what you are looking for
c) Intel Integrated Perf. Primitives – IPPs (2019)
• What is it?
A highly optimised SW library that provides a comprehensive set of application-domain-specific plug-in functions to outperform any compiler-specific optimisation; it uses the Streaming SIMD Extensions (SSE), Advanced Vector Extensions 2 (AVX2), and Advanced Vector Extensions 512 (AVX-512) instruction sets and targets different Intel processors: Atom, Core, & Xeon
• How to use it?
With Intel Parallel Studio XE and System Studio (Intel IDEs)
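For a feel of the API, a minimal sketch (not from the slides) of the same vector addition using the IPP signal-processing domain; ippsAdd_32f is an IPP function, while the wrapper name and arguments here are just placeholders:

#include <ipps.h>                            // IPP signal-processing domain

void add_ipp(const float *a, const float *b, float *c, int n) {
    // a single call replaces the whole loop; IPP dispatches internally
    // to the best SSE/AVX2/AVX-512 code path for the CPU it runs on
    ippsAdd_32f(a, b, c, n);
}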
And of course there is much, much more ... you have to search