Arm Cortex-X2 Core Software Optimization Guide
Revision: r2p0
Copyright © 2021 Arm Limited (or its affiliates). All rights reserved.
Release information
Document history
Issue Date Confidentiality Change
Your access to the information in this document is conditional upon your acceptance that you will not use
or permit others to use the information for the purposes of determining whether implementations infringe
any third party patents.
THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES,
EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR
PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, Arm makes no representation
with respect to, has undertaken no analysis to identify or understand the scope and content of, patents,
copyrights, trade secrets, or other rights.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR
CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY,
ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY
OF SUCH DAMAGES.
This document consists solely of commercial items. You shall be responsible for ensuring that any use,
duplication or disclosure of this document complies fully with any relevant export laws and regulations to
assure that this document or any portion thereof is not exported, directly or indirectly, in violation of such
export laws. Use of the word “partner” in reference to Arm's customers is not intended to create or refer to
any partnership relationship with any other company. Arm may make changes to this document at any
time and without notice.
Non-Confidential
Arm® Cortex®-X2 Core Software Optimization Guide PJDOC-466751330-14955
Issue 4.0
This document may be translated into other languages for convenience, and you agree that if there is any
conflict between the English version of this document and any translation, the terms of the English version
of the Agreement shall prevail.
The Arm corporate logo and words marked with ® or ™ are registered trademarks or trademarks of Arm
Limited (or its affiliates) in the US and/or elsewhere. All rights reserved. Other brands and names
mentioned in this document may be the trademarks of their respective owners. Please follow Arm's
trademark usage guidelines at https://www.arm.com/company/policies/trademarks.
(LES-PRE-20349)
Confidentiality Status
This document is Non-Confidential. The right to use, copy and disclose this document may be subject to
license restrictions in accordance with the terms of the agreement entered into by Arm and the party that
Arm delivered this document to.
Product Status
The information in this document is final, that is for a developed product.
Web Address
developer.arm.com
This document includes terms that can be offensive. We will replace these terms in a future issue of this
document. If you find offensive terms in this document, please email terms@arm.com.
Contents
1 Introduction
1.1 Product revision status
1.2 Intended audience
1.3 Scope
1.4 Conventions
1.4.1 Glossary
1.4.2 Terms and abbreviations
1.4.3 Typographical conventions
1.5 Additional reading
1.6 Feedback
1.6.1 Feedback on this product
1.6.2 Feedback on content
2 Overview
2.1 Pipeline overview
1 Introduction
1.1 Product revision status
The rmpn identifier indicates the revision status of the product described in this book, for
example, r1p2, where:
rm Identifies the major revision of the product, for example, r1.
pn Identifies the minor revision or modification status of the product, for example, p2.
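As a small illustration of the rmpn format, the sketch below splits an identifier such as r1p2 into its two numeric parts. The helper name parse_revision is hypothetical, and treating the number after "r" as the major revision is an assumption based on the usual Arm rmpn convention.

```python
import re

# Hypothetical helper, not part of any Arm tool: "r<digits>p<digits>".
_REVISION = re.compile(r"r(\d+)p(\d+)")

def parse_revision(text):
    """Return (major, minor) for strings such as 'r1p2'; raise ValueError otherwise."""
    match = _REVISION.fullmatch(text)
    if match is None:
        raise ValueError(f"not an rmpn identifier: {text!r}")
    return int(match.group(1)), int(match.group(2))

print(parse_revision("r1p2"))  # (1, 2)
print(parse_revision("r2p0"))  # (2, 0)
```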
1.3 Scope
This document describes aspects of the Cortex-X2 core micro-architecture that influence
software performance. Micro-architectural detail is limited to that which is useful for software
optimization.
Documentation extends only to software visible behavior of the Cortex-X2 core and not to the
hardware rationale behind the behavior.
1.4 Conventions
The following subsections describe conventions used in Arm documents.
1.4.1 Glossary
The Arm Glossary is a list of terms used in Arm documentation, together with definitions for
those terms. The Arm Glossary does not contain terms that are industry standard unless the Arm
meaning differs from the generally accepted meaning.
MOP Macro-OPeration
µOP Micro-OPeration
FP Floating-point
bold Highlights interface elements, such as menu names. Denotes signal names. Also used
for terms in descriptive lists, where appropriate.
monospace Denotes text that you can enter at the keyboard, such as commands, file and program
names, and source code.
monospace bold Denotes language keywords when used outside example code.
monospace underline Denotes a permitted abbreviation for a command or option. You can enter the
underlined text instead of the full command or option name.
<and> Encloses replaceable terms for assembler syntax where they appear in code or code
fragments.
For example:
MRC p15, 0, <Rd>, <CRn>, <CRm>, <Opcode_2>
SMALL CAPITALS Used in body text for a few terms that have specific technical meanings, that are
defined in the Arm® Glossary. For example, IMPLEMENTATION DEFINED, IMPLEMENTATION
SPECIFIC, UNKNOWN, and UNPREDICTABLE.
Caution This represents a recommendation which, if not followed, might lead to system failure
or damage.
Warning This represents a requirement for the system that, if not followed, might result in
system failure or damage.
Danger This represents a requirement for the system that, if not followed, will result in system
failure or damage.
Tip This represents a useful tip that might make it easier, better or faster to perform a task.
Remember This is a reminder of something important that relates to the information you are
reading.
1.6 Feedback
Arm welcomes feedback on this product and its documentation.
Arm tests the PDF only in Adobe Acrobat and Acrobat Reader and cannot guarantee the quality
of the represented document when used with any other PDF reader.
2 Overview
The Cortex-X2 core is a high-performance and low-power product that implements the Armv9.0-A
architecture and supports all previous Armv8-A architectures up to Armv8.5-A. It targets large
screen compute applications.
This document describes elements of the Cortex-X2 core micro-architecture that influence
software performance so that software and compilers can be optimized accordingly.
[Figure: pipeline diagram. Recoverable labels: Issue; Branch 0, Branch 1; Integer Single-Cycle 0;
FP/ASIMD 0, FP/ASIMD 1, FP/ASIMD 2, FP/ASIMD 3; Load/Store 0, Load/Store 1, Load 2;
Store data 0, Store data 1]
The execution pipelines support different types of operations, as shown in the following table.
Instruction groups              Instructions
Branch 0/1                      Branch µOPs
Integer Single/Multi-cycle 0/1  Integer shift-ALU, multiply, divide, CRC and sum-of-absolute-differences µOPs
Load/Store 0/1                  Load, Store address generation and special memory µOPs
FP/ASIMD-0                      ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add,
                                FP multiply, FP divide, FP sqrt, crypto µOPs, store data µOPs
FP/ASIMD-1                      ASIMD ALU, ASIMD misc, FP misc, FP add, FP multiply, ASIMD shift µOPs,
                                store data µOPs, crypto µOPs
FP/ASIMD-2                      ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add,
                                FP multiply, FP divide, FP sqrt, crypto µOPs
FP/ASIMD-3                      ASIMD ALU, ASIMD misc, FP misc, FP add, FP multiply, ASIMD shift µOPs,
                                crypto µOPs
3 Instruction characteristics
3.1 Instruction tables
This chapter describes high-level performance characteristics for most Armv9-A instructions. A
series of tables summarizes the effective execution latency and throughput (instruction bandwidth
per cycle), pipelines utilized, and special behaviors associated with each group of instructions.
Utilized pipelines correspond to the execution pipelines described in chapter 2.
In the tables below, Exec Latency is defined as the minimum latency seen by an operation
dependent on an instruction in the described group.
In the tables below, Execution Throughput is defined as the maximum throughput (in
instructions per cycle) of the specified instruction group that can be achieved in the entirety of
the Cortex-X2 microarchitecture.
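The two definitions above give two different lower bounds on the cycle count of a code sequence. The following is a minimal sketch of that arithmetic, not a model of the core: the function names are hypothetical, and the example figures (latency 2, throughput 4) are the FNEG values from the FP table in this chapter.

```python
from fractions import Fraction

def min_issue_cycles_independent(count, throughput):
    """Lower bound on cycles to issue `count` independent instructions.

    `throughput` is in instructions per cycle and may be fractional, e.g. Fraction(1, 9).
    """
    # Ceiling division on exact fractions avoids floating-point rounding.
    return -(-Fraction(count) // Fraction(throughput))

def min_cycles_dependent_chain(count, latency):
    """Lower bound for a serial chain where each instruction consumes the previous result."""
    return count * latency

# Eight independent FNEGs are limited by throughput, eight chained FNEGs by latency.
print(min_issue_cycles_independent(8, 4))   # 2
print(min_cycles_dependent_chain(8, 2))     # 16
```

The gap between the two figures is why interleaving independent dependency chains tends to help: throughput only becomes the limit once enough independent work is available to hide the latency.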
Branch 0/1 B
Integer multicycle 0 M0
Load/Store 01 L01
FP/ASIMD 0/1/2/3 V
FP/ASIMD 0 V0
FP/ASIMD 1 V1
Branch, immed B 1 2 B -
Notes:
1. The latency is 2, throughput is 2 and utilized pipeline is M when GCR_EL1.RRND = 1. When GCR_EL1.RRND = 0,
latency is 3, throughput is 1 and pipeline utilized is M0.
Notes:
1. Integer divides are performed using an iterative algorithm and block any subsequent divide operations until
complete. Early termination is possible, depending upon the data values.
2. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical
sequence of multiply-accumulate µOPs to issue one every N cycles (accumulate latency N shown in parentheses).
Accumulator forwarding is not supported for consumers of 64-bit multiply high operations.
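The effect of late-forwarding on a chain of accumulating µOPs can be sketched with simple arithmetic. The function names below are hypothetical, and the latencies in the example (execution latency 4, accumulate latency 2) are placeholders chosen for illustration, not values taken from the tables.

```python
def mac_chain_cycles(count, exec_latency, accumulate_latency):
    """Cycles until the last result of `count` chained multiply-accumulates is available.

    With late-forwarding, each later µOP can issue `accumulate_latency` cycles after
    the previous one; only the final µOP pays the full execution latency.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return (count - 1) * accumulate_latency + exec_latency

def chain_cycles_without_forwarding(count, exec_latency):
    """Same chain if every µOP had to wait for the full result of its predecessor."""
    return count * exec_latency

# Placeholder latencies: exec_latency=4, accumulate_latency=2.
print(mac_chain_cycles(8, 4, 2))               # 18
print(chain_cycles_without_forwarding(8, 4))   # 32
```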
FP compare    FCCMP{E}, FCMP{E}    2    1    V0    -
FP negate     FNEG                 2    4    V     -
FP select     FCSEL                2    4    V     -
Notes:
1. FP divide and square root operations are performed using an iterative algorithm and block subsequent similar
operations to the same pipeline until complete.
2. FP multiply-accumulate pipelines support late forwarding of the result from FP multiply µOPs to the accumulate
operands of an FP multiply-accumulate µOP. The latter can potentially be issued 1 cycle after the FP multiply µOP has
been issued.
3. FP multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a
typical sequence of multiply-accumulate µOPs to issue one every N cycles (accumulate latency N shown in
parentheses).
Notes:
1. Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical
sequence of integer multiply-accumulate µOPs to issue one every cycle or one every other cycle (accumulate latency
shown in parentheses).
2. Other accumulate pipelines also support late-forwarding of accumulate operands from similar µOPs, allowing a
typical sequence of such µOPs to issue one every cycle (accumulate latency shown in parentheses).
3. This category includes instructions of the form “PMULL Vd.8H, Vn.8B, Vm.8B” and “PMULL2 Vd.8H, Vn.16B, Vm.16B”.
Notes:
1. ASIMD multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing
a typical sequence of floating-point multiply-accumulate µOPs to issue one every N cycles (accumulate latency N
shown in parentheses).
2. ASIMD multiply-accumulate pipelines support late forwarding of the result from ASIMD FP multiply µOPs to the
accumulate operands of an ASIMD FP multiply-accumulate µOP. The latter can potentially be issued 1 cycle after the
ASIMD FP multiply µOP has been issued.
3. ASIMD divide and square root operations are performed using an iterative algorithm and block subsequent similar
operations to the same pipeline until complete.
Notes:
1. Writeback forms of load instructions require an extra µOP to update the base address. This update is typically
performed in parallel with the load µOP (update latency shown in parentheses).
Notes:
1. Writeback forms of store instructions require an extra µOP to update the base address. This update is typically
performed in parallel with the store µOP (update latency shown in parentheses).
3.23 CRC
Table 3-22 AArch64 CRC
Instruction Group    AArch64 Instructions    Exec Latency    Execution Throughput    Utilized Pipelines    Notes
Notes:
1. CRC execution supports late forwarding of the result from a producer µOP to a consumer µOP. This results in a 1
cycle reduction in latency as seen by the consumer.
Notes:
1. When the governing predicate is the same as destination, the latency is increased by one cycle.
Extract EXT 2 4 V -
Notes:
1. When the governing predicate is the same as destination, the latency is increased by one cycle.
2. SVE accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a typical
sequence of such µOPs to issue one every N cycles (accumulate latency N shown in parentheses).
3. SVE integer divide operations are performed using an iterative algorithm and block subsequent similar operations
to the same pipeline until complete.
4. Same as note 2, except that saturating instructions require an extra cycle of latency for late-forwarding of
accumulate operands.
5. If the consuming instruction has a flag source, the latency for this instruction is 4 cycles.
Notes:
1. SVE multiply-accumulate pipelines support late-forwarding of accumulate operands from similar µOPs, allowing a
typical sequence of floating-point multiply-accumulate µOPs to issue one every N cycles (accumulate latency N shown
in parentheses).
2. SVE divide and square root operations are performed using an iterative algorithm and block subsequent similar
operations to the same pipeline until complete.
Notes:
1. SVE pipelines that execute these instructions support late-forwarding of accumulate operands from similar µOPs,
allowing a typical sequence of µOPs to issue one every N cycles (accumulate latency N shown in parentheses).
Contiguous store four structures from four vectors, scalar + imm      ST4B, ST4D, ST4H, ST4W    11    1/9    L01, V01       -
Contiguous store four structures from four vectors, scalar + scalar   ST4B, ST4D, ST4W          11    1/9    L01, S, V01    -
Scatter store vector + imm 32-bit element size                        ST1B, ST1H, ST1W          4     1/2    L01, V01       -
Scatter store vector + imm 64-bit element size                        ST1B, ST1D, ST1H, ST1W    2     1      L01, V01       -
Scatter store, 32-bit scaled offset                                   ST1H, ST1W                4     1/2    L01, V01       -
Scatter store, 32-bit unscaled offset                                 ST1B, ST1H, ST1W          4     1/2    L01, V01       -
Scatter store, 64-bit scaled offset                                   ST1D, ST1H, ST1W          2     1      L01, V01       -
Notes:
1. When destination is same as the governing predicate, the latency of the instruction increases by one cycle.
4 Special considerations
4.1 Dispatch constraints
Dispatch of µOPs from the in-order portion to the out-of-order portion of the microarchitecture
includes several constraints. It is important to consider these constraints during code generation
to maximize the effective dispatch bandwidth and subsequent execution bandwidth of Cortex-
X2.
The dispatch stage can process up to 8 MOPs per cycle and dispatch up to 16 µOPs per cycle,
with the following limitations on the number of µOPs of each type that may be simultaneously
dispatched.
In the event there are more µOPs available to be dispatched in a given cycle than can be
supported by the constraints above, µOPs will be dispatched in oldest to youngest age-order to
the extent allowed by the above.
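The age-ordered behavior described above can be sketched as a one-cycle dispatch model. This is only an illustration: the function name is hypothetical, the per-type limits in the example are placeholders rather than Cortex-X2 figures, the 8 MOP per cycle limit is not modeled, and the model assumes that a µOP that cannot be dispatched also holds back all younger µOPs in that cycle.

```python
def dispatch_one_cycle(pending, type_limits, total_limit=16):
    """Return how many µOPs dispatch this cycle.

    pending:     list of µOP type names, oldest first.
    type_limits: per-type maximum per cycle (placeholder values, see lead-in).
    total_limit: overall µOPs per cycle (16, as stated in the text).
    """
    used = {}
    dispatched = 0
    for uop_type in pending:
        if dispatched == total_limit:
            break
        limit = type_limits.get(uop_type, total_limit)
        if used.get(uop_type, 0) == limit:
            # Assumption: dispatch is in order, so a blocked µOP also
            # blocks everything younger than it until the next cycle.
            break
        used[uop_type] = used.get(uop_type, 0) + 1
        dispatched += 1
    return dispatched

# Placeholder limits: at most 3 "load" and 2 "branch" µOPs per cycle.
limits = {"load": 3, "branch": 2}
print(dispatch_one_cycle(["load", "load", "branch", "load", "load", "alu"], limits))  # 4
```

In this example the fourth load exceeds the placeholder load limit, so it and the younger "alu" µOP wait for the next cycle. Mixing µOP types in the instruction stream keeps any single limit from becoming the bottleneck.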
Unroll the loop to include multiple load and store operations per iteration, minimizing the
overheads of looping.
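Use non-writeback forms of LDP and STP instructions, interleaving them as shown in the
example below:
Loop_start:
SUBS x2,x2,#96          // x2 = bytes remaining; sets the flags tested by B.GT
LDP q3,q4,[x1,#0]       // load 32 bytes from the source (x1)
STP q3,q4,[x0,#0]       // store them to the destination (x0)
LDP q3,q4,[x1,#32]
STP q3,q4,[x0,#32]
LDP q3,q4,[x1,#64]
STP q3,q4,[x0,#64]
ADD x1,x1,#96           // advance the source pointer
ADD x0,x0,#96           // advance the destination pointer
B.GT Loop_start         // repeat while x2 > 0
If the memory locations being copied are non-cacheable, the non-temporal form of the load pair
(LDNP with Q registers) should be used in place of LDP. STP with Q registers should still be used
for the stores.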
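The control flow of the copy loop above has two consequences that are easy to miss: the body always runs at least once, and every iteration moves exactly 96 bytes. The following Python model is a sketch of the loop's bookkeeping only (it says nothing about performance); the function name copy_loop is hypothetical.

```python
def copy_loop(src, dst, length):
    """Mirror the LDP/STP loop: 96 bytes per iteration, counter tested after the subtract.

    Returns the number of iterations executed.
    """
    x0 = x1 = 0        # destination and source offsets (x0 and x1 in the listing)
    x2 = length        # remaining byte count (x2 in the listing)
    iterations = 0
    while True:
        x2 -= 96                                  # SUBS x2,x2,#96
        for offset in (0, 32, 64):                # three LDP/STP pairs, 32 bytes each
            dst[x0 + offset:x0 + offset + 32] = src[x1 + offset:x1 + offset + 32]
        x1 += 96                                  # ADD x1,x1,#96
        x0 += 96                                  # ADD x0,x0,#96
        iterations += 1
        if x2 <= 0:                               # B.GT falls through once x2 <= 0
            break
    return iterations

src = bytes(range(192))
dst = bytearray(192)
print(copy_loop(src, dst, 192), bytes(dst) == src)   # 2 True
print(copy_loop(src, bytearray(192), 100))           # 2
```

The second call shows that a length of 100 still copies two full 96-byte blocks, so a caller of the assembly loop needs a length that is a positive multiple of 96, or separate code for the tail.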
Unroll the loop to include multiple store operations per iteration, minimizing the overheads of
looping.
Loop_start:
STP q1,q3,[x0,#0]
STP q1,q3,[x0,#0x20]
STP q1,q3,[x0,#0x40]
STP q1,q3,[x0,#0x60]
ADD x0,x0,#0x80
SUBS x2,x2,#0x80
B.GT Loop_start
To achieve maximum performance on memset to zero, it is recommended that one use DC ZVA
instead of STP. An optimal routine might look something like the following.
Loop_start:
SUBS x2,x2,#0x80
DC ZVA,x0
ADD x0,x0,#0x40
DC ZVA,x0
ADD x0,x0,#0x40
B.GT Loop_start
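The DC ZVA loop above depends on two preconditions that the listing does not state. The sketch below models them in Python under the assumption of a 64-byte zeroing block, which is what the #0x40 stride implies (on real hardware the block size is reported by DCZID_EL0). The names ZVA_BLOCK, dc_zva and zero_loop are hypothetical.

```python
ZVA_BLOCK = 64  # assumed block size in bytes, implied by the #0x40 stride

def dc_zva(memory, address):
    """Zero the naturally aligned ZVA_BLOCK-byte block that contains `address`."""
    base = address - (address % ZVA_BLOCK)
    memory[base:base + ZVA_BLOCK] = bytes(ZVA_BLOCK)

def zero_loop(memory, start, length):
    """Mirror the listing: two DC ZVA per iteration, 0x80 bytes per iteration."""
    x0, x2 = start, length
    while True:
        x2 -= 0x80              # SUBS x2,x2,#0x80
        dc_zva(memory, x0)      # DC ZVA,x0
        x0 += 0x40
        dc_zva(memory, x0)
        x0 += 0x40
        if x2 <= 0:             # B.GT falls through once x2 <= 0
            break
    return x0

aligned = bytearray(b"\xff" * 256)
zero_loop(aligned, 64, 128)
print(aligned[64:192] == bytes(128))      # True: exactly the requested range

misaligned = bytearray(b"\xff" * 256)
zero_loop(misaligned, 80, 128)
print(misaligned[64:80] == bytes(16))     # True: bytes before the start were cleared
print(misaligned[192:208] == bytes(16))   # False: the tail of the range was missed
```

Under these assumptions the start address should be 64-byte aligned and the length a multiple of 0x80; otherwise the loop clears bytes outside the requested range and misses bytes inside it.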
Pairs of dependent AESE/AESMC and AESD/AESIMC instructions are higher performance when
they are adjacent in the program code and both instructions use the same destination register.
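A code generator or reviewer could check for this pattern mechanically. The following is a minimal sketch that counts AES pairs meeting the two stated conditions (adjacent, same destination register) in a simplified instruction list. The tuple representation and the function name are hypothetical, and the check that the second instruction reads the first one's result is an interpretation of "dependent".

```python
# First instruction of each pair mapped to the second.
AES_PAIRS = {"AESE": "AESMC", "AESD": "AESIMC"}

def count_adjacent_aes_pairs(instructions):
    """Count qualifying pairs in a list of (mnemonic, dest, source) tuples in program order."""
    count = 0
    for first, second in zip(instructions, instructions[1:]):
        mnemonic1, dest1, _ = first
        mnemonic2, dest2, source2 = second
        if (AES_PAIRS.get(mnemonic1) == mnemonic2   # AESE then AESMC, or AESD then AESIMC
                and dest1 == dest2                  # same destination register
                and source2 == dest1):              # second consumes the first's result
            count += 1
    return count

paired = [("AESE", "v0", "v1"), ("AESMC", "v0", "v0"),
          ("AESE", "v2", "v3"), ("AESMC", "v2", "v2")]
interleaved = [("AESE", "v0", "v1"), ("AESE", "v2", "v3"),
               ("AESMC", "v0", "v0"), ("AESMC", "v2", "v2")]
print(count_adjacent_aes_pairs(paired))       # 2
print(count_adjacent_aes_pairs(interleaved))  # 0
```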
Region    Instruction types    Notes
1    ASIMD/SVE integer ALU, ASIMD/SVE integer shift, ASIMD/scalar insert and move, ASIMD/SVE integer
     abs/cmp/max/min and the ASIMD miscellaneous instructions in table 3-18.    1
4    ASIMD/SVE AES, ASIMD/SVE polynomial multiply and all the instruction types in region 1.    1
Notes:
1. Reciprocal step and estimate instructions are excluded from this region.
2. ASIMD/SVE extract narrow, saturating instructions are excluded from this region.
3. ASIMD miscellaneous instructions can only be consumers of this region.
In addition to the regions mentioned in the table above, all instructions in regions 1 and 2 can
fast forward to FP/ASIMD/SVE stores, FP/ASIMD vector to integer register transfers and ASIMD
converts that write to general purpose registers.
For best-case performance, avoid placing more than four branch instructions within an
aligned 32-byte instruction memory region.
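This guideline is straightforward to check on a disassembly listing. The sketch below takes the addresses of branch instructions and reports every aligned 32-byte region that holds more than four of them. The function name and the way the addresses are obtained are hypothetical.

```python
from collections import Counter

def dense_branch_regions(branch_addresses, max_branches=4, region_bytes=32):
    """Return the base addresses of aligned regions containing more than `max_branches` branches."""
    # Round each address down to the start of its aligned region and count per region.
    counts = Counter(address // region_bytes * region_bytes for address in branch_addresses)
    return sorted(base for base, n in counts.items() if n > max_branches)

# Five branches packed into the region starting at 0x1000: flagged.
print([hex(base) for base in dense_branch_regions([0x1000, 0x1004, 0x1008, 0x100C, 0x1010])])  # ['0x1000']

# The same five branches shifted by 16 bytes straddle two regions: not flagged.
print(dense_branch_regions([0x1010, 0x1014, 0x1018, 0x101C, 0x1020]))  # []
```

The second call shows that alignment matters as much as density: padding or reordering that moves one branch into the next 32-byte region is enough to satisfy the guideline.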
In-Order Execution – Instructions must execute in-order with respect to other similar instructions
or in some cases all instructions.
Flush Side-Effects – Instructions trigger a flush side-effect after executing for synchronization.
The table below summarizes various special-purpose register read accesses and the associated
execution constraints or side-effects.
CurrentEL No Yes No -
DAIF No Yes No -
DLR_EL0 No Yes No -
DSPSR_EL0 No Yes No -
ELR_* No Yes No -
FPCR No Yes No -
NZCV No No No 1
SP_* No No No 1
SPSel No Yes No -
SPSR_* No Yes No -
FFR No Yes No -
Notes:
1. The NZCV and SP registers are fully renamed.
2. FPSR/FPSCR reads must wait for all prior instructions that may update the status flags to execute and retire.
3. APSR reads must wait for all prior instructions that may set the Q bit to execute and retire.
The table below summarizes various special-purpose register write accesses and the associated
execution constraints or side-effects.
NZCV No No No 1
SP_* No No No 1
Notes:
1. The NZCV and SP registers are fully renamed.
2. If the FPCR/FPSCR write is predicted to change the control field values, it will introduce a barrier which prevents
subsequent instructions from executing. If the FPCR/FPSCR write is predicted to not change the control field values, it
will execute without a barrier but trigger a flush if the values change.
3. FPSR/FPSCR writes must stall at dispatch if another FPSR/FPSCR write is still pending.
4. APSR writes that set the Q bit will introduce a barrier which prevents subsequent instructions from executing until
the write completes.
These instruction pairs must be adjacent to each other in program code. For CMP, CMN, TST and
BICS, fusion is not allowed for shifted and/or extended register forms. For BICS, the destination
register should be XZR or WZR if fusion is to take place.
MOV Xd, #0
MOV Wd, #0
MOVI Dd, #0
MOVI Vd.2D, #0
MOV Wd, Wn
MOV Xd, Xn
The last 2 instructions may not be executed with zero latency under certain conditions.
Unroll the loop to include multiple store operations per iteration, minimizing the overheads of
looping. Use the STGM (or DC GVA) instruction as shown in the example below:
Loop_start:
SUBS x2,x2,#0x80
STGM x1,[x0]
ADD x0,x0,#0x40
STGM x1,[x0]
ADD x0,x0,#0x40
B.GT Loop_start
To achieve maximum throughput when setting tags and zeroing out data, it is recommended that one do
the following.
Unroll the loop to include multiple store operations per iteration, minimizing the overheads of
looping. Use the STZGM (or DC GZVA) instruction as shown in the example below:
Loop_start:
SUBS x2,x2,#0x80
STZGM x1,[x0]
ADD x0,x0,#0x40
STZGM x1,[x0]
ADD x0,x0,#0x40
B.GT Loop_start
To achieve maximum throughput for tag-loading, it is recommended that one do the following.
Unroll the loop to include multiple load operations per iteration, minimizing the overheads of
looping. Use the LDGM instruction as shown in the example below:
Loop_start:
SUBS x2,x2,#0x80
LDGM x1,[x0]
ADD x0,x0,#0x40
LDGM x1,[x0]
ADD x0,x0,#0x40
B.GT Loop_start
Also, it is recommended to use STZGM (or DC GZVA) to set tags if the data is not a concern.
ASIMD
LD4, single 4-element structure, post indexed addressing mode, element size = 64b.
ST4, multiple 4-element structures, quad form, element size less than 64b.
ST4, multiple 4-element structures, quad form, element size = 64b, post indexed addressing
mode.
SVE
LD1B gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b unscaled offset.
LD1H gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.
LD1W gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.
LDFF1B gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b unscaled offset.
LDFF1H gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.
LDFF1W gather (scalar + vector addressing) where vector index register is the same as the
destination register and element size = 32. Addressing mode is 32b scaled or unscaled offset.