SDM Change Document
Documentation Changes
October 2024
Notice: The Intel® 64 and IA-32 architectures may contain design defects or errors known as errata
that may cause the product to deviate from published specifications. Current characterized errata are
documented in the specification updates.
Revision History
Preface
Summary Tables of Changes
Documentation Changes
Revision History
• Updated title.
-013 • There are no Documentation Changes for this revision of the document. (July 2005)
Affected Documents
Document Title (Document Number/Location):
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture (253665)
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A: Instruction Set Reference, A-L (253666)
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B: Instruction Set Reference, M-U (253667)
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2C: Instruction Set Reference, V (326018)
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2D: Instruction Set Reference, W-Z (334569)
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1 (253668)
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B: System Programming Guide, Part 2 (253669)
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C: System Programming Guide, Part 3 (326019)
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3D: System Programming Guide, Part 4 (332831)
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4: Model Specific Registers (335592)
Nomenclature
Documentation Changes include typos, errors, or omissions from the current published specifications. These
will be incorporated in any new release of the specification.
Documentation Changes (Sheet 1 of 2)
No.  DOCUMENTATION CHANGES

1. Updates to Chapter 1, Volume 1
Change bars and violet text show changes to Chapter 1 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1: Basic Architecture.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Added Intel® Xeon® 6 E-core, Intel® Xeon® 6 P-core, and Intel® Series 2 Core™ Ultra processor information
to Section 1.1, “Intel® 64 and IA-32 Processors Covered in this Manual.”
• Updated Section 1.2, “Overview of Volume 1: Basic Architecture,” with the newly added Chapter 16, and
renumbered the remaining chapters in the volume.
The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture (order number
253665) is part of a set that describes the architecture and programming environment of Intel® 64 and IA-32
architecture processors. Other volumes in this set are:
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C & 2D: Instruction Set
Reference (order numbers 253666, 253667, 326018, and 334569).
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A, 3B, 3C & 3D: System
Programming Guide (order numbers 253668, 253669, 326019, and 332831).
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4: Model-Specific Registers (order
number 335592).
The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, describes the basic architecture
and programming environment of Intel 64 and IA-32 processors. The Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volumes 2A, 2B, 2C, & 2D, describe the instruction set of the processor and the opcode struc-
ture. These volumes apply to application programmers and to programmers who write operating systems or exec-
utives. The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A, 3B, 3C, & 3D, describe
the operating-system support environment of Intel 64 and IA-32 processors. These volumes target operating-
system and BIOS designers. In addition, the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3B, addresses the programming environment for classes of software that host operating systems. The
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4, describes the model-specific registers
of Intel 64 and IA-32 processors.
The Intel® Xeon® processor E5-2600/1600 v3 product families and the Intel® Core™ i7-59xx Processor Extreme
Edition are based on the Haswell-E microarchitecture and support Intel 64 architecture.
The Intel Atom® processor Z8000 series is based on the Airmont microarchitecture.
The Intel Atom® processor Z3400 series and the Intel Atom® processor Z3500 series are based on the Silvermont
microarchitecture.
The Intel® Core™ M processor family, 5th generation Intel® Core™ processors, Intel® Xeon® processor D-1500
product family and the Intel® Xeon® processor E5 v4 family are based on the Broadwell microarchitecture and
support Intel 64 architecture.
The Intel® Xeon® Scalable Processor Family, Intel® Xeon® processor E3-1500m v5 product family and 6th gener-
ation Intel® Core™ processors are based on the Skylake microarchitecture and support Intel 64 architecture.
The 7th generation Intel® Core™ processors are based on the Kaby Lake microarchitecture and support Intel 64
architecture.
The Intel Atom® processor C series, the Intel Atom® processor X series, the Intel® Pentium® processor J series,
the Intel® Celeron® processor J series, and the Intel® Celeron® processor N series are based on the Goldmont
microarchitecture.
The Intel® Xeon Phi™ Processor 3200, 5200, 7200 Series is based on the Knights Landing microarchitecture and
supports Intel 64 architecture.
The Intel® Pentium® Silver processor series, the Intel® Celeron® processor J series, and the Intel® Celeron®
processor N series are based on the Goldmont Plus microarchitecture.
The 8th generation Intel® Core™ processors, 9th generation Intel® Core™ processors, and Intel® Xeon® E proces-
sors are based on the Coffee Lake microarchitecture and support Intel 64 architecture.
The Intel® Xeon Phi™ Processor 7215, 7285, 7295 Series is based on the Knights Mill microarchitecture and
supports Intel 64 architecture.
The 2nd generation Intel® Xeon® Scalable Processor Family is based on the Cascade Lake product and supports
Intel 64 architecture.
Some 10th generation Intel® Core™ processors are based on the Ice Lake microarchitecture, and some are based
on the Comet Lake microarchitecture; both support Intel 64 architecture.
Some 11th generation Intel® Core™ processors are based on the Tiger Lake microarchitecture, and some are
based on the Rocket Lake microarchitecture; both support Intel 64 architecture.
Some 3rd generation Intel® Xeon® Scalable Processor Family processors are based on the Cooper Lake product,
and some are based on the Ice Lake microarchitecture; both support Intel 64 architecture.
The 12th generation Intel® Core™ processors are based on the Alder Lake performance hybrid architecture and
support Intel 64 architecture.
The 13th generation Intel® Core™ processors are based on the Raptor Lake performance hybrid architecture and
support Intel 64 architecture.
The 4th generation Intel® Xeon® Scalable Processor Family is based on Sapphire Rapids microarchitecture and
supports Intel 64 architecture.
The 5th generation Intel® Xeon® Scalable Processor Family is based on Emerald Rapids microarchitecture and
supports Intel 64 architecture.
The Intel® Core™ Ultra 7 processor is based on Meteor Lake performance hybrid architecture and supports Intel 64
architecture.
The Intel® Xeon® 6 E-core processor is based on Sierra Forest microarchitecture and supports Intel 64 architec-
ture.
The Intel® Xeon® 6 P-core processor is based on Granite Rapids microarchitecture and supports Intel 64 architec-
ture.
The Intel® Series 2 Core™ Ultra processor is based on Lunar Lake performance hybrid architecture and supports
Intel 64 architecture.
IA-32 architecture is the instruction set architecture and programming environment for Intel's 32-bit microproces-
sors. Intel® 64 architecture is the instruction set architecture and programming environment which is the superset
of Intel’s 32-bit and 64-bit architectures. It is compatible with the IA-32 architecture.
Chapter 17 — Programming with Intel® Transactional Synchronization Extensions. Describes the instruc-
tion extensions that support lock elision techniques to improve the performance of multi-threaded software with
contended locks.
Chapter 18 — Control-flow Enforcement Technology. Provides an overview of the Control-flow Enforcement
Technology (CET) and gives guidelines for writing code that accesses these extensions.
Chapter 19 — Programming with Intel® Advanced Matrix Extensions. Provides an overview of the Intel®
Advanced Matrix Extensions and gives guidelines for writing code that accesses these extensions.
Chapter 20 — Input/Output. Describes the processor’s I/O mechanism, including I/O port addressing, I/O
instructions, and I/O protection mechanisms.
Chapter 21 — Processor Identification and Feature Determination. Describes how to determine the CPU
type and features available in the processor.
Appendix A — EFLAGS Cross-Reference. Summarizes how the IA-32 instructions affect the flags in the EFLAGS
register.
Appendix B — EFLAGS Condition Codes. Summarizes how conditional jump, move, and ‘byte set on condition
code’ instructions use condition code flags (OF, CF, ZF, SF, and PF) in the EFLAGS register.
Appendix C — Floating-Point Exceptions Summary. Summarizes exceptions raised by the x87 FPU floating-
point and SSE/SSE2/SSE3 floating-point instructions.
Appendix D — Guidelines for Writing SIMD Floating-Point Exception Handlers. Gives guidelines for writing
exception handlers for exceptions generated by SSE/SSE2/SSE3 floating-point instructions.
Appendix E — Intel® Memory Protection Extensions. Provides an overview of the Intel® Memory Protection
Extensions, a feature that has been deprecated and will not be available on future processors.
NOTE
Avoid any software dependence upon the state of reserved bits in Intel 64 and IA-32 registers.
Depending upon the values of reserved register bits will make software dependent upon the
unspecified manner in which the processor handles these bits. Programs that depend upon
reserved values risk incompatibility with future processors.
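As an illustration of this guideline, the following minimal C sketch (the register value, the bit position, and the MY_FEATURE_ENABLE name are hypothetical and chosen only for illustration) changes a single architecturally defined bit with a read-modify-write and leaves the reserved bits exactly as they were read:

#include <stdint.h>

/* Hypothetical control value: only bit 3 is architecturally defined in this
   illustration; every other bit is treated as reserved. */
#define MY_FEATURE_ENABLE (1u << 3)

uint32_t enable_feature(uint32_t current_value)
{
    /* Read-modify-write: set only the defined bit and write the reserved
       bits back unchanged, never assuming they read as 0 or 1. */
    return current_value | MY_FEATURE_ENABLE;
}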
Segment-register:Byte-address
For example, the following segment address identifies the byte at address FF79H in the segment pointed by the DS
register:
DS:FF79H
The following segment address identifies an instruction address in the code segment. The CS register points to the
code segment and the EIP register contains the address of the instruction.
CS:EIP
CPUID.01H:EDX.SSE[bit 25] = 1
CR4.OSFXSR[bit 9] = 1
IA32_MISC_ENABLE.ENABLEFOPCODE[bit 2] = 1
Figure 1-2. Syntax for CPUID, CR, and MSR Data Presentation
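As a sketch of how this notation maps to software, the following C fragment (assuming a GCC or Clang toolchain, whose <cpuid.h> header supplies the __get_cpuid helper; the helper is a compiler convenience, not part of the architecture) tests CPUID.01H:EDX.SSE[bit 25]:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Execute CPUID with EAX = 01H and test EDX bit 25 (SSE). */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (edx & (1u << 25)))
        printf("CPUID.01H:EDX.SSE[bit 25] = 1 (SSE supported)\n");
    else
        printf("SSE not reported by CPUID\n");
    return 0;
}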
1.3.6 Exceptions
An exception is an event that typically occurs when an instruction causes an error. For example, an attempt to
divide by zero generates an exception. However, some exceptions, such as breakpoints, occur under other condi-
tions. Some types of exceptions may provide error codes. An error code reports additional information about the
error. An example of the notation used to show an exception and error code is shown below:
#PF(fault code)
This example refers to a page-fault exception under conditions where an error code naming a type of fault is
reported. Under some conditions, exceptions that produce error codes may not be able to report an accurate code.
In this case, the error code is zero, as shown below for a general-protection exception:
#GP(0)
2. Updates to Chapter 2, Volume 1
Change bars and violet text show changes to Chapter 2 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1: Basic Architecture.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Added Sub-Page Permissions to Table 2-4 in Section 2.4, “Planned Removal of Intel® Instruction Set Archi-
tecture and Features from Upcoming Products.”
new set of 128-bit registers and the ability to perform SIMD operations on packed single precision floating-
point values. See Section 2.2.7, “SIMD Instructions.”
• The Pentium III Xeon processor extended the performance levels of the IA-32 processors with the
enhancement of a full-speed, on-die, and Advanced Transfer Cache.
2.1.11 The Intel® Core™ Duo and Intel® Core™ Solo Processors (2006—2007)
The Intel Core Duo processor offers power-efficient, dual-core performance with a low-power design that extends
battery life. This family and the single-core Intel Core Solo processor offer microarchitectural enhancements over
the Pentium M processor family.
Its enhanced microarchitecture includes:
• Intel® Smart Cache which allows for efficient data sharing between two processor cores.
• Improved decoding and SIMD execution.
• Intel® Dynamic Power Coordination and Enhanced Intel® Deeper Sleep to reduce power consumption.
• Intel® Advanced Thermal Manager which features digital thermal sensor interfaces.
• Support for power-optimized 667 MHz bus.
The dual-core Intel Xeon processor LV is based on the same microarchitecture as Intel Core Duo processor, and
supports IA-32 architecture.
2.1.12 The Intel® Xeon® Processor 5100, 5300 Series, and Intel® Core™ 2 Processor Family
(2006)
The Intel Xeon processor 3000, 3200, 5100, 5300, and 7300 series, Intel Pentium Dual-Core, Intel Core 2 Extreme,
Intel Core 2 Quad processors, and Intel Core 2 Duo processor family support Intel 64 architecture; they are based
on the high-performance, power-efficient Intel® Core microarchitecture built on 65 nm process technology. The
Intel Core microarchitecture includes the following innovative features:
• Intel® Wide Dynamic Execution to increase performance and execution throughput.
• Intel® Intelligent Power Capability to reduce power consumption.
• Intel® Advanced Smart Cache which allows for efficient data sharing between two processor cores.
• Intel® Smart Memory Access to increase data bandwidth and hide latency of memory accesses.
• Intel® Advanced Digital Media Boost which improves application performance using multiple generations of
Streaming SIMD extensions.
The Intel Xeon processor 5300 series, Intel Core 2 Extreme processor QX6800 series, and Intel Core 2 Quad
processors support Intel quad-core technology.
2.1.13 The Intel® Xeon® Processor 5200, 5400, 7400 Series, and Intel® Core™ 2 Processor
Family (2007)
The Intel Xeon processor 5200, 5400, and 7400 series, Intel Core 2 Quad processor Q9000 Series, and Intel Core 2
Duo processor E8000 series support Intel 64 architecture; they are based on the Enhanced Intel® Core
microarchitecture using 45 nm process technology. The Enhanced Intel Core microarchitecture provides the
following improved features:
• A radix-16 divider and faster OS primitives further increase the performance of Intel® Wide Dynamic Execution.
• Improved Intel® Advanced Smart Cache with an up to 50% larger level-two cache and up to a 50% increase in
way-set associativity.
• A 128-bit shuffler engine significantly improves the performance of Intel® Advanced Digital Media Boost and
SSE4.
The Intel Xeon processor 5400 series and the Intel Core 2 Quad processor Q9000 Series support Intel quad-core
technology. The Intel Xeon processor 7400 series offers up to six processor cores and an L3 cache up to 16 MBytes.
2.1.15 The Intel Atom® Processor Family Based on Silvermont Microarchitecture (2013)
Intel Atom Processor C2xxx, E3xxx, S1xxx series are based on the Silvermont microarchitecture. Processors based
on the Silvermont microarchitecture support instruction set extensions up to and including SSE4.2, AESNI, and
PCLMULQDQ.
Figure 2-1. The P6 Processor Microarchitecture with Advanced Transfer Cache Enhancement (system bus, bus unit, front end with fetch/decode and microcode ROM, instruction cache, out-of-order execution core, and retirement unit)
To ensure a steady supply of instructions and data for the instruction execution pipeline, the P6 processor microar-
chitecture incorporates two cache levels. The Level 1 cache provides an 8-KByte instruction cache and an 8-KByte
data cache, both closely coupled to the pipeline. The Level 2 cache provides 256-KByte, 512-KByte, or 1-MByte
static RAM that is coupled to the core processor through a full clock-speed 64-bit cache bus.
The centerpiece of the P6 processor microarchitecture is an out-of-order execution mechanism called dynamic
execution. Dynamic execution incorporates three data-processing concepts:
• Deep branch prediction allows the processor to decode instructions beyond branches to keep the instruction
pipeline full. The P6 processor family implements highly optimized branch prediction algorithms to predict the
direction of the instruction.
• Dynamic data flow analysis requires real-time analysis of the flow of data through the processor to
determine dependencies and to detect opportunities for out-of-order instruction execution. The out-of-order
execution core can monitor many instructions and execute these instructions in the order that best optimizes
the use of the processor’s multiple execution units, while maintaining the data integrity.
• Speculative execution refers to the processor’s ability to execute instructions that lie beyond a conditional
branch that has not yet been resolved, and ultimately to commit the results in the order of the original
instruction stream. To make speculative execution possible, the P6 processor microarchitecture decouples the
dispatch and execution of instructions from the commitment of results. The processor’s out-of-order execution
core uses data-flow analysis to execute all available instructions in the instruction pool and store the results in
temporary registers. The retirement unit then linearly searches the instruction pool for completed instructions
that no longer have data dependencies with other instructions or unresolved branch predictions. When
completed instructions are found, the retirement unit commits the results of these instructions to memory
and/or the IA-32 registers (the processor’s eight general-purpose registers and eight x87 FPU data registers)
in the order they were originally issued and retires the instructions from the instruction pool.
1. Intel 64 and IA-32 processors based on the Intel NetBurst microarchitecture at 90 nm process can handle more than 24 stores in
flight.
• High-performance, quad-pumped bus interface to the Intel NetBurst microarchitecture system bus.
— Supports quad-pumped, scalable bus clock to achieve up to 4X effective speed.
— Capable of delivering up to 8.5 GBytes of bandwidth per second.
• Superscalar issue to enable parallelism.
• Expanded hardware registers with renaming to avoid register name space limitations.
• 64-byte cache line size (transfers data up to two lines per sector).
Figure 2-2 is an overview of the Intel NetBurst microarchitecture. This microarchitecture pipeline is made up of
three sections: (1) the front end pipeline, (2) the out-of-order execution core, and (3) the retirement unit.
Figure 2-2. The Intel NetBurst Microarchitecture (system bus; front end with trace cache, fetch/decode, and microcode ROM; out-of-order execution core; retirement unit)
• Wasted decode bandwidth due to branches or branch target in the middle of cache lines.
The operation of the pipeline’s trace cache addresses these issues. Instructions are constantly being fetched and
decoded by the translation engine (part of the fetch/decode logic) and built into sequences of micro-ops called
traces. At any time, multiple traces (representing prefetched branches) are being stored in the trace cache. The
trace cache is searched for the instruction that follows the active branch. If the instruction also appears as the first
instruction in a pre-fetched branch, the fetch and decode of instructions from the memory hierarchy ceases and the
pre-fetched branch becomes the new source of instructions (see Figure 2-2).
The trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are
predicted based on their linear addresses using branch target buffers (BTBs) and fetched as soon as possible.
• Intel® Smart Memory Access prefetches data from memory in response to data access patterns and reduces
cache-miss exposure of out-of-order execution.
— Hardware prefetchers to reduce effective latency of second-level cache misses.
— Hardware prefetchers to reduce effective latency of first-level data cache misses.
— Memory disambiguation to improve efficiency of speculative execution engine.
• Intel® Advanced Digital Media Boost improves most 128-bit SIMD instructions with single-cycle
throughput and floating-point operations.
— Single-cycle throughput of most 128-bit SIMD instructions.
— Up to eight floating-point operations per cycle.
— Three issue ports available to dispatching SIMD instructions for execution.
Intel Core 2 Extreme, Intel Core 2 Duo processors, and Intel Xeon processor 5100 series implement two processor
cores based on the Intel Core microarchitecture; the functionality of the subsystems in each core is depicted in
Figure 2-3.
Figure 2-3. Intel Core microarchitecture core subsystems (instruction queue; decode with microcode ROM; rename/alloc; scheduler; ALU, branch, FAdd, FMul, load, and store units with MMX/SSE/FP move; shared L2 cache with up to 10.7 GB/s FSB)
Intel 64 architecture allows four generations of 128-bit SIMD extensions to access up to 16 XMM registers. IA-32
architecture provides eight XMM registers.
Intel® Advanced Vector Extensions offers comprehensive architectural enhancements over previous generations of
Streaming SIMD Extensions. Intel AVX introduces the following architectural enhancements:
• Support for 256-bit wide vectors and SIMD register set.
• 256-bit floating-point instruction set enhancement with up to 2X performance gain relative to 128-bit
Streaming SIMD extensions.
• Instruction syntax support for generalized three-operand syntax to improve instruction programming flexibility
and efficient encoding of new instruction extensions.
• Enhancement of legacy 128-bit SIMD instruction extensions to support three operand syntax and to simplify
compiler vectorization of high-level language expressions.
• Support for flexible deployment of 256-bit AVX code, 128-bit AVX code, legacy 128-bit code, and scalar code.
In addition to performance considerations, programmers should also be cognizant of the implications of VEX-
encoded AVX instructions with the expectations of system software components that manage the processor state
components enabled by XCR0. For additional information see Section 2.3.10.1, “Vector Length Transition and
Programming Considerations” in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.
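As a brief sketch of the three-operand, non-destructive style that VEX encoding provides (using the compiler intrinsic _mm256_add_ps from <immintrin.h>; compile with AVX enabled, for example -mavx), the following C function adds two 256-bit vectors without overwriting either source operand, which typically compiles to a single VADDPS instruction:

#include <immintrin.h>

/* c = a + b on eight packed single precision elements; neither source is
   modified, matching the generalized three-operand VEX syntax. */
__m256 add_ps_256(__m256 a, __m256 b)
{
    return _mm256_add_ps(a, b);
}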
See also:
• Section 5.4, “MMX Instructions,” and Chapter 9, “Programming with Intel® MMX™ Technology.”
• Section 5.5, “Intel® SSE Instructions,” and Chapter 10, “Programming with Intel® Streaming SIMD Extensions
(Intel® SSE).”
• Section 5.6, “Intel® SSE2 Instructions,” and Chapter 11, “Programming with Intel® Streaming SIMD
Extensions 2 (Intel® SSE2).”
• Section 5.7, “Intel® SSE3 Instructions,” Section 5.8, “Supplemental Streaming SIMD Extensions 3 (SSSE3)
Instructions,” Section 5.9, “Intel® SSE4 Instructions,” and Chapter 12, “Programming with Intel® SSE3,
SSSE3, Intel® SSE4, and Intel® AES-NI.”
Figure (SIMD extension register layouts): MMX registers (MMX technology through SSSE3) hold a quadword as 8 packed byte integers or 4 packed word integers; XMM registers (SSE through AVX) hold a double quadword as 4 packed single precision floating-point values, 2 packed double precision floating-point values, 16 packed byte integers, or 2 quadword integers; YMM registers (AVX) hold 8 packed SP FP values, 4 packed DP FP values, or 2 128-bit data elements.
1. In the remainder of this document, the term “thread” will be used as a general term for the terms “process” and “thread.”
The Pentium® dual-core processor is based on the same technology as the Intel Core 2 Duo processor family.
The Intel Xeon processor 7300, 5300, and 3200 series, Intel Core 2 Extreme Quad-Core processor, and Intel Core
2 Quad processors support Intel quad-core technology. The Quad-Core Intel Xeon processors and the Quad-Core
Intel Core 2 processor family are also shown in Figure 2-7.
Intel Core i7 processors support Intel quad-core technology and Intel Hyper-Threading Technology, provide an
Intel QuickPath Interconnect link to the chipset, and have an integrated memory controller supporting three
channels of DDR3 memory.
The key features of the Intel Pentium 4 processor, Intel Xeon processor, Intel Xeon processor MP, Pentium III
processor, and Pentium III Xeon processor with advanced transfer cache are shown in Table 2-1. Older generation
IA-32 processors, which do not employ on-die Level 2 cache, are shown in Table 2-2.
Table 2-1. Key Features of Most Recent IA-32 Processors

Intel Pentium M Processor 755 (3)
  Date Introduced: 2004
  Microarchitecture: Intel Pentium M Processor
  Top-Bin Clock Frequency at Introduction: 2.00 GHz
  Transistors: 140 M
  Register Sizes (1): GP: 32; FPU: 80; MMX: 64; XMM: 128
  System Bus Bandwidth: 3.2 GB/s
  Max. Extern. Addr. Space: 4 GB
  On-Die Caches (2): L1: 64 KB; L2: 2 MB

Intel Core Duo Processor T2600 (3)
  Date Introduced: 2006
  Microarchitecture: Improved Intel Pentium M Processor Microarchitecture; Dual Core; Intel Smart Cache, Advanced Thermal Manager
  Top-Bin Clock Frequency at Introduction: 2.16 GHz
  Transistors: 152 M
  Register Sizes (1): GP: 32; FPU: 80; MMX: 64; XMM: 128
  System Bus Bandwidth: 5.3 GB/s
  Max. Extern. Addr. Space: 4 GB
  On-Die Caches (2): L1: 64 KB; L2: 2 MB (2 MB Total)

Intel Atom Processor Z5xx series
  Date Introduced: 2008
  Microarchitecture: Intel Atom Microarchitecture; Intel Virtualization Technology
  Top-Bin Clock Frequency at Introduction: 1.86 GHz - 800 MHz
  Transistors: 47 M
  Register Sizes (1): GP: 32; FPU: 80; MMX: 64; XMM: 128
  System Bus Bandwidth: Up to 4.2 GB/s
  Max. Extern. Addr. Space: 4 GB
  On-Die Caches (2): L1: 56 KB (4); L2: 512 KB
NOTES:
1. The register size and external data bus size are given in bits.
2. First level cache is denoted using the abbreviation L1, 2nd level cache is denoted as L2. The size of L1 includes the first-level data
cache and the instruction cache where applicable, but does not include the trace cache.
3. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family,
not across different processor families. See http://www.intel.com/products/processor_number for details.
4. In Intel Atom Processor, the size of L1 instruction cache is 32 KBytes, L1 data cache is 24 KBytes.
NOTES:
1. The register size and external data bus size are given in bits. Note also that each 32-bit general-purpose (GP) register can be
addressed as an 8- or a 16-bit data register in all of the processors.
2. Internal data paths are 2 to 4 times wider than the external data bus for each processor.
3. Updates to Chapter 5, Volume 1
Change bars and violet text show changes to Chapter 5 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1: Basic Architecture.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated Table 5-2, “Instruction Set Extensions Introduction in Intel® 64 and IA-32 Processors,” with the ISA
features that have moved into the Intel® 64 and IA-32 Architectures Software Developer’s Manuals in this
release.
• Added Section 5.31, “Intel® AVX10.1 Instructions.”
This chapter provides an abridged overview of Intel 64 and IA-32 instructions. Instructions are divided into the
following groups:
• Section 5.1, “General-Purpose Instructions.”
• Section 5.2, “x87 FPU Instructions.”
• Section 5.3, “x87 FPU AND SIMD State Management Instructions.”
• Section 5.4, “MMX Instructions.”
• Section 5.5, “Intel® SSE Instructions.”
• Section 5.6, “Intel® SSE2 Instructions.”
• Section 5.7, “Intel® SSE3 Instructions.”
• Section 5.8, “Supplemental Streaming SIMD Extensions 3 (SSSE3) Instructions.”
• Section 5.9, “Intel® SSE4 Instructions.”
• Section 5.10, “Intel® SSE4.1 Instructions.”
• Section 5.11, “Intel® SSE4.2 Instruction Set.”
• Section 5.12, “Intel® AES-NI and PCLMULQDQ.”
• Section 5.13, “Intel® Advanced Vector Extensions (Intel® AVX).”
• Section 5.14, “16-bit Floating-Point Conversion.”
• Section 5.15, “Fused-Multiply-ADD (FMA).”
• Section 5.16, “Intel® Advanced Vector Extensions 2 (Intel® AVX2).”
• Section 5.17, “Intel® Transactional Synchronization Extensions (Intel® TSX).”
• Section 5.18, “Intel® SHA Extensions.”
• Section 5.19, “Intel® Advanced Vector Extensions 512 (Intel® AVX-512).”
• Section 5.20, “System Instructions.”
• Section 5.21, “64-Bit Mode Instructions.”
• Section 5.22, “Virtual-Machine Extensions.”
• Section 5.23, “Safer Mode Extensions.”
• Section 5.24, “Intel® Memory Protection Extensions.”
• Section 5.25, “Intel® Software Guard Extensions.”
• Section 5.26, “Shadow Stack Management Instructions.”
• Section 5.27, “Control Transfer Terminating Instructions.”
• Section 5.28, “Intel® AMX Instructions.”
• Section 5.29, “User Interrupt Instructions.”
• Section 5.30, “Enqueue Store Instructions.”
• Section 5.31, “Intel® Advanced Vector Extensions 10 Version 1 Instructions.”
Table 5-1 lists the groups and IA-32 processors that support each group. More recent instruction set extensions are
listed in Table 5-2. Within these groups, most instructions are collected into functional subgroups.
Table 5-2. Instruction Set Extensions Introduction in Intel® 64 and IA-32 Processors
Instruction Set Architecture Processor Generation Introduction
SSE4.1 Extensions Intel® Xeon® processor 3100, 3300, 5200, 5400, 7400, 7500 series, Intel® Core™ 2 Extreme
processors QX9000 series, Intel® Core™ 2 Quad processor Q9000 series, Intel® Core™ 2 Duo processors
8000 series and T9000 series, Intel Atom® processor based on Silvermont microarchitecture.
SSE4.2 Extensions, CRC32, POPCNT: Intel® Core™ i7 965 processor, Intel® Xeon® processors X3400, X3500, X5500, X6500, X7500 series, Intel Atom processor based on Silvermont microarchitecture.
Intel® AES-NI, PCLMULQDQ Intel® Xeon® processor E7 series, Intel® Xeon® processors X3600 and X5600, Intel® Core™ i7 980X
processor, Intel Atom processor based on Silvermont microarchitecture. Use CPUID to verify presence
of Intel AES-NI and PCLMULQDQ across Intel® Core™ processor families.
Intel® AVX Intel® Xeon® processor E3 and E5 families, 2nd Generation Intel® Core™ i7, i5, i3 processor 2xxx
families.
F16C 3rd Generation Intel® Core™ processors, Intel® Xeon® processor E3-1200 v2 product family, Intel®
Xeon® processor E5 v2 and E7 v2 families.
RDRAND 3rd Generation Intel Core processors, Intel Xeon processor E3-1200 v2 product family, Intel Xeon
processor E5 v2 and E7 v2 families, Intel Atom processor based on Silvermont microarchitecture.
FS/GS base access 3rd Generation Intel Core processors, Intel Xeon processor E3-1200 v2 product family, Intel Xeon
processor E5 v2 and E7 v2 families, Intel Atom® processor based on Goldmont microarchitecture.
FMA, AVX2, BMI1, BMI2, INVPCID, LZCNT, Intel® TSX: Intel® Xeon® processor E3/E5/E7 v3 product families, 4th Generation Intel® Core™ processor family.
MOVBE Intel Xeon processor E3/E5/E7 v3 product families, 4th Generation Intel Core processor family, Intel
Atom processors.
PREFETCHW Intel® Core™ M processor family; 5th Generation Intel® Core™ processor family, Intel Atom processor
based on Silvermont microarchitecture.
ADX Intel Core M processor family, 5th Generation Intel Core processor family.
RDSEED, CLAC, STAC Intel Core M processor family, 5th Generation Intel Core processor family, Intel Atom processor based
on Goldmont microarchitecture.
AVX512ER, AVX512PF, PREFETCHWT1: Intel® Xeon Phi™ Processor 3200, 5200, 7200 Series.
AVX512F, AVX512CD Intel Xeon Phi Processor 3200, 5200, 7200 Series, Intel® Xeon® Scalable Processor Family, Intel® Core™
i3-8121U processor.
CLFLUSHOPT, XSAVEC, XSAVES, Intel® MPX: Intel Xeon Scalable Processor Family, 6th Generation Intel® Core™ processor family, Intel Atom processor based on Goldmont microarchitecture.
SGX1 6th Generation Intel Core processor family, Intel Atom® processor based on Goldmont Plus
microarchitecture.
AVX512DQ, AVX512BW, AVX512VL: Intel Xeon Scalable Processor Family, Intel Core i3-8121U processor based on Cannon Lake microarchitecture.
CLWB Intel Xeon Scalable Processor Family, Intel Atom® processor based on Tremont microarchitecture, 11th
Generation Intel Core processor family based on Tiger Lake microarchitecture.
PKU Intel Xeon Scalable Processor Family, 10th generation Intel® Core™ processors based on Comet Lake
microarchitecture.
AVX512_IFMA, AVX512_VBMI: Intel Core i3-8121U processor based on Cannon Lake microarchitecture.
Intel® SHA Extensions Intel Core i3-8121U processor based on Cannon Lake microarchitecture, Intel Atom processor based
on Goldmont microarchitecture, 3rd Generation Intel® Xeon® Scalable Processor Family based on Ice
Lake microarchitecture.
UMIP Intel Core i3-8121U processor based on Cannon Lake microarchitecture, Intel Atom processor based
on Goldmont Plus microarchitecture.
PTWRITE Intel Atom processor based on Goldmont Plus microarchitecture, 12th generation Intel® Core™
processor supporting Alder Lake performance hybrid architecture, 4th generation Intel® Xeon®
Scalable Processor Family based on Sapphire Rapids microarchitecture.
RDPID 10th Generation Intel® Core™ processor family based on Ice Lake microarchitecture, Intel Atom
processor based on Goldmont Plus microarchitecture.
AVX512_4FMAPS, AVX512_4VNNIW: Intel® Xeon Phi™ Processor 7215, 7285, 7295 Series.
AVX512_VNNI 2nd Generation Intel® Xeon® Scalable Processor Family, 10th Generation Intel Core processor family
based on Ice Lake microarchitecture.
AVX512_VPOPCNTDQ Intel Xeon Phi Processor 7215, 7285, 7295 Series, 10th Generation Intel Core processor family based
on Ice Lake microarchitecture.
Fast Short REP MOV 10th Generation Intel Core processor family based on Ice Lake microarchitecture.
GFNI (SSE) 10th Generation Intel Core processor family based on Ice Lake microarchitecture, Intel Atom processor
based on Tremont microarchitecture.
VAES, GFNI (AVX/AVX512), AVX512_VBMI2, VPCLMULQDQ, AVX512_BITALG: 10th Generation Intel Core processor family based on Ice Lake microarchitecture.
ENCLV Future processors.
Split Lock Detection 10th Generation Intel Core processor family based on Ice Lake microarchitecture, Intel Atom processor
based on Tremont microarchitecture.
CLDEMOTE Intel Atom processor based on Tremont microarchitecture, 4th generation Intel® Xeon® Scalable
Processor Family based on Sapphire Rapids microarchitecture.
Direct stores (MOVDIRI, MOVDIR64B): Intel Atom processor based on Tremont microarchitecture, 11th Generation Intel Core processor family based on Tiger Lake microarchitecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
User wait (TPAUSE, UMONITOR, UMWAIT): Intel Atom processor based on Tremont microarchitecture, 12th generation Intel Core processor based on Alder Lake performance hybrid architecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
AVX512_BF16 3rd Generation Intel® Xeon® Scalable Processor Family based on Cooper Lake product, 4th generation
Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
AVX512_VP2INTERSECT 11th Generation Intel Core processor family based on Tiger Lake microarchitecture. (Not currently
supported in any other processors).
Key Locker1 11th Generation Intel Core processor family based on Tiger Lake microarchitecture, 12th generation
Intel Core processor supporting Alder Lake performance hybrid architecture.
Control-flow Enforcement Technology (CET): 11th Generation Intel Core processor family based on Tiger Lake microarchitecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
TME-MK2, PCONFIG 3rd Generation Intel® Xeon® Scalable Processor Family based on Ice Lake microarchitecture.
WBNOINVD 3rd Generation Intel® Xeon® Scalable Processor Family based on Ice Lake microarchitecture.
LBRs (architectural) 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture, 4th
generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel®
Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
Intel® Virtualization Technology - Redirect Protection (Intel® VT-rp) and HLAT: 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
AVX-VNNI 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture3, 4th
generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel®
Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
SERIALIZE 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture, 4th
generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel®
Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
Intel® Thread Director and HRESET: 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture.
Fast zero-length REP MOVSB, fast short REP STOSB: 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
Fast short REP CMPSB, fast short REP SCASB: 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
Supervisor Memory Protection Keys (PKS): 12th generation Intel Core processor supporting Alder Lake performance hybrid architecture, 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
Attestation Services for Intel® SGX: 3rd Generation Intel® Xeon® Scalable Processor Family based on Ice Lake microarchitecture.
Enqueue Stores (ENQCMD and ENQCMDS): 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
Intel® TSX Suspend Load Address Tracking (TSXLDTRK): 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
Intel® Advanced Matrix Extensions (Intel® AMX), including CPUID Leaf 1EH, “TMUL Information Main Leaf”, and CPUID bits AMX-BF16, AMX-TILE, and AMX-INT8: 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
User Interrupts (UINTR) 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture,
Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor
supporting Lunar Lake performance hybrid architecture.
IPI Virtualization 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture,
Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor
supporting Lunar Lake performance hybrid architecture.
AVX512-FP16, for the FP16 Data Type: 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture.
Virtualization of guest accesses to IA32_SPEC_CTRL: 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture, Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
Linear Address Masking (LAM): Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
Linear Address Space Separation (LASS): Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
PREFETCHIT0/1 Intel® Xeon® 6 P-core processors based on Granite Rapids microarchitecture.
AMX-FP16 Intel® Xeon® 6 P-core processors based on Granite Rapids microarchitecture.
CMPCCXADD Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra
processor supporting Lunar Lake performance hybrid architecture.
AVX-IFMA Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra
processor supporting Lunar Lake performance hybrid architecture.
AVX-NE-CONVERT Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor
supporting Lunar Lake performance hybrid architecture.
AVX-VNNI-INT8 Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra
processor supporting Lunar Lake performance hybrid architecture.
AVX-VNNI-INT16 Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
SHA512 Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
SM3 Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
SM4 Intel® Core™ Ultra processor supporting Lunar Lake performance hybrid architecture.
RDMSRLIST, WRMSRLIST, and WRMSRNS: Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
UC Lock Disable Causes #AC Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture.
LBR Event Logging Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra processor
supporting Lunar Lake performance hybrid architecture.
UIRET flexibly updates UIF Intel® Xeon® 6 E-core processors based on Sierra Forest microarchitecture, Intel® Core™ Ultra
processor supporting Lunar Lake performance hybrid architecture.
Intel® Advanced Vector Extensions 10 Version 1 (Intel® AVX10.1): Intel® Xeon® 6 P-core processors based on Granite Rapids microarchitecture.
NOTES:
1. Details on Key Locker can be found in the Intel Key Locker Specification here:
https://software.intel.com/content/www/us/en/develop/download/intel-key-locker-specification.html.
2. Further details on TME-MK usage can be found here:
https://software.intel.com/sites/default/files/managed/a5/16/Multi-Key-Total-Memory-Encryption-Spec.pdf.
3. Alder Lake performance hybrid architecture does not support Intel® AVX-512. ISA features such as Intel® AVX, AVX-VNNI, Intel® AVX2,
and UMONITOR/UMWAIT/TPAUSE are supported.
The following sections list instructions in each major group and subgroup. Given for each instruction is its
mnemonic and descriptive names. When two or more mnemonics are given (for example, CMOVA/CMOVNBE), they
represent different mnemonics for the same instruction opcode. Assemblers support redundant mnemonics for
some instructions to make it easier to read code listings. For instance, CMOVA (Conditional move if above) and
CMOVNBE (Conditional move if not below or equal) represent the same condition. For detailed information about
specific instructions, see the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C,
& 2D.
CMP Compare.
SETAE/SETNB/SETNC Set byte if above or equal/Set byte if not below/Set byte if not carry.
SETB/SETNAE/SETC Set byte if below/Set byte if not above or equal/Set byte if carry.
SETBE/SETNA Set byte if below or equal/Set byte if not above.
SETG/SETNLE Set byte if greater/Set byte if not less or equal.
SETGE/SETNL Set byte if greater or equal/Set byte if not less.
SETL/SETNGE Set byte if less/Set byte if not greater or equal.
SETLE/SETNG Set byte if less or equal/Set byte if not greater.
SETS Set byte if sign (negative).
SETNS Set byte if not sign (non-negative).
SETO Set byte if overflow.
SETNO Set byte if not overflow.
SETPE/SETP Set byte if parity even/Set byte if parity.
SETPO/SETNP Set byte if parity odd/Set byte if not parity.
TEST Logical compare.
CRC321 Provides hardware acceleration to calculate cyclic redundancy checks for fast and efficient
implementation of data integrity protocols.
POPCNT2 Calculates the number of bits set to 1 in the second operand (source) and returns the count
in the first operand (a destination register).
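As a short C sketch of how these two instructions are commonly reached from source code (via the SSE4.2 intrinsics declared in <nmmintrin.h>; compile with SSE4.2/POPCNT enabled, for example -msse4.2), the fragment below accumulates a CRC over a buffer and counts the bits set in a value:

#include <nmmintrin.h>   /* _mm_crc32_u8, _mm_popcnt_u32 */
#include <stddef.h>
#include <stdint.h>

/* Accumulate a CRC value over a byte buffer using the CRC32 instruction. */
uint32_t crc32_accumulate(uint32_t crc, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        crc = _mm_crc32_u8(crc, buf[i]);
    return crc;
}

/* Count the number of bits set to 1 using the POPCNT instruction. */
int count_set_bits(uint32_t value)
{
    return _mm_popcnt_u32(value);
}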
LOOPZ/LOOPE Loop with ECX and zero/Loop with ECX and equal.
LOOPNZ/LOOPNE Loop with ECX and not zero/Loop with ECX and not equal.
CALL Call procedure.
RET Return.
IRET Return from interrupt.
INT Software interrupt.
INTO Interrupt on overflow.
BOUND Detect value out of range.
ENTER High-level procedure entry.
LEAVE High-level procedure exit.
extensions, and SSE3 extensions. For a discussion that puts SIMD instructions in their historical context, see
Section 2.2.7, “SIMD Instructions.”
MMX instructions operate on packed byte, word, doubleword, or quadword integer operands contained in memory,
in MMX registers, and/or in general-purpose registers. For more detail on these instructions, see Chapter 9,
“Programming with Intel® MMX™ Technology.”
MMX instructions can only be executed on Intel 64 and IA-32 processors that support the MMX technology. Support
for these instructions can be detected with the CPUID instruction. See the description of the CPUID instruction in
Chapter 3, “Instruction Set Reference, A-L,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 2A.
MMX instructions are divided into the following subgroups: data transfer, conversion, packed arithmetic, compar-
ison, logical, shift and rotate, and state management instructions. The sections that follow introduce each
subgroup.
Intel SSE instructions can only be executed on Intel 64 and IA-32 processors that support Intel SSE extensions.
Support for these instructions can be detected with the CPUID instruction. See the description of the CPUID instruc-
tion in Chapter 3, “Instruction Set Reference, A-L,” of the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 2A.
Intel SSE instructions are divided into four subgroups (note that the first subgroup has subordinate subgroups of
its own):
• SIMD single precision floating-point instructions that operate on the XMM registers.
• MXCSR state management instructions.
• 64-bit SIMD integer instructions that operate on the MMX registers.
• Cacheability control, prefetch, and instruction ordering instructions.
The following sections provide an overview of these groups.
CVTTSS2SI Convert with truncation a scalar single precision floating-point value to a scalar doubleword integer.
5.5.4 Intel® SSE Cacheability Control, Prefetch, and Instruction Ordering Instructions
The cacheability control instructions provide control over the caching of non-temporal data when storing data from
the MMX and XMM registers to memory. The PREFETCHh instruction allows data to be prefetched to a selected
cache level. The SFENCE instruction controls instruction ordering on store operations.
MASKMOVQ Non-temporal store of selected bytes from an MMX register into memory.
MOVNTQ Non-temporal store of quadword from an MMX register into memory.
MOVNTPS Non-temporal store of four packed single precision floating-point values from an XMM
register into memory.
PREFETCHh Load 32 or more bytes from memory into a selected level of the processor’s cache hierarchy.
SFENCE Serializes store operations.
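A minimal C sketch of these instructions (using the Intel SSE intrinsics from <xmmintrin.h>; note that MOVNTPS requires a 16-byte aligned destination) prefetches a source line, performs a non-temporal store of four packed single precision values, and then fences the stores:

#include <xmmintrin.h>   /* _mm_prefetch, _mm_stream_ps, _mm_sfence */

/* Copy four floats to a 16-byte aligned destination without polluting the
   caches, then order the non-temporal stores with SFENCE. */
void stream_copy4(float *dst_aligned16, const float *src)
{
    _mm_prefetch((const char *)src, _MM_HINT_NTA);  /* PREFETCHNTA */
    __m128 v = _mm_loadu_ps(src);
    _mm_stream_ps(dst_aligned16, v);                /* MOVNTPS */
    _mm_sfence();                                   /* SFENCE */
}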
instruction in Chapter 3, “Instruction Set Reference, A-L,” of the Intel® 64 and IA-32 Architectures Software Devel-
oper’s Manual, Volume 2A.
These instructions are divided into four subgroups (note that the first subgroup is further divided into subordinate
subgroups):
• Packed and scalar double precision floating-point instructions.
• Packed single precision floating-point conversion instructions.
• 128-bit SIMD integer instructions.
• Cacheability-control and instruction ordering instructions.
The following sections give an overview of each subgroup.
5.6.1 Intel® SSE2 Packed and Scalar Double Precision Floating-Point Instructions
Intel SSE2 packed and scalar double precision floating-point instructions are divided into the following subordinate
subgroups: data movement, arithmetic, comparison, conversion, logical, and shuffle operations on double preci-
sion floating-point operands. These are introduced in the sections that follow.
CVTSD2SS Convert scalar double precision floating-point values to scalar single precision floating-
point values.
CVTSD2SI Convert scalar double precision floating-point values to a doubleword integer.
CVTTSD2SI Convert with truncation scalar double precision floating-point values to scalar doubleword
integers.
CVTSI2SD Convert doubleword integer to scalar double precision floating-point value.
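A brief C sketch of these scalar conversions (using the SSE2 intrinsics in <emmintrin.h>): the truncating intrinsic corresponds to CVTTSD2SI and always rounds toward zero, while CVTSD2SI rounds according to the MXCSR rounding mode:

#include <emmintrin.h>   /* SSE2 intrinsics */

/* CVTTSD2SI: convert the low double precision element of a to a doubleword
   integer with truncation (round toward zero). */
int double_to_int_truncated(__m128d a)
{
    return _mm_cvttsd_si32(a);
}

/* CVTSI2SD: convert a doubleword integer to the low double precision element
   of a; the upper element of a is passed through unchanged. */
__m128d int_to_double(__m128d a, int x)
{
    return _mm_cvtsi32_sd(a, x);
}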
HADDPD Performs a double precision addition on contiguous data elements. The first data element
of the result is obtained by adding the first and second elements of the first operand; the
second element by adding the first and second elements of the second operand.
HSUBPD Performs a double precision subtraction on contiguous data elements. The first data
element of the result is obtained by subtracting the second element of the first operand
from the first element of the first operand; the second element by subtracting the second
element of the second operand from the first element of the second operand.
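A short C illustration of this horizontal behavior (using the SSE3 intrinsics in <pmmintrin.h>; compile with SSE3 enabled): for inputs a = {a0, a1} and b = {b0, b1}, HADDPD produces {a0+a1, b0+b1} and HSUBPD produces {a0-a1, b0-b1}:

#include <pmmintrin.h>   /* _mm_hadd_pd, _mm_hsub_pd */

/* HADDPD: result element 0 = a0 + a1, result element 1 = b0 + b1. */
__m128d horizontal_add(__m128d a, __m128d b)
{
    return _mm_hadd_pd(a, b);
}

/* HSUBPD: result element 0 = a0 - a1, result element 1 = b0 - b1. */
__m128d horizontal_sub(__m128d a, __m128d b)
{
    return _mm_hsub_pd(a, b);
}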
source and destination operands. The signed 16-bit results are packed and written to the
destination operand.
PHSUBSW Performs horizontal subtraction on each adjacent pair of 16-bit signed integers by
subtracting the most significant word from the least significant word of each pair in the
source and destination operands. The signed, saturated 16-bit results are packed and
written to the destination operand.
PHSUBD Performs horizontal subtraction on each adjacent pair of 32-bit signed integers by
subtracting the most significant doubleword from the least significant double word of each
pair in the source and destination operands. The signed 32-bit results are packed and
written to the destination operand.
(“streaming load buffers”). Subsequent streaming loads to other aligned 16-byte items in
the same streaming line may be supplied from the streaming load buffer and can improve
throughput.
PINSRB Insert a byte value from a register or memory into an XMM register.
PINSRD Insert a dword value from 32-bit register or memory into an XMM register.
PINSRQ Insert a qword value from 64-bit register or memory into an XMM register.
PEXTRB Extract a byte from an XMM register and insert the value into a general-purpose register or
memory.
PEXTRW Extract a word from an XMM register and insert the value into a general-purpose register
or memory.
PEXTRD Extract a dword from an XMM register and insert the value into a general-purpose register
or memory.
PEXTRQ Extract a qword from an XMM register and insert the value into a general-purpose register
or memory.
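A brief C sketch of the insert/extract pattern (using the SSE4.1 intrinsics in <smmintrin.h>; the lane index must be a compile-time constant): the function below inserts a doubleword into lane 2 of an XMM register with PINSRD and reads it back with PEXTRD:

#include <smmintrin.h>   /* _mm_insert_epi32, _mm_extract_epi32 */

/* Replace dword lane 2 of v with x (PINSRD), then extract it again (PEXTRD). */
int insert_extract_lane2(__m128i v, int x)
{
    __m128i t = _mm_insert_epi32(v, x, 2);
    return _mm_extract_epi32(t, 2);
}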
In addition, AVX2 provides enhanced functionality for broadcast/permute operations on data elements, vector
shift instructions with a variable shift count per data element, and instructions to fetch non-contiguous data
elements from memory; a brief example follows the list below.
• Table 14-18 lists promoted vector integer instructions in AVX2.
• Table 14-19 lists new instructions in AVX2 that complement AVX.
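Here is a small C sketch of the gather and variable-shift capabilities mentioned above (using AVX2 intrinsics from <immintrin.h>; compile with AVX2 enabled, for example -mavx2): it gathers eight non-contiguous dwords from a table and then shifts each gathered element left by a per-element count:

#include <immintrin.h>

/* Gather table[idx[i]] for eight dword indices (VPGATHERDD), then shift each
   gathered element left by its own count (VPSLLVD). The scale argument of 4
   is the element size in bytes. */
__m256i gather_then_shift(const int *table, __m256i idx, __m256i counts)
{
    __m256i gathered = _mm256_i32gather_epi32(table, idx, 4);
    return _mm256_sllv_epi32(gathered, counts);
}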
VPRORRD/Q Rotate dword/qword element right by shift counts specified in a vector with conditional
update.
VPSCATTERDD/DQ Scatter dword/qword elements in a vector to memory using dword indices.
VPSCATTERQD/QQ Scatter dword/qword elements in a vector to memory using qword indices.
VPSRAQ Shift qwords right by a constant shift count and shifting in sign bits.
VPSRAVQ Shift qwords right by shift counts in a vector and shifting in sign bits.
VPTESTNMD/Q Perform bitwise NAND of dword/qword elements of two vectors and write results to
opmask.
VPTERNLOGD/Q Perform bitwise ternary logic operation of three vectors with 32/64 bit granular conditional
update.
VPTESTMD/Q Perform bitwise AND of dword/qword elements of two vectors and write results to opmask.
VRCP14PD/PS Compute approximate reciprocals of packed DP/SP FP elements of a vector.
VRCP14SD/SS Compute the approximate reciprocal of the low DP/SP FP element of a vector.
VRNDSCALEPD/PS Round packed DP/SP FP elements of a vector to specified number of fraction bits.
VRNDSCALESD/SS Round the low DP/SP FP element of a vector to specified number of fraction bits.
VRSQRT14PD/PS Compute approximate reciprocals of square roots of packed DP/SP FP elements of a vector.
VRSQRT14SD/SS Compute the approximate reciprocal of square root of the low DP/SP FP element of a
vector.
VSCALEFPD/PS Multiply packed DP/SP FP elements of a vector by powers of two with exponents specified
in a second vector.
VSCALEFSD/SS Multiply the low DP/SP FP element of a vector by powers of two with exponent specified in
the corresponding element of a second vector.
VSCATTERDD/DQ Scatter SP/DP FP elements in a vector to memory using dword indices.
VSCATTERQD/QQ Scatter SP/DP FP elements in a vector to memory using qword indices.
VSHUFF32X4/64X2 Shuffle 128-bit lanes of a vector with 32/64 bit granular conditional update.
VSHUFI32X4/64X2 Shuffle 128-bit lanes of a vector with 32/64 bit granular conditional update.
512-bit instruction mnemonics in AVX-512DQ that are not Intel AVX or AVX2 promotions include:
VCVT(T)PD2QQ Convert packed DP FP elements of a vector to packed signed 64-bit integers.
VCVT(T)PD2UQQ Convert packed DP FP elements of a vector to packed unsigned 64-bit integers.
VCVT(T)PS2QQ Convert packed SP FP elements of a vector to packed signed 64-bit integers.
VCVT(T)PS2UQQ Convert packed SP FP elements of a vector to packed unsigned 64-bit integers.
VCVTUQQ2PD/PS Convert packed unsigned 64-bit integers to packed DP/SP FP elements.
VEXTRACTF64X2 Extract a vector from a full-length vector with 64-bit granular update.
VEXTRACTI64X2 Extract a vector from a full-length vector with 64-bit granular update.
VFPCLASSPD/PS Test packed DP/SP FP elements in a vector by numeric/special-value category.
VFPCLASSSD/SS Test the low DP/SP FP element by numeric/special-value category.
VINSERTF64X2 Insert a 128-bit vector into a full-length vector with 64-bit granular update.
VINSERTI64X2 Insert a 128-bit vector into a full-length vector with 64-bit granular update.
VPMOVM2D/Q Convert opmask register to vector register in 32/64-bit granularity.
VPMOVD2M/Q2M Convert a vector register in 32/64-bit granularity to an opmask register.
VPMULLQ Multiply packed signed 64-bit integer elements of two vectors and store low 64-bit signed
result.
VRANGEPD/PS Perform RANGE operation on each pair of DP/SP FP elements of two vectors using specified
range primitive in imm8.
VRANGESD/SS Perform RANGE operation on the pair of low DP/SP FP element of two vectors using speci-
fied range primitive in imm8.
VREDUCEPD/PS Perform Reduction operation on packed DP/SP FP elements of a vector using specified
reduction primitive in imm8.
VREDUCESD/SS Perform Reduction operation on the low DP/SP FP element of a vector using specified
reduction primitive in imm8.
512-bit instruction mnemonics in AVX-512BW that are not Intel AVX or AVX2 promotions include:
VDBPSADBW Double block packed Sum-Absolute-Differences on unsigned bytes.
VMOVDQU8/16 VMOVDQU with 8/16-bit granular conditional update.
VPBLENDMB Replaces the VPBLENDVB instruction (using opmask as select control).
VPBLENDMW Blend word elements using opmask as select control.
VPBROADCASTB/W Broadcast from general-purpose register to vector register.
VPCMPB/UB Compare packed signed/unsigned bytes using specified primitive.
VPCMPW/UW Compare packed signed/unsigned words using specified primitive.
VPERMW Permute packed word elements.
VPERMI2B/W Full permute from two tables of byte/word elements overwriting the index vector.
VPMOVM2B/W Convert opmask register to vector register in 8/16-bit granularity.
VPMOVB2M/W2M Convert a vector register in 8/16-bit granularity to an opmask register.
VPMOV(S|US)WB Down convert word elements in a vector to byte elements using truncation (saturation |
unsigned saturation).
VPSLLVW Shift word elements in a vector left by shift counts in a vector.
VPSRAVW Shift words right by shift counts in a vector and shifting in sign bits.
VPSRLVW Shift word elements in a vector right by shift counts in a vector.
VPTESTNMB/W Perform bitwise NAND of byte/word elements of two vectors and write results to opmask.
VPTESTMB/W Perform bitwise AND of byte/word elements of two vectors and write results to opmask.
512-bit instruction mnemonics in AVX-512CD that are not Intel AVX or AVX2 promotions include:
VPBROADCASTM Broadcast from opmask register to vector register.
VPCONFLICTD/Q Detect conflicts within a vector of packed 32/64-bit integers.
VPLZCNTD/Q Count the number of leading zero bits of packed dword/qword elements.
Table 5-3. Supervisor and User Mode Enclave Instruction Leaf Functions in Long-Form of SGX1
Supervisor Instruction Description
ENCLS[EADD] Add a page
ENCLS[EBLOCK] Block an EPC page
ENCLS[ECREATE] Create an enclave
ENCLS[EDBGRD] Read data by debugger
ENCLS[EDBGWR] Write data by debugger
ENCLS[EEXTEND] Extend EPC page measurement
ENCLS[EINIT] Initialize an enclave
ENCLS[ELDB] Load an EPC page as blocked
ENCLS[ELDU] Load an EPC page as unblocked
ENCLS[EPA] Add version array
ENCLS[EREMOVE] Remove a page from EPC
ENCLS[ETRACK] Activate EBLOCK checks
ENCLS[EWB] Write back/invalidate an EPC page
User Instruction Description
ENCLU[EENTER] Enter an Enclave
ENCLU[EEXIT] Exit an Enclave
ENCLU[EGETKEY] Create a cryptographic key
ENCLU[EREPORT] Create a cryptographic report
ENCLU[ERESUME] Re-enter an Enclave
NOTE
For instructions with a CPUID feature flag specifying AVX10, the programmer must check the
available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10
Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such
will determine the set of instructions available to the programmer listed in each instruction’s opcode
table.
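As an illustration only (not part of the manual text), the following C sketch shows one way such a run-time check might be structured with a compiler that provides <cpuid.h>; the sub-leaf and bit-field interpretation used here (AVX10 enumerated in CPUID.(EAX=07H,ECX=01H):EDX[bit 19], version in leaf 24H EBX[7:0], 512-bit support in an EBX bit assumed to be bit 18) should be confirmed against the CPUID instruction reference.
#include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 07H, sub-leaf 01H: EDX bit 19 is assumed to enumerate Intel AVX10. */
    if (!__get_cpuid_count(0x07, 0x01, &eax, &ebx, &ecx, &edx) || !(edx & (1u << 19))) {
        puts("Intel AVX10 not enumerated");
        return 0;
    }

    /* Leaf 24H, sub-leaf 0: the Intel AVX10 Converged Vector ISA leaf.
       EBX[7:0] is assumed to report the AVX10 version number, and a bit in EBX
       (bit 18 assumed here) to report 512-bit vector support. */
    __cpuid_count(0x24, 0x00, eax, ebx, ecx, edx);
    unsigned int version  = ebx & 0xFF;
    int          has_512b = (ebx >> 18) & 1;

    printf("AVX10 version %u, 512-bit vectors %s\n",
           version, has_512b ? "supported" : "not supported");
    return 0;
}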
4. Updates to Chapter 16, Volume 1
Change bars and violet text show changes to Chapter 16 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1: Basic Architecture.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Added new Chapter 16, “Programming with Intel® AVX10.”
16.1 INTRODUCTION
Intel® Advanced Vector Extensions 10 (Intel® AVX10) represents the first major new vector ISA since the introduc-
tion of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) in 2013. This ISA establishes a common,
converged vector instruction set across all Intel architectures, incorporating the modern vectorization aspects of
Intel AVX-512. This ISA will be supported on all future processors, including Performance cores (P-cores) and Effi-
cient cores (E-cores).
The Intel AVX10 ISA represents the latest in ISA innovations, instructions, and features moving forward. Based on
the Intel AVX-512 ISA feature set and including all Intel AVX-512 instructions introduced with Intel® Xeon® 6 P-
core processors based on Granite Rapids microarchitecture, it supports all instruction vector lengths (128, 256, and
512), as well as scalar and opmask instructions. Implementations of Intel AVX10 with vector lengths of at least 256
bits will be supported across all Intel® processors.
Several other important tenets regarding Intel AVX10 enumeration are as follows:
• Versions are expected to be inclusive such that version N+1 is a superset of version N. Once an instruction is
introduced in Intel AVX10.x, it is expected to be carried forward in all subsequent Intel AVX10 versions,
allowing a developer to check only for a version greater than or equal to the desired version.
• Any processor that enumerates support for Intel AVX10 will also enumerate support for Intel AVX and Intel
AVX2.
• Developers can assume that the highest supported vector length for a processor implies that all lesser vector
lengths are also supported. Scalar Intel AVX-512 instructions will be supported independent of the maximum
vector width.
The first version of Intel AVX10 (Version 1, or Intel® AVX10.1) will support only the Intel AVX-512 instruction set
at 128, 256, and 512 bits. Applications written to Intel AVX10.1 will run on any future Intel processor that enumer-
ates Intel AVX10.1 or higher at the matching desired vector lengths. Intel AVX-512 instruction families included in
Intel AVX10.1 are shown in Table 16-2.
Table 16-2. Intel® AVX-512 CPUID Feature Flags Included in Intel® AVX10
Feature Introduction Intel® AVX-512 CPUID Feature Flags Included in Intel® AVX10
Intel® Xeon® Scalable Processor Family based on Skylake microarchitecture: AVX512F, AVX512CD, AVX512BW, AVX512DQ
Intel® Core™ processors based on Cannon Lake microarchitecture: AVX512-VBMI, AVX512-IFMA
2nd generation Intel® Xeon® Scalable Processor Family based on Cascade Lake product: AVX512-VNNI
3rd generation Intel® Xeon® Scalable Processor Family based on Cooper Lake product: AVX512-BF16
3rd generation Intel® Xeon® Scalable Processor Family based on Ice Lake microarchitecture: AVX512-VPOPCNTDQ, AVX512-VBMI2, VAES, GFNI, VPCLMULQDQ, AVX512-BITALG
4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture: AVX512-FP16
NOTE
VAES, VPCLMULQDQ, and GFNI EVEX instructions will be supported on Intel AVX10.1 machines but
will continue to be enumerated by their existing discrete CPUID feature flags. This requires the
developer to check for both the feature and Intel AVX10, e.g., {AVX10.1 AND VAES}.
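For illustration only, a combined check of this kind might look like the following C fragment; the bit positions used (VAES in CPUID.(EAX=07H,ECX=0):ECX[bit 9], Intel AVX10 in CPUID.(EAX=07H,ECX=1):EDX[bit 19]) are assumptions to be verified against the CPUID instruction reference.
#include <cpuid.h>
#include <stdbool.h>

/* Returns true only when both the discrete VAES flag and Intel AVX10 are enumerated. */
static bool vaes_with_avx10(void) {
    unsigned int a, b, c, d;
    bool vaes  = __get_cpuid_count(0x07, 0x00, &a, &b, &c, &d) && (c & (1u << 9));
    bool avx10 = __get_cpuid_count(0x07, 0x01, &a, &b, &c, &d) && (d & (1u << 19));
    return vaes && avx10;
}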
New vector ISA features will only be added to the Intel AVX10 ISA moving forward. While Intel AVX10/512 includes
all Intel AVX-512 instructions, it is important to note that applications compiled to Intel AVX-512 with vector length
limited to 256 bits are not guaranteed to be compatible with an Intel AVX10/256 processor.
Table 16-3. Feature Differences Between Intel® AVX-512 and Intel® AVX10
Feature Intel® AVX-512 Intel® AVX10.1/256 Intel® AVX10.1/512
128-bit vector (XMM) register support Yes Yes Yes
256-bit vector (YMM) register support Yes Yes Yes
512-bit vector (ZMM) register support Yes No Yes
YMM embedded rounding No No No
ZMM embedded rounding Yes No Yes
5. Updates to Chapter 2, Volume 2A
Change bars and violet text show changes to Chapter 2 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2A: Instruction Set Reference, A-L.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Added new exception Type 14 to existing tables.
This chapter describes the instruction format for all Intel 64 and IA-32 processors. The instruction format for
protected mode, real-address mode and virtual-8086 mode is described in Section 2.1. Increments provided for IA-
32e mode and its sub-modes are described in Section 2.2.
ModR/M byte: Mod (bits 7:6), Reg/Opcode (bits 5:3), R/M (bits 2:0). SIB byte: Scale (bits 7:6), Index (bits 5:3), Base (bits 2:0).
1. The REX prefix is optional, but if used must be immediately before the opcode; see Section
2.2.1, “REX Prefixes” for additional information.
2. For VEX encoding information, see Section 2.3, “Intel® Advanced Vector Extensions (Intel®
AVX)”.
3. Some rare instructions can take an 8B immediate or 8B displacement.
— BND prefix is encoded using F2H if the following conditions are true:
• CPUID.(EAX=07H, ECX=0):EBX.MPX[bit 14] is set.
• BNDCFGU.EN and/or IA32_BNDCFGS.EN is set.
• When the F2 prefix precedes a near CALL, a near RET, a near JMP, a short Jcc, or a near Jcc instruction
(see Appendix E, “Intel® Memory Protection Extensions,” of the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 1).
• Group 2
— Segment override prefixes:
• 2EH—CS segment override (use with any branch instruction is reserved).
• 36H—SS segment override prefix (use with any branch instruction is reserved).
• 3EH—DS segment override prefix (use with any branch instruction is reserved).
• 26H—ES segment override prefix (use with any branch instruction is reserved).
• 64H—FS segment override prefix (use with any branch instruction is reserved).
• 65H—GS segment override prefix (use with any branch instruction is reserved).
— Branch hints1:
• 2EH—Branch not taken (used only with Jcc instructions).
• 3EH—Branch taken (used only with Jcc instructions).
• Group 3
• Operand-size override prefix is encoded using 66H (66H is also used as a mandatory prefix for some
instructions).
• Group 4
• 67H—Address-size override prefix.
The LOCK prefix (F0H) forces an operation that ensures exclusive use of shared memory in a multiprocessor envi-
ronment. See “LOCK—Assert LOCK# Signal Prefix” in Chapter 3, “Instruction Set Reference, A-L,” for a description
of this prefix.
Repeat prefixes (F2H, F3H) cause an instruction to be repeated for each element of a string. Use these prefixes
only with string and I/O instructions (MOVS, CMPS, SCAS, LODS, STOS, INS, and OUTS). Use of repeat prefixes
and/or undefined opcodes with other Intel 64 or IA-32 instructions is reserved; such use may cause unpredictable
behavior.
Some instructions may use F2H or F3H as a mandatory prefix to express distinct functionality.
Branch hint prefixes (2EH, 3EH) allow a program to give a hint to the processor about the most likely code path for
a branch when used on conditional branch instructions (Jcc).
The operand-size override prefix allows a program to switch between 16- and 32-bit operand sizes. Either size can
be the default; use of the prefix selects the non-default size.
Some SSE2/SSE3/SSSE3/SSE4 instructions and instructions using a three-byte sequence of primary opcode bytes
may use 66H as a mandatory prefix to express distinct functionality.
Other use of the 66H prefix is reserved; such use may cause unpredictable behavior.
The address-size override prefix (67H) allows programs to switch between 16- and 32-bit addressing. Either size
can be the default; the prefix selects the non-default size. Using this prefix and/or other undefined opcodes when
operands for the instruction do not reside in memory is reserved; such use may cause unpredictable behavior.
1. Microarchitectural behavior varies; refer to the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
2.1.2 Opcodes
A primary opcode can be 1, 2, or 3 bytes in length. An additional 3-bit opcode field is sometimes encoded in the
ModR/M byte. Smaller fields can be defined within the primary opcode. Such fields define the direction of opera-
tion, size of displacements, register encoding, condition codes, or sign extension. Encoding fields used by an
opcode vary depending on the class of operation.
Two-byte opcode formats for general-purpose and SIMD instructions consist of one of the following:
• An escape opcode byte 0FH as the primary opcode and a second opcode byte.
• A mandatory prefix (66H, F2H, or F3H), an escape opcode byte, and a second opcode byte (same as previous
bullet).
For example, CVTDQ2PD consists of the following sequence: F3 0F E6. The first byte is a mandatory prefix (it is not
considered as a repeat prefix).
Three-byte opcode formats for general-purpose and SIMD instructions consist of one of the following:
• An escape opcode byte 0FH as the primary opcode, plus two additional opcode bytes.
• A mandatory prefix (66H, F2H, or F3H), an escape opcode byte, plus two additional opcode bytes (same as
previous bullet).
For example, PHADDW for XMM registers consists of the following sequence: 66 0F 38 01. The first byte is the
mandatory prefix.
Valid opcode expressions are defined in Appendix A and Appendix B.
Example: Mod = 11, R/M = 000, and a /digit opcode extension with REG = 001 yield the ModR/M byte C8H (binary 11001000).
NOTES:
1. The default segment register is SS for the effective addresses containing a BP index, DS for other effective addresses.
2. The disp16 nomenclature denotes a 16-bit displacement that follows the ModR/M byte and that is added to the index.
3. The disp8 nomenclature denotes an 8-bit displacement that follows the ModR/M byte and that is sign-extended and added to the
index.
NOTES:
1. The [--][--] nomenclature means a SIB follows the ModR/M byte.
2. The disp32 nomenclature denotes a 32-bit displacement that follows the ModR/M byte (or the SIB byte if one is present) and that is
added to the index.
3. The disp8 nomenclature denotes an 8-bit displacement that follows the ModR/M byte (or the SIB byte if one is present) and that is
sign-extended and added to the index.
Table 2-3 is organized to give 256 possible values of the SIB byte (in hexadecimal). General purpose registers used
as a base are indicated across the top of the table, along with corresponding values for the SIB byte’s base field.
Table rows in the body of the table indicate the register used as the index (SIB byte bits 3, 4, and 5) and the scaling
factor (determined by SIB byte bits 6 and 7).
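For example, the effective address [EAX + EBX*4] is encoded with scale = 10b, index = 011b (EBX), and base = 000b (EAX), giving the SIB byte value 98H.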
NOTES:
1. The [*] nomenclature means a disp32 with no base if the MOD is 00B. Otherwise, [*] means disp8 or disp32 + [EBP]. This provides the
following address modes:
MOD bits Effective Address
00 [scaled index] + disp32
01 [scaled index] + disp8 + [EBP]
10 [scaled index] + disp32 + [EBP]
Instruction layout: instruction prefixes (Grp 1, Grp 2, Grp 3, Grp 4; optional), a 1-, 2-, or 3-byte opcode, ModR/M (1 byte, if required), SIB (1 byte, if required), an address displacement of 1, 2, or 4 bytes or none, and immediate data of 1, 2, or 4 bytes or none.
2.2.1.1 Encoding
Intel 64 and IA-32 instruction formats specify up to three registers by using 3-bit fields in the encoding, depending
on the format:
• ModR/M: the reg and r/m fields of the ModR/M byte.
• ModR/M with SIB: the reg field of the ModR/M byte, the base and index fields of the SIB (scale, index, base)
byte.
• Instructions without ModR/M: the reg field of the opcode.
In 64-bit mode, these formats do not change. Bits needed to define fields in the 64-bit context are provided by the
addition of REX prefixes.
• REX.B either modifies the base in the ModR/M r/m field or SIB base field; or it modifies the opcode reg field
used for accessing GPRs.
Figure 2-4. Memory Addressing Without an SIB Byte; REX.X Not Used
Figure 2-5. Register-Register Addressing (No Memory Operand); REX.X Not Used
Figure 2-7. Register Operand Coded in Opcode Byte; REX.X & REX.R Not Used
In the IA-32 architecture, byte registers (AH, AL, BH, BL, CH, CL, DH, and DL) are encoded in the ModR/M byte’s
reg field, the r/m field or the opcode reg field as registers 0 through 7. REX prefixes provide an additional
addressing capability for byte-registers that makes the least-significant byte of GPRs available for byte operations.
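For example, when any REX prefix is present, the byte-register encodings 4 through 7 refer to SPL, BPL, SIL, and DIL rather than AH, CH, DH, and BH, so an instruction such as MOV SIL, 1 becomes encodable.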
Certain combinations of the fields of the ModR/M byte and the SIB byte have special meaning for register encod-
ings. For some combinations, fields expanded by the REX prefix are not decoded. Table 2-5 describes how each
case behaves.
2.2.1.3 Displacement
Addressing in 64-bit mode uses existing 32-bit ModR/M and SIB encodings. The ModR/M and SIB displacement
sizes do not change. They remain 8 bits or 32 bits and are sign-extended to 64 bits.
2.2.1.5 Immediates
In 64-bit mode, the typical size of immediate operands remains 32 bits. When the operand size is 64 bits, the
processor sign-extends all immediates to 64 bits prior to their use.
Support for 64-bit immediate operands is accomplished by expanding the semantics of the existing move (MOV
reg, imm16/32) instructions. These instructions (opcodes B8H – BFH) move 16-bits or 32-bits of immediate data
(depending on the effective operand size) into a GPR. When the effective operand size is 64 bits, these instructions
can be used to load an immediate into a GPR. A REX prefix is needed to override the 32-bit default operand size to
a 64-bit operand size.
For example:
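MOV RAX, 1122334455667788H is one such case: it can be encoded as 48 B8 88 77 66 55 44 33 22 11, where the REX.W prefix (48H) promotes the B8H opcode's default 32-bit operand size to 64 bits so that a full 8-byte immediate follows.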
In 64-bit mode, the ModR/M Disp32 (32-bit displacement) encoding is re-defined to be RIP+Disp32 rather than
displacement-only. See Table 2-7.
The ModR/M encoding for RIP-relative addressing does not depend on using a prefix. Specifically, the r/m bit field
encoding of 101B (used to select RIP-relative addressing) is not affected by the REX prefix. For example, selecting
R13 (REX.B = 1, r/m = 101B) with mod = 00B still results in RIP-relative addressing. The 4-bit r/m field of REX.B
combined with ModR/M is not fully decoded. In order to address R13 with no displacement, software must encode
R13 + 0 using a 1-byte displacement of zero.
RIP-relative addressing is enabled by 64-bit mode, not by a 64-bit address-size. The use of the address-size prefix
does not disable RIP-relative addressing. The effect of the address-size prefix is to truncate and zero-extend the
computed effective address to 32 bits.
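For example, in 64-bit mode the byte sequence 8B 05 00 10 00 00 is MOV EAX, [RIP+1000H]; the 32-bit displacement is measured from the address of the byte immediately following the instruction.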
The VEX.L (vector length) field encodes: 0 = scalar or 128-bit vector; 1 = 256-bit vector.
The following subsections describe the various fields in the two- or three-byte VEX prefix.
2.3.5.6 2-byte VEX Byte 1, bits[6:3] and 3-byte VEX Byte 2, bits [6:3]- ‘vvvv’ the Source or Dest
Register Specifier
In 32-bit mode the VEX first byte C4 and C5 alias onto the LES and LDS instructions. To maintain compatibility with
existing programs the VEX 2nd byte, bits [7:6] must be 11b. To achieve this, the VEX payload bits are selected to
place only inverted, 64-bit valid fields (extended register selectors) in these upper bits.
The 2-byte VEX Byte 1, bits [6:3] and the 3-byte VEX, Byte 2, bits [6:3] encode a field (shorthand VEX.vvvv) that
for instructions with 2 or more source registers and an XMM or YMM or memory destination encodes the first source
register specifier stored in inverted (1’s complement) form.
VEX.vvvv is not used by instructions with one source (except certain shifts; see below) or by instructions with
no XMM, YMM, or memory destination. If an instruction does not use VEX.vvvv, it should be set to 1111b;
otherwise, the instruction will #UD.
In 64-bit mode all 4 bits may be used. See Table for the encoding of the XMM or YMM registers. In 32-bit and 16-
bit modes bit 6 must be 1 (if bit 6 is not 1, the 2-byte VEX version will generate LDS instruction and the 3-byte VEX
version will ignore this bit).
The VEX.vvvv field is encoded in bit inverted format for accessing a register operand.
VEX.m-mmmm is only available on the 3-byte VEX. The 2-byte VEX implies a leading 0Fh opcode byte.
2.3.6.2 2-byte VEX byte 1, bit[2], and 3-byte VEX byte 2, bit [2]- “L”
The vector length field, VEX.L, is encoded in bit[2] of either the second byte of 2-byte VEX, or the third byte of 3-
byte VEX. If “VEX.L = 1”, it indicates 256-bit vector operation. “VEX.L = 0” indicates scalar and 128-bit vector
operations.
The instruction VZEROUPPER is a special case that is encoded with VEX.L = 0, although its operation zeroes bits
255:128 of all YMM registers accessible in the current operating mode. See Table 2-11.
2.3.6.3 2-byte VEX byte 1, bits[1:0], and 3-byte VEX byte 2, bits [1:0]- “pp”
Up to one implied prefix is encoded by bits[1:0] of either the 2-byte VEX byte 1 or the 3-byte VEX byte 2. The prefix
behaves as if it was encoded prior to VEX, but after all other encoded prefixes. See Table 2-12.
2.3.10 Intel® AVX Instructions and the Upper 128-bits of YMM registers
If an instruction with a destination XMM register is encoded with a VEX prefix, the processor zeroes the upper bits
(bits above bit 127) of the equivalent YMM register. Legacy SSE instructions without VEX preserve the upper bits.
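For example, VADDPS xmm1, xmm2, xmm3 zeroes bits 255:128 of YMM1 (and any higher bits of the corresponding ZMM register), whereas the legacy ADDPS xmm1, xmm2 leaves those upper bits unchanged.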
recommended that software handling involuntary calls accommodate this by not executing instructions encoded
with VEX.128 and VEX.256 prefixes. In the event that it is not possible or desirable to restrict these instructions,
then software must take special care to avoid actions that would, on future processors, zero the upper bits of vector
registers.
Processors that support further vector-register extensions (defining bits beyond bit 255) will also extend the
XSAVE and XRSTOR instructions to save and restore these extensions. To ensure forward compatibility, software
that handles involuntary calls and that uses instructions encoded with VEX.128 and VEX.256 prefixes should first
save and then restore the vector registers (with any extensions) using the XSAVE and XRSTOR instructions with
save/restore masks that set bits that correspond to all vector-register extensions. Ideally, software should rely on
a mechanism that is cognizant of which bits to set. (E.g., an OS mechanism that sets the save/restore mask bits
for all vector-register extensions that are enabled in XCR0.) Saving and restoring state with instructions other than
XSAVE and XRSTOR will, on future processors with wider vector registers, corrupt the extended state of the vector
registers - even if doing so functions correctly on processors supporting 256-bit vector registers. (The same is true
if XSAVE and XRSTOR are used with a save/restore mask that does not set bits corresponding to all supported
extensions to the vector registers.)
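As a non-normative illustration, user-level code commonly verifies operating-system support for the extended register state before using VEX- or EVEX-encoded instructions by checking CPUID.1:ECX.OSXSAVE and then reading XCR0 with XGETBV. A minimal C sketch follows, using the _xgetbv intrinsic (available when the toolchain is built with XSAVE support).
#include <cpuid.h>
#include <immintrin.h>   /* _xgetbv */
#include <stdbool.h>

/* True when the OS has enabled saving/restoring of SSE (XCR0 bit 1) and AVX (XCR0 bit 2) state. */
static bool os_supports_avx_state(void) {
    unsigned int a, b, c, d;
    if (!__get_cpuid(1, &a, &b, &c, &d))
        return false;
    if (!(c & (1u << 27)))            /* CPUID.1:ECX.OSXSAVE must be set before XGETBV is legal */
        return false;
    unsigned long long xcr0 = _xgetbv(0);
    return (xcr0 & 0x6) == 0x6;       /* XCR0[2:1] = 11b: XMM and YMM state enabled */
}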
register (if present) is used to denote a stride between memory rows. The index register is scaled by the sib.scale
field as usual. The base register is added to the displacement, if present.
In the instruction encoding, the ModR/M byte is represented several ways depending on the role it plays. The
ModR/M byte has 3 fields: 2-bit ModR/M.mod field, a 3-bit ModR/M.reg field and a 3-bit ModR/M.r/m field. When all
bits of the ModR/M byte have fixed values for an instruction, the 2-hex nibble value of that byte is presented after
the opcode in the encoding boxes on the instruction description pages. When only some fields of the ModR/M byte
must contain fixed values, those values are specified as follows:
• If only the ModR/M.mod must be 0b11, and ModR/M.reg and ModR/M.r/m fields are unrestricted, this is
denoted as 11:rrr:bbb. The rrr correspond to the 3-bits of the ModR/M.reg field and the bbb correspond to the
3-bits of the ModR/M.r/m field.
• If the ModR/M.mod field is constrained to be a value other than 0b11, i.e., it must be one of 0b00, 0b01, or
0b10, then the notation !(11) is used.
• If the ModR/M.reg field had a specific required value, e.g., 0b101, that would be denoted as mm:101:bbb.
NOTE
Historically this document only specified the ModR/M.reg field restrictions with the notation /0 ... /7
and did not specify restrictions on the ModR/M.mod and ModR/M.r/m fields in the encoding boxes.
NOTE
Instructions that operate only with MMX, X87, or general-purpose registers are not covered by the
exception classes defined in this section. For instructions that operate on MMX registers, see
Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Registers”
in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
(*) - Additional exception restrictions are present - see the Instruction description for details
(**) - Instruction behavior on alignment check reporting with mask bits of less than all 1s is the same as with mask bits of all 1s, i.e., no
alignment checks are performed.
(***) - PCMPESTRI, PCMPESTRM, PCMPISTRI, PCMPISTRM, and LDDQU instructions do not cause #GP if the memory operand is not
aligned to 16-Byte boundary.
Table 2-15 classifies exception behaviors for Intel AVX instructions. Within each class of exception conditions that
are listed in Table 2-18 through Table 2-27, certain subsets of Intel AVX instructions may be subject to #UD excep-
tion depending on the encoded value of the VEX.L field. Table 2-16 and Table 2-17 provide supplemental informa-
tion of Intel AVX instructions that may be subject to #UD exception if encoded with incorrect values in the VEX.W
or VEX.L field.
Type 3
VMASKMOVDQU, VMPSADBW, VPABSB/W/D, VPCMP(E/I)STRI/M,
VPACKSSWB/DW, VPACKUSWB/DW, VPADDB/W/D, VPADDQ, PHMINPOSUW
VPADDSB/W, VPADDUSB/W, VPALIGNR, VPAND, VPANDN,
VPAVGB/W, VPBLENDVB, VPBLENDW, VPCMP(E/I)STRI/M,
VPCMPEQB/W/D/Q, VPCMPGTB/W/D/Q, VPHADDW/D,
VPHADDSW, VPHMINPOSUW, VPHSUBD/W, VPHSUBSW,
VPMADDWD, VPMADDUBSW, VPMAXSB/W/D,
Type 4
VPMAXUB/W/D, VPMINSB/W/D, VPMINUB/W/D, VPMULHUW,
VPMULHRSW, VPMULHW/LW, VPMULLD, VPMULUDQ,
VPMULDQ, VPOR, VPSADBW, VPSHUFB/D, VPSHUFHW/LW,
VPSIGNB/W/D, VPSLLW/D/Q, VPSRAW/D, VPSRLW/D/Q,
VPSUBB/W/D/Q, VPSUBSB/W, VPUNPCKHBW/WD/DQ,
VPUNPCKHQDQ, VPUNPCKLBW/WD/DQ, VPUNPCKLQDQ,
VPXOR
VEXTRACTPS, VINSERTPS, VMOVD, VMOVQ, VMOVLPD, Same as column 3
VMOVLPS, VMOVHPD, VMOVHPS, VPEXTRB, VPEXTRD,
Type 5
VPEXTRW, VPEXTRQ, VPINSRB, VPINSRD, VPINSRW,
VPINSRQ, VPMOVSX/ZX, VLDMXCSR, VSTMXCSR
VEXTRACTF128,
VPERM2F128,
Type 6 VBROADCASTSD,
VBROADCASTF128,
VINSERTF128,
VMOVLHPS, VMOVHLPS, VPMOVMSKB, VPSLLDQ, VPSRLDQ, VMOVLHPS, VMOVHLPS
Type 7 VPSLLW, VPSLLD, VPSLLQ, VPSRAW, VPSRAD, VPSRLW,
VPSRLD, VPSRLQ
Type 8
Type 11
Type 12
Protected and
Compatibility
Virtual-8086
64-bit
Real
Exception Cause of Exception
X X VEX prefix.
VEX prefix:
X X If XCR0[2:1] ≠ ‘11b’.
If CR4.OSXSAVE[bit 18]=0.
Invalid Opcode, Legacy SSE instruction:
#UD X X X X If CR0.EM[bit 2] = 1.
If CR4.OSFXSR[bit 9] = 0.
X X X X If preceded by a LOCK prefix (F0H).
X X If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
X X X X If any corresponding CPUID feature flag is ‘0’.
Device Not Avail-
X X X X If CR0.TS[bit 3]=1.
able, #NM
X For an illegal address in the SS segment.
Stack, #SS(0)
X If a memory address referencing the SS segment is in a non-canonical form.
VEX.256: Memory operand is not 32-byte aligned.
X X
VEX.128: Memory operand is not 16-byte aligned.
X X X X Legacy SSE: Memory operand is not 16-byte aligned.
General Protec-
For an illegal memory operand effective address in the CS, DS, ES, FS or GS seg-
tion, #GP(0) X
ments.
X If the memory address is in a non-canonical form.
X X If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault
X X X For a page fault.
#PF(fault-code)
Protected and
Compatibility
Virtual 8086
64-bit
Real
Exception Cause of Exception
X X VEX prefix.
X X X X If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0.
VEX prefix:
X X If XCR0[2:1] ≠ ‘11b’.
If CR4.OSXSAVE[bit 18]=0.
Invalid Opcode,
Legacy SSE instruction:
#UD
X X X X If CR0.EM[bit 2] = 1.
If CR4.OSFXSR[bit 9] = 0.
X X X X If preceded by a LOCK prefix (F0H).
X X If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
X X X X If any corresponding CPUID feature flag is ‘0’.
Device Not Avail-
X X X X If CR0.TS[bit 3]=1.
able, #NM
X For an illegal address in the SS segment.
Stack, #SS(0)
X If a memory address referencing the SS segment is in a non-canonical form.
X X X X Legacy SSE: Memory operand is not 16-byte aligned.
General Protec- X For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.
tion, #GP(0) X If the memory address is in a non-canonical form.
X X If any part of the operand lies outside the effective address space from 0 to FFFFH.
Page Fault
X X X For a page fault.
#PF(fault-code)
SIMD Floating-
point Exception, X X X X If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 1.
#XM
Protected and
Compatibility
Virtual-8086
64-bit
Exception Real Cause of Exception
X X VEX prefix.
X X X X If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0.
VEX prefix:
X X If XCR0[2:1] ≠ ‘11b’.
If CR4.OSXSAVE[bit 18]=0.
Invalid Opcode, #UD Legacy SSE instruction:
X X X X If CR0.EM[bit 2] = 1.
If CR4.OSFXSR[bit 9] = 0.
X X X X If preceded by a LOCK prefix (F0H).
X X If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
X X X X If any corresponding CPUID feature flag is ‘0’.
Device Not Available,
X X X X If CR0.TS[bit 3]=1.
#NM
X For an illegal address in the SS segment.
Stack, #SS(0)
X If a memory address referencing the SS segment is in a non-canonical form.
For an illegal memory operand effective address in the CS, DS, ES, FS or GS seg-
X
ments.
General Protection,
X If the memory address is in a non-canonical form.
#GP(0)
If any part of the operand lies outside the effective address space from 0 to
X X
FFFFH.
Page Fault
X X X For a page fault.
#PF(fault-code)
Alignment Check For 2, 4, or 8 byte memory access if alignment checking is enabled and an
X X X
#AC(0) unaligned memory access is made while the current privilege level is 3.
SIMD Floating-point
X X X X If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 1.
Exception, #XM
2.5.4 Exceptions Type 4 (>=16 Byte Mem Arg, No Alignment, No Floating-point Exceptions)
Protected and
Compatibility
Virtual-8086
64-bit
Exception Real Cause of Exception
X X VEX prefix.
VEX prefix:
X X If XCR0[2:1] ≠ ‘11b’.
If CR4.OSXSAVE[bit 18]=0.
Legacy SSE instruction:
Invalid Opcode, #UD
X X X X If CR0.EM[bit 2] = 1.
If CR4.OSFXSR[bit 9] = 0.
X X X X If preceded by a LOCK prefix (F0H).
X X If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
X X X X If any corresponding CPUID feature flag is ‘0’.
Device Not Available,
X X X X If CR0.TS[bit 3]=1.
#NM
X For an illegal address in the SS segment.
Stack, #SS(0)
X If a memory address referencing the SS segment is in a non-canonical form.
X X X X Legacy SSE: Memory operand is not 16-byte aligned.1
For an illegal memory operand effective address in the CS, DS, ES, FS or GS seg-
X
General Protection, ments.
#GP(0) X If the memory address is in a non-canonical form.
If any part of the operand lies outside the effective address space from 0 to
X X
FFFFH.
Page Fault
X X X For a page fault.
#PF(fault-code)
NOTES:
1. LDDQU, MOVUPD, MOVUPS, PCMPESTRI, PCMPESTRM, PCMPISTRI, and PCMPISTRM instructions do not cause #GP if the memory
operand is not aligned to 16-Byte boundary.
Protected and
Compatibility
Virtual-8086
64-bit
Exception Real Cause of Exception
X X VEX prefix.
VEX prefix:
X X If XCR0[2:1] ≠ ‘11b’.
If CR4.OSXSAVE[bit 18]=0.
Legacy SSE instruction:
Invalid Opcode, #UD
X X X X If CR0.EM[bit 2] = 1.
If CR4.OSFXSR[bit 9] = 0.
X X X X If preceded by a LOCK prefix (F0H).
X X If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
X X X X If any corresponding CPUID feature flag is ‘0’.
Device Not Available,
X X X X If CR0.TS[bit 3]=1.
#NM
X For an illegal address in the SS segment.
Stack, #SS(0)
X If a memory address referencing the SS segment is in a non-canonical form.
For an illegal memory operand effective address in the CS, DS, ES, FS or GS seg-
X
ments.
General Protection,
X If the memory address is in a non-canonical form.
#GP(0)
If any part of the operand lies outside the effective address space from 0 to
X X
FFFFH.
Page Fault
X X X For a page fault.
#PF(fault-code)
Alignment Check For 2, 4, or 8 byte memory access if alignment checking is enabled and an
X X X
#AC(0) unaligned memory access is made while the current privilege level is 3.
Protected and
Compatibility
Virtual-8086
64-bit
Real
Exception Cause of Exception
X X VEX prefix.
If XCR0[2:1] ≠ ‘11b’.
X X
If CR4.OSXSAVE[bit 18]=0.
Invalid Opcode, #UD
X X If preceded by a LOCK prefix (F0H).
X X If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
X X If any corresponding CPUID feature flag is ‘0’.
Device Not Available,
X X If CR0.TS[bit 3]=1.
#NM
X For an illegal address in the SS segment.
Stack, #SS(0)
X If a memory address referencing the SS segment is in a non-canonical form.
For an illegal memory operand effective address in the CS, DS, ES, FS or GS seg-
General Protection, X
ments.
#GP(0)
X If the memory address is in a non-canonical form.
Page Fault
X X For a page fault.
#PF(fault-code)
Alignment Check For 2, 4, or 8 byte memory access if alignment checking is enabled and an
X X
#AC(0) unaligned memory access is made while the current privilege level is 3.
Protected and
Compatibility
Virtual-8086
64-bit
Exception Real Cause of Exception
X X VEX prefix.
VEX prefix:
X X If XCR0[2:1] ≠ ‘11b’.
If CR4.OSXSAVE[bit 18]=0.
Legacy SSE instruction:
Invalid Opcode, #UD
X X X X If CR0.EM[bit 2] = 1.
If CR4.OSFXSR[bit 9] = 0.
X X X X If preceded by a LOCK prefix (F0H).
X X If any REX, F2, F3, or 66 prefixes precede a VEX prefix.
X X X X If any corresponding CPUID feature flag is ‘0’.
Device Not Available,
X X If CR0.TS[bit 3]=1.
#NM
2.5.10 Exceptions Type 12 (VEX-only, VSIB Mem Arg, No AC, No Floating-point Exceptions)
Any VEX-encoded GPR instruction with a 66H, F2H, or F3H prefix preceding VEX will #UD.
Any VEX-encoded GPR instruction with a REX prefix preceding VEX will #UD.
VEX-encoded GPR instructions are not supported in real and virtual 8086 modes.
(*) - Additional exception restrictions are present - see the Instruction description for details.
EVEX-encoded instruction layout: [Prefixes] (optional), EVEX (4 bytes), Opcode (1 byte), ModR/M (1 byte), [SIB] (1 byte), [Disp16,32] of 2 or 4 bytes or [Disp8*N] of 1 byte, and [Immediate] (1 byte, if required).
Figure 2-10. Intel® AVX-512 Instruction Format and the EVEX Prefix
The EVEX prefix is a 4-byte prefix, with the first two bytes derived from unused encoding form of the 32-bit-mode-
only BOUND instruction. The layout of the EVEX prefix is shown in Figure 2-11. The first byte must be 62H, followed
by three payload bytes, denoted as P0, P1, and P2 individually or collectively as P[23:0] (see Figure 2-11).
EVEX 62H P0 P1 P2
7 6 5 4 3 2 1 0
P0 R X B R’ 0 m m m P[7:0]
7 6 5 4 3 2 1 0
P1 W v v v v 1 p p P[15:8]
7 6 5 4 3 2 1 0
P2 z L’ L b V’ a a a P[23:16]
The bit fields in P[23:0] are divided into the following functional groups (Table 2-32 provides a tabular summary):
• Reserved bits: P[3] must be 0, otherwise #UD.
• Fixed-value bit: P[10] must be 1, otherwise #UD.
• Compressed legacy prefix/escape bytes: P[1:0] is identical to the lowest 2 bits of VEX.mmmmm; P[9:8] is
identical to VEX.pp.
• EVEX.mmm: P[2:0] provides access to up to eight decoding maps. Currently, only the following decoding maps
are supported: 1, 2, 3, 5, and 6. Map ids 1, 2, and 3 are denoted by 0F, 0F38, and 0F3A, respectively, in the
instruction encoding descriptions.
• Operand specifier modifier bits for vector register, general purpose register, memory addressing: P[7:5] allows
access to the next set of 8 registers beyond the low 8 registers when combined with ModR/M register specifiers.
• Operand specifier modifier bit for vector register: P[4] (or EVEX.R’) allows access to the high 16 vector register
set when combined with P[7] and ModR/M.reg specifier; P[6] can also provide access to a high 16 vector
register when SIB or VSIB addressing are not needed.
• Non-destructive source /vector index operand specifier: P[19] and P[14:11] encode the second source vector
register operand in a non-destructive source syntax, vector index register operand can access an upper 16
vector register using P[19].
• Op-mask register specifiers: P[18:16] encodes op-mask register set k0-k7 in instructions operating on vector
registers.
• EVEX.W: P[15] is similar to VEX.W which serves either as opcode extension bit or operand size promotion to
64-bit in 64-bit mode.
• Vector destination merging/zeroing: P[23] encodes the destination result behavior, which either zeroes the
masked elements or leaves the masked elements unchanged.
• Broadcast/Static-rounding/SAE context bit: P[20] encodes multiple functions, which differ across different
classes of instructions and can affect the meaning of the remaining field (EVEX.L’L). The functionality for the
following instruction classes is:
— Broadcasting a single element across the destination vector register: this applies to the instruction class
with Load+Op semantic where one of the source operands is from memory.
— Redirecting the L’L field (P[22:21]) as static rounding control for floating-point instructions with rounding
semantic. Static rounding control overrides the MXCSR.RC field and implies “Suppress all exceptions” (SAE).
— Enabling SAE for floating-point instructions with arithmetic semantic that is not rounding.
— For instruction classes outside of the aforementioned three classes, setting EVEX.b will cause #UD.
• Vector length/rounding control specifier: P[22:21] can serve one of three options.
— Vector length information for packed vector instructions.
— Ignored for instructions operating on vector register content as a single data element.
— Rounding control for floating-point instructions that have a rounding semantic and whose source and
destination operands are all vector registers.
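The following C sketch, provided here only as an illustration of the payload layout in Figure 2-11 and the field groups listed above, extracts the individual EVEX fields from the three payload bytes; note that R, X, B, R’, V’, and vvvv are carried in inverted (1’s complement) form.
#include <stdint.h>
#include <stdio.h>

/* Decode the three EVEX payload bytes P0, P1, P2 (the bytes following 62H). */
static void decode_evex(uint8_t p0, uint8_t p1, uint8_t p2) {
    /* P0: R X B R' 0 m m m  (R/X/B/R' are stored inverted) */
    unsigned R    = ((p0 >> 7) & 1) ^ 1;
    unsigned X    = ((p0 >> 6) & 1) ^ 1;
    unsigned B    = ((p0 >> 5) & 1) ^ 1;
    unsigned Rp   = ((p0 >> 4) & 1) ^ 1;
    unsigned mmm  =  p0 & 0x7;          /* decoding map: 1, 2, 3, 5, or 6 */

    /* P1: W v v v v 1 p p  (vvvv is stored inverted) */
    unsigned W    = (p1 >> 7) & 1;
    unsigned vvvv = (~(p1 >> 3)) & 0xF;
    unsigned pp   =  p1 & 0x3;          /* compressed legacy prefix: none/66/F3/F2 */

    /* P2: z L' L b V' a a a  (V' is stored inverted) */
    unsigned z    = (p2 >> 7) & 1;
    unsigned LpL  = (p2 >> 5) & 0x3;    /* EVEX.L'L: vector length or rounding control */
    unsigned bbit = (p2 >> 4) & 1;      /* EVEX.b: broadcast/rounding/SAE context */
    unsigned Vp   = ((p2 >> 3) & 1) ^ 1;
    unsigned aaa  =  p2 & 0x7;          /* opmask register k0-k7 */

    printf("map=%u pp=%u W=%u vvvv=%u L'L=%u b=%u z=%u aaa=%u R=%u X=%u B=%u R'=%u V'=%u\n",
           mmm, pp, W, vvvv, LpL, bbit, z, aaa, R, X, B, Rp, Vp);
}

int main(void) {
    /* Example payload bytes only; any EVEX-encoded instruction's bytes after 62H could be passed in. */
    decode_evex(0xF1, 0xFD, 0x48);
    return 0;
}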
Table 2-33. 32-Register Support in 64-bit Mode Using EVEX with Embedded REX Bits
4 (see Note 1) 3 [2:0] Reg. Type Common Usages
REG EVEX.R’ REX.R modrm.reg GPR, Vector Destination or Source
VVVV EVEX.V’ EVEX.vvvv GPR, Vector 2nd Source or Destination
RM EVEX.X EVEX.B modrm.r/m GPR, Vector 1st Source or Destination
BASE 0 EVEX.B modrm.r/m GPR memory addressing
INDEX 0 EVEX.X sib.index GPR memory addressing
VIDX EVEX.V’ EVEX.X sib.index Vector VSIB memory addressing
NOTES:
1. Not applicable for accessing general purpose registers.
The mapping of register operands used by the various instruction syntaxes and memory addressing forms in 32-bit modes is
shown in Table 2-34.
NOTES:
1. Instructions that overwrite the conditional mask in opmask do not permit using k0 as the embedded mask.
1 64bit 1 {1tox} 8 8 8
0 32bit 0 none 8 16 32
Half Load+Op (Half Vector)
1 32bit 0 {1tox} 4 4 4
Table 2-37. EVEX DISP8*N for Instructions Not Affected by Embedded Broadcast
TupleType InputSize EVEX.W N (VL= 128) N (VL= 256) N (VL= 512) Comment
Full Mem N/A N/A 16 32 64 Load/store or subDword full vector
Tuple1 Scalar 8bit N/A 1 1 1 1 Tuple
Tuple1 Scalar 16bit N/A 2 2 2
Tuple1 Scalar 32bit 0 4 4 4
Tuple1 Scalar 64bit 1 8 8 8
Tuple1 Fixed 32bit N/A 4 4 4 1 Tuple, memsize not affected by EVEX.W
Tuple1 Fixed 64bit N/A 8 8 8
Tuple2 32bit 0 8 8 8 Broadcast (2 elements)
Tuple2 64bit 1 NA 16 16
Tuple4 32bit 0 NA 16 16 Broadcast (4 elements)
Tuple4 64bit 1 NA NA 32
Tuple8 32bit 0 NA NA 32 Broadcast (8 elements)
Half Mem N/A N/A 8 16 32 SubQword Conversion
Quarter Mem N/A N/A 4 8 16 SubDword Conversion
Eighth Mem N/A N/A 2 4 8 SubWord Conversion
Mem128 N/A N/A 16 16 16 Shift count from memory
MOVDDUP N/A N/A 8 32 64 VMOVDDUP
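As a worked example, a full-vector load such as VMOVAPS (Full Mem tuple) at VL = 512 uses N = 64, so an encoded disp8 value of 2 addresses an offset of 2 × 64 = 128 bytes from the base; the same disp8 at VL = 128 addresses an offset of 32 bytes.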
Table 2-38. EVEX Embedded Broadcast/Rounding/SAE and Vector Length on Vector Instructions
Position: EVEX.b = P2[4]; EVEX.L’L = P2[6:5]; EVEX.RC = P2[6:5]
• Reg-reg, FP instructions with rounding semantic or SAE: EVEX.b enables static rounding control (SAE implied); the vector length is implied (512 bit or scalar); EVEX.RC selects 00b: SAE + RNE, 01b: SAE + RD, 10b: SAE + RU, 11b: SAE + RZ.
• Load+op instructions with a memory source: EVEX.b is the broadcast control; EVEX.L’L selects the vector length: 00b: 128-bit, 01b: 256-bit, 10b: 512-bit, 11b: reserved (#UD); EVEX.RC: not applicable.
• Other instructions (explicit load/store/broadcast/gather/scatter): EVEX.b must be 0 (otherwise #UD); EVEX.L’L selects the vector length as above; EVEX.RC: not applicable.
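For example, an assembler encoding of VADDSD xmm1, xmm2, xmm3 {rz-sae} sets EVEX.b = 1 on a register-register FP instruction with rounding semantic, so EVEX.L’L is reinterpreted as EVEX.RC = 11b, selecting round-toward-zero with all exceptions suppressed regardless of MXCSR.RC.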
Table 2-42 lists the #UD conditions for encoding the opmask register using EVEX.aaa and EVEX.z.
Table 2-43 lists the #UD conditions of EVEX bit fields that depend on the context of EVEX.b.
Protected and
Virtual 80x86
Compatibility
64-bit
Real
General Protection, If fault suppression not set, and an illegal memory operand effective address in the
X
#GP(0) CS, DS, ES, FS or GS segments.
X If fault suppression not set, and the memory address is in a non-canonical form.
If fault suppression not set, and any part of the operand lies outside the effective
X X
address space from 0 to FFFFH.
Page Fault
X X X If fault suppression not set, and a page fault.
#PF(fault-code)
EVEX-encoded instructions that have memory alignment restrictions but do not support memory fault suppression
follow exception class E1NF.
EVEX-encoded scalar instructions with arithmetic semantic that do not support memory fault suppression follow
exception class E3NF.
Virtual 80x86
Protected and
Compatibility
64-bit
Exception Real Cause of Exception
X X EVEX prefix.
X X X X If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0.
If CR4.OSXSAVE[bit 18]=0.
If any one of following conditions applies:
• State requirement, Table 2-39 not met.
X X • Opcode independent #UD condition in Table 2-40.
Invalid Opcode, #UD • Operand encoding #UD conditions in Table 2-41.
• Opmask encoding #UD condition of Table 2-42.
• EVEX.b encoding #UD condition of Table 2-43.
X X X X If preceded by a LOCK prefix (F0H).
X X If any REX, F2, F3, or 66 prefixes precede a EVEX prefix.
X X X X If any corresponding CPUID feature flag is ‘0’.
Device Not Available,
X X X X If CR0.TS[bit 3]=1.
#NM
X For an illegal address in the SS segment.
Stack, #SS(0)
X If a memory address referencing the SS segment is in a non-canonical form.
For an illegal memory operand effective address in the CS, DS, ES, FS or GS seg-
X
ments.
General Protection,
X If the memory address is in a non-canonical form.
#GP(0)
If any part of the operand lies outside the effective address space from 0 to
X X
FFFFH.
Page Fault #PF(fault-
X X X For a page fault.
code)
Alignment Check For 2, 4, or 8 byte memory access if alignment checking is enabled and an
X X X
#AC(0) unaligned memory access is made while the current privilege level is 3.
SIMD Floating-point If an unmasked SIMD floating-point exception, {sae} or {er} not set, and CR4.OSX-
X X X X
Exception, #XM MMEXCPT[bit 10] = 1.
EVEX-encoded vector instructions that neither cause a SIMD FP exception nor support memory fault suppression
follow exception class E4NF.
EVEX-encoded scalar/partial-vector instructions that neither cause a SIMD FP exception nor support memory fault
suppression follow exception class E5NF.
EVEX-encoded instructions that neither cause a SIMD FP exception nor support memory fault suppression follow
exception class E6NF.
EVEX-encoded vector or partial-vector instructions that must be encoded with EVEX.L’L = 0, that do not cause a SIMD FP
exception, and that do not support memory fault suppression follow exception class E9NF.
EVEX-encoded scalar instructions that ignore EVEX.L’L vector length encoding, do not cause a SIMD FP exception,
and do not support memory fault suppression follow exception class E10NF.
2.8.10 Exceptions Type E11 (EVEX-only, Mem Arg, No AC, Floating-point Exceptions)
EVEX-encoded instructions that can cause a SIMD FP exception and whose memory operands support fault suppression, but that do
not cause #AC, follow exception class E11.
2.8.11 Exceptions Type E12 and E12NP (VSIB Mem Arg, No AC, No Floating-point Exceptions)
EVEX-encoded prefetch instructions that do not cause #PF follow exception class E12NP.
Table 2-65. TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg)
Table 2-66. TYPE K21 Exception Definition (VEX-Encoded OpMask Instructions Addressing Memory)
6. Updates to Chapter 3, Volume 2A
Change bars and violet text show changes to Chapter 3 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2A: Instruction Set Reference, A-L.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated the BSF and BSR instructions to indicate that BSF and BSR leave the destination operand unmodified
if the source operand is zero. Added a footnote to confirm that using a 32-bit operand size on some older
processors may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.
• Added the CMPccXADD instruction.
• Updated the CPUID instruction to add enumeration for the AVX10, AVX-IFMA, AVX-NE-CONVERT, AVX-VNNI-
INT8, AVX-VNNI-INT16, CMPCCXADD, LAM, LASS, MSRLIST, PREFETCHI, WRMSRNS, SHA512, SM3, SM4, UC-
lock disable, and UIRET_UIF features. Added CPUID Leaf 24H. Corrected typos as needed.
• Added Intel® AVX10.1 information to the following instructions:
— ADDPD
— ADDPS
— ADDSD
— ADDSS
— AESDEC
— AESDECLAST
— AESENC
— AESENCLAST
— ANDNPD
— ANDNPS
— ANDPD
— ANDPS
— CMPPD
— CMPPS
— CMPSD
— CMPSS
— COMISD
— COMISS
— CVTDQ2PD
— CVTDQ2PS
— CVTPD2DQ
— CVTPD2PS
— CVTPS2DQ
— CVTPS2PD
— CVTSD2SI
— CVTSD2SS
— CVTSI2SD
— CVTSI2SS
— CVTSS2SD
— CVTSS2SI
— CVTTPD2DQ
— CVTTPS2DQ
— CVTTSD2SI
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Adds two, four or eight packed double precision floating-point values from the first source operand to the second
source operand, and stores the packed double precision floating-point result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.
VEX.128 encoded version: the first source operand is a XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] := SRC1[i+63:i] + SRC2[63:0]
ELSE
DEST[i+63:i] := SRC1[i+63:i] + SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
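As a non-normative illustration of the masked operation above, the compiler intrinsic that maps to the EVEX-encoded 512-bit form can be used as follows (this assumes a toolchain and processor with AVX-512/Intel AVX10 support, and shows the merging-masking behavior described).
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512d src = _mm512_set1_pd(-1.0);               /* values preserved where k1[j] = 0 */
    __m512d a   = _mm512_set1_pd(1.5);
    __m512d b   = _mm512_set1_pd(2.5);
    __mmask8 k1 = 0x0F;                                /* update only the low four elements */

    /* VADDPD zmm {k1}, zmm, zmm with merging-masking. */
    __m512d r = _mm512_mask_add_pd(src, k1, a, b);

    double out[8];
    _mm512_storeu_pd(out, r);
    for (int j = 0; j < 8; j++)
        printf("element %d = %.1f\n", j, out[j]);      /* 4.0 for j < 4, -1.0 otherwise */
    return 0;
}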
Other Exceptions
VEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Adds four, eight or sixteen packed single precision floating-point values from the first source operand with the
second source operand, and stores the packed single precision floating-point result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.
VEX.128 encoded version: the first source operand is a XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] := SRC1[i+31:i] + SRC2[31:0]
ELSE
DEST[i+31:i] := SRC1[i+31:i] + SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Adds the low double precision floating-point values from the second source operand and the first source operand
and stores the double precision floating-point result in the destination operand.
The second source operand can be an XMM register or a 64-bit memory location. The first source and destination
operands are XMM registers.
128-bit Legacy SSE version: The first source and destination operands are the same. Bits (MAXVL-1:64) of the
corresponding destination register remain unchanged.
EVEX and VEX.128 encoded version: The first source operand is encoded by EVEX.vvvv/VEX.vvvv. Bits (127:64) of
the XMM register destination are copied from corresponding bits in the first source operand. Bits (MAXVL-1:128) of
the destination register are zeroed.
EVEX version: The low quadword element of the destination is updated according to the writemask.
Software should ensure VADDSD is encoded with VEX.L=0. Encoding VADDSD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
Other Exceptions
VEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Adds the low single precision floating-point values from the second source operand and the first source operand, and stores the single precision floating-point result in the destination operand.
The second source operand can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers.
128-bit Legacy SSE version: The first source and destination operands are the same. Bits (MAXVL-1:32) of the corresponding destination register remain unchanged.
EVEX and VEX.128 encoded version: The first source operand is encoded by EVEX.vvvv/VEX.vvvv. Bits (127:32) of
the XMM register destination are copied from corresponding bits in the first source operand. Bits (MAXVL-1:128) of
the destination register are zeroed.
EVEX version: The low doubleword element of the destination is updated according to the writemask.
Software should ensure VADDSS is encoded with VEX.L=0. Encoding VADDSS with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.
Other Exceptions
VEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a single round of the AES decryption flow using the Equivalent Inverse Cipher, using
one/two/four (depending on vector length) 128-bit data (state) from the first source operand with one/two/four
(depending on vector length) round key(s) from the second source operand, and stores the result in the destina-
tion operand.
Use the AESDEC instruction for all but the last decryption round. For the last decryption round, use the AESDE-
CLAST instruction.
VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy
encoded versions of the instruction require that the first source operand and the destination operand are the same
and must be an XMM register.
The EVEX encoded form of this instruction does not support memory fault suppression.
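As a usage sketch (illustrative, with names assumed outside this excerpt): decrypting one 128-bit block with a 10-round AES-128 schedule, where the middle round keys are assumed to have already been converted with AESIMC for the Equivalent Inverse Cipher.

#include <wmmintrin.h>   /* AES-NI intrinsics */

/* rk[0..10]: decryption round keys in Equivalent Inverse Cipher form
   (rk[1]..rk[9] produced with _mm_aesimc_si128 from the encryption schedule). */
__m128i aes128_decrypt_block(__m128i block, const __m128i rk[11])
{
    block = _mm_xor_si128(block, rk[10]);          /* initial AddRoundKey     */
    for (int r = 9; r >= 1; r--)
        block = _mm_aesdec_si128(block, rk[r]);    /* AESDEC, rounds 1..9     */
    return _mm_aesdeclast_si128(block, rk[0]);     /* AESDECLAST, final round */
}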
Other Exceptions
See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs the last round of the AES decryption flow using the Equivalent Inverse Cipher, using
one/two/four (depending on vector length) 128-bit data (state) from the first source operand with one/two/four
(depending on vector length) round key(s) from the second source operand, and stores the result in the destina-
tion operand.
VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy
encoded versions of the instruction require that the first source operand and the destination operand are the same
and must be an XMM register.
The EVEX encoded form of this instruction does not support memory fault suppression.
Other Exceptions
See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a single round of an AES encryption flow using one/two/four (depending on vector
length) 128-bit data (state) from the first source operand with one/two/four (depending on vector length) round
key(s) from the second source operand, and stores the result in the destination operand.
Use the AESENC instruction for all but the last encryption round. For the last encryption round, use the AESENCLAST instruction.
VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy
encoded versions of the instruction require that the first source operand and the destination operand are the same
and must be an XMM register.
The EVEX encoded form of this instruction does not support memory fault suppression.
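For illustration only, a minimal C sketch of the encryption flow using the AES-NI intrinsics (the function name and key-schedule layout are assumptions, not part of this excerpt):

#include <wmmintrin.h>

/* rk[0..10]: AES-128 encryption round keys (key schedule assumed precomputed). */
__m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
{
    block = _mm_xor_si128(block, rk[0]);           /* initial AddRoundKey     */
    for (int r = 1; r <= 9; r++)
        block = _mm_aesenc_si128(block, rk[r]);    /* AESENC, rounds 1..9     */
    return _mm_aesenclast_si128(block, rk[10]);    /* AESENCLAST, final round */
}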
Other Exceptions
See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs the last round of an AES encryption flow using one/two/four (depending on vector length)
128-bit data (state) from the first source operand with one/two/four (depending on vector length) round key(s)
from the second source operand, and stores the result in the destination operand.
VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy
encoded versions of the instruction require that the first source operand and the destination operand are the same
and must be an XMM register.
The EVEX encoded form of this instruction does not support memory fault suppression.
Other Exceptions
See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a bitwise logical AND NOT of the two, four or eight packed double precision floating-point values from the
first source operand and the second source operand, and stores the result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the
corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
Operation
VANDNPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := (NOT(SRC1[i+63:i])) BITWISE AND SRC2[63:0]
ELSE
DEST[i+63:i] := (NOT(SRC1[i+63:i])) BITWISE AND SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
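A common use of the AND NOT operation, shown here as an illustrative sketch with the standard SSE2 intrinsic (names assumed, not from this excerpt), is clearing the sign bits of packed doubles to take a per-lane absolute value:

#include <immintrin.h>

/* ANDNPD computes (NOT src1) AND src2; with src1 = -0.0 (only the sign bit set),
   the result is src2 with its sign bits cleared, i.e., fabs() per lane. */
__m128d abs_pd(__m128d x)
{
    return _mm_andnot_pd(_mm_set1_pd(-0.0), x);    /* ANDNPD / VANDNPD */
}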
Other Exceptions
VEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
ANDNPS—Bitwise Logical AND NOT of Packed Single Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F 55 /r A V/V SSE Return the bitwise logical AND NOT of packed single
ANDNPS xmm1, xmm2/m128 precision floating-point values in xmm1 and xmm2/mem.
VEX.128.0F 55 /r B V/V AVX Return the bitwise logical AND NOT of packed single
VANDNPS xmm1, xmm2, precision floating-point values in xmm2 and xmm3/mem.
xmm3/m128
VEX.256.0F 55 /r B V/V AVX Return the bitwise logical AND NOT of packed single
VANDNPS ymm1, ymm2, precision floating-point values in ymm2 and ymm3/mem.
ymm3/m256
EVEX.128.0F.W0 55 /r C V/V (AVX512VL AND Return the bitwise logical AND NOT of packed single
VANDNPS xmm1 {k1}{z}, xmm2, AVX512DQ) OR precision floating-point values in xmm2 and xmm3/m128/m32bcst
xmm3/m128/m32bcst AVX10.11 subject to writemask k1.
EVEX.256.0F.W0 55 /r C V/V (AVX512VL AND Return the bitwise logical AND NOT of packed single
VANDNPS ymm1 {k1}{z}, ymm2, AVX512DQ) OR precision floating-point values in ymm2 and ymm3/m256/m32bcst
ymm3/m256/m32bcst AVX10.11 subject to writemask k1.
EVEX.512.0F.W0 55 /r C V/V AVX512DQ Return the bitwise logical AND NOT of packed single
VANDNPS zmm1 {k1}{z}, zmm2, OR AVX10.11 precision floating-point values in zmm2 and zmm3/m512/m32bcst
zmm3/m512/m32bcst subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a bitwise logical AND NOT of the four, eight or sixteen packed single precision floating-point values from
the first source operand and the second source operand, and stores the result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the
corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.
Operation
VANDNPS (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := (NOT(SRC1[i+31:i])) BITWISE AND SRC2[31:0]
ELSE
DEST[i+31:i] := (NOT(SRC1[i+31:i])) BITWISE AND SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VANDNPS __m512 _mm512_andnot_ps (__m512 a, __m512 b);
VANDNPS __m512 _mm512_mask_andnot_ps (__m512 s, __mmask16 k, __m512 a, __m512 b);
VANDNPS __m512 _mm512_maskz_andnot_ps (__mmask16 k, __m512 a, __m512 b);
VANDNPS __m256 _mm256_mask_andnot_ps (__m256 s, __mmask8 k, __m256 a, __m256 b);
VANDNPS __m256 _mm256_maskz_andnot_ps (__mmask8 k, __m256 a, __m256 b);
VANDNPS __m128 _mm_mask_andnot_ps (__m128 s, __mmask8 k, __m128 a, __m128 b);
VANDNPS __m128 _mm_maskz_andnot_ps (__mmask8 k, __m128 a, __m128 b);
VANDNPS __m256 _mm256_andnot_ps (__m256 a, __m256 b);
ANDNPS __m128 _mm_andnot_ps (__m128 a, __m128 b);
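As an illustrative sketch (not part of this excerpt), ANDNPS combined with ANDPS and ORPS implements a branchless per-lane select, the usual idiom on processors without a blend instruction; the mask is assumed to be all-1s or all-0s per lane (for example, produced by CMPPS):

#include <immintrin.h>

/* Per-lane select: returns a where the mask lane is all-1s, b where it is all-0s. */
__m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a),          /* ANDPS: keep a where mask set    */
                     _mm_andnot_ps(mask, b));      /* ANDNPS: keep b where mask clear */
}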
Other Exceptions
VEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
ANDPD—Bitwise Logical AND of Packed Double Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F 54 /r A V/V SSE2 Return the bitwise logical AND of packed double
ANDPD xmm1, xmm2/m128 precision floating-point values in xmm1 and
xmm2/mem.
VEX.128.66.0F 54 /r B V/V AVX Return the bitwise logical AND of packed double
VANDPD xmm1, xmm2, xmm3/m128 precision floating-point values in xmm2 and
xmm3/mem.
VEX.256.66.0F 54 /r B V/V AVX Return the bitwise logical AND of packed double
VANDPD ymm1, ymm2, ymm3/m256 precision floating-point values in ymm2 and
ymm3/mem.
EVEX.128.66.0F.W1 54 /r C V/V (AVX512VL AND Return the bitwise logical AND of packed double
VANDPD xmm1 {k1}{z}, xmm2, AVX512DQ) OR precision floating-point values in xmm2 and
xmm3/m128/m64bcst AVX10.11 xmm3/m128/m64bcst subject to writemask k1.
EVEX.256.66.0F.W1 54 /r C V/V (AVX512VL AND Return the bitwise logical AND of packed double
VANDPD ymm1 {k1}{z}, ymm2, AVX512DQ) OR precision floating-point values in ymm2 and
ymm3/m256/m64bcst AVX10.11 ymm3/m256/m64bcst subject to writemask k1.
EVEX.512.66.0F.W1 54 /r C V/V AVX512DQ Return the bitwise logical AND of packed double
VANDPD zmm1 {k1}{z}, zmm2, OR AVX10.11 precision floating-point values in zmm2 and
zmm3/m512/m64bcst zmm3/m512/m64bcst subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a bitwise logical AND of the two, four or eight packed double precision floating-point values from the first
source operand and the second source operand, and stores the result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the
corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
Operation
VANDPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := SRC1[i+63:i] BITWISE AND SRC2[63:0]
ELSE
DEST[i+63:i] := SRC1[i+63:i] BITWISE AND SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
ANDPS—Bitwise Logical AND of Packed Single Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F 54 /r A V/V SSE Return the bitwise logical AND of packed single precision
ANDPS xmm1, xmm2/m128 floating-point values in xmm1 and xmm2/mem.
VEX.128.0F 54 /r B V/V AVX Return the bitwise logical AND of packed single precision
VANDPS xmm1,xmm2, floating-point values in xmm2 and xmm3/mem.
xmm3/m128
VEX.256.0F 54 /r B V/V AVX Return the bitwise logical AND of packed single precision
VANDPS ymm1, ymm2, floating-point values in ymm2 and ymm3/mem.
ymm3/m256
EVEX.128.0F.W0 54 /r C V/V (AVX512VL AND Return the bitwise logical AND of packed single precision
VANDPS xmm1 {k1}{z}, xmm2, AVX512DQ) OR floating-point values in xmm2 and xmm3/m128/m32bcst
xmm3/m128/m32bcst AVX10.11 subject to writemask k1.
EVEX.256.0F.W0 54 /r C V/V (AVX512VL AND Return the bitwise logical AND of packed single precision
VANDPS ymm1 {k1}{z}, ymm2, AVX512DQ) OR floating-point values in ymm2 and ymm3/m256/m32bcst
ymm3/m256/m32bcst AVX10.11 subject to writemask k1.
EVEX.512.0F.W0 54 /r C V/V AVX512DQ Return the bitwise logical AND of packed single precision
VANDPS zmm1 {k1}{z}, zmm2, OR AVX10.11 floating-point values in zmm2 and zmm3/m512/m32bcst
zmm3/m512/m32bcst subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a bitwise logical AND of the four, eight or sixteen packed single precision floating-point values from the
first source operand and the second source operand, and stores the result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the
corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.
Operation
VANDPS (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := SRC1[i+31:i] BITWISE AND SRC2[31:0]
ELSE
DEST[i+31:i] := SRC1[i+31:i] BITWISE AND SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0;
Intel C/C++ Compiler Intrinsic Equivalent
VANDPS __m512 _mm512_and_ps (__m512 a, __m512 b);
VANDPS __m512 _mm512_mask_and_ps (__m512 s, __mmask16 k, __m512 a, __m512 b);
VANDPS __m512 _mm512_maskz_and_ps (__mmask16 k, __m512 a, __m512 b);
VANDPS __m256 _mm256_mask_and_ps (__m256 s, __mmask8 k, __m256 a, __m256 b);
VANDPS __m256 _mm256_maskz_and_ps (__mmask8 k, __m256 a, __m256 b);
VANDPS __m128 _mm_mask_and_ps (__m128 s, __mmask8 k, __m128 a, __m128 b);
VANDPS __m128 _mm_maskz_and_ps (__mmask8 k, __m128 a, __m128 b);
VANDPS __m256 _mm256_and_ps (__m256 a, __m256 b);
ANDPS __m128 _mm_and_ps (__m128 a, __m128 b);
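For illustration only (names assumed, not from this excerpt), ANDPS and ANDNPS together implement a packed copysign: the sign bits are extracted from one operand with ANDPS and combined with the magnitude of the other:

#include <immintrin.h>

/* Returns |mag| with the sign of sgn, per lane. Note -0.0f has only the sign bit set. */
__m128 copysign_ps(__m128 mag, __m128 sgn)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    return _mm_or_ps(_mm_and_ps(sign_mask, sgn),      /* ANDPS: isolate sign bits   */
                     _mm_andnot_ps(sign_mask, mag));  /* ANDNPS: clear sign of mag  */
}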
Other Exceptions
VEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
BSF—Bit Scan Forward
Opcode Instruction Op/ 64-bit Compat/ Description
En Mode Leg Mode
0F BC /r BSF r16, r/m16 RM Valid Valid Bit scan forward on r/m16.
0F BC /r BSF r32, r/m32 RM Valid Valid Bit scan forward on r/m32.
REX.W + 0F BC /r BSF r64, r/m64 RM Valid N.E. Bit scan forward on r/m64.
Description
Searches the source operand (second operand) for the least significant set bit (1 bit). If a least significant 1 bit is
found, its bit index is stored in the destination operand (first operand). The source operand can be a register or a
memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source
operand. If the content of the source operand is zero, the destination operand is unmodified.1
In 64-bit mode, the instruction’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits
access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See
the summary chart at the beginning of this section for encoding data and limits.
Operation
IF SRC = 0
THEN
ZF := 1;
DEST is undefined;
ELSE
ZF := 0;
temp := 0;
WHILE Bit(SRC, temp) = 0
DO
temp := temp + 1;
OD;
DEST := temp;
FI;
Flags Affected
The ZF flag is set to 1 if the source operand is 0; otherwise, the ZF flag is cleared. The CF, OF, SF, AF, and PF flags
are undefined.
1. On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower
32 bits unmodified.
Description
Searches the source operand (second operand) for the most significant set bit (1 bit). If a most significant 1 bit is
found, its bit index is stored in the destination operand (first operand). The source operand can be a register or a
memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source
operand. If the content of the source operand is zero, the destination operand is unmodified.1
In 64-bit mode, the instruction’s default operation size is 32 bits. Using a REX prefix in the form of REX.R permits
access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits. See
the summary chart at the beginning of this section for encoding data and limits.
Operation
IF SRC = 0
THEN
ZF := 1;
DEST is undefined;
ELSE
ZF := 0;
temp := OperandSize – 1;
WHILE Bit(SRC, temp) = 0
DO
temp := temp - 1;
OD;
DEST := temp;
FI;
Flags Affected
The ZF flag is set to 1 if the source operand is 0; otherwise, the ZF flag is cleared. The CF, OF, SF, AF, and PF flags
are undefined.
1. On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower
32 bits unmodified.
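As an informal illustration (not part of this excerpt), the forward and reverse bit scans correspond to finding the lowest and highest set bit; the C sketch below uses the GCC/Clang builtins __builtin_ctz and __builtin_clz, which compilers typically lower to BSF/BSR or TZCNT/LZCNT. Both builtins require a nonzero input, consistent with the special-casing of a zero source described above.

#include <stdint.h>

/* Index of the least significant set bit (BSF-style) and of the most
   significant set bit (BSR-style). Results are meaningful only for x != 0. */
static inline unsigned bit_scan_forward(uint32_t x) { return (unsigned)__builtin_ctz(x); }
static inline unsigned bit_scan_reverse(uint32_t x) { return 31u - (unsigned)__builtin_clz(x); }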
Opcode/Instruction Op/En 64/32 bit Mode Support CPUID Feature Flag Description
VEX.128.66.0F38.W0 E6 !(11):rrr:bbb CMPBEXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If below or equal (CF=1 or ZF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E6 !(11):rrr:bbb CMPBEXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If below or equal (CF=1 or ZF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E2 !(11):rrr:bbb CMPBXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If below (CF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E2 !(11):rrr:bbb CMPBXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If below (CF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 EE !(11):rrr:bbb CMPLEXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If less or equal (ZF=1 or SF≠OF), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 EE !(11):rrr:bbb CMPLEXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If less or equal (ZF=1 or SF≠OF), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 EC !(11):rrr:bbb CMPLXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If less (SF≠OF), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 EC !(11):rrr:bbb CMPLXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If less (SF≠OF), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E7 !(11):rrr:bbb CMPNBEXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If not below or equal (CF=0 and ZF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E3 !(11):rrr:bbb CMPNBXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If not below (CF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 EF !(11):rrr:bbb CMPNLEXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If not less or equal (ZF=0 and SF=OF), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 EF !(11):rrr:bbb CMPNLEXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If not less or equal (ZF=0 and SF=OF), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 ED !(11):rrr:bbb CMPNLXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If not less (SF=OF), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 ED !(11):rrr:bbb CMPNLXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If not less (SF=OF), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E1 !(11):rrr:bbb CMPNOXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If not overflow (OF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E1 !(11):rrr:bbb CMPNOXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If not overflow (OF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 EB !(11):rrr:bbb CMPNPXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If not parity (PF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 EB !(11):rrr:bbb CMPNPXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If not parity (PF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E9 !(11):rrr:bbb CMPNSXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If not sign (SF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E9 !(11):rrr:bbb CMPNSXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If not sign (SF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E5 !(11):rrr:bbb CMPNZXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If not zero (ZF=0), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E5 !(11):rrr:bbb CMPNZXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If not zero (ZF=0), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E0 !(11):rrr:bbb CMPOXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If overflow (OF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E0 !(11):rrr:bbb CMPOXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If overflow (OF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 EA !(11):rrr:bbb CMPPXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If parity (PF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 EA !(11):rrr:bbb CMPPXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If parity (PF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E8 !(11):rrr:bbb CMPSXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If sign (SF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E8 !(11):rrr:bbb CMPSXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If sign (SF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
VEX.128.66.0F38.W0 E4 !(11):rrr:bbb CMPZXADD m32, r32, r32  A  V/N.E.  CMPCCXADD
    Compare value in r32 (second operand) with value in m32. If zero (ZF=1), add value from r32 (third operand) to m32 and write new value in m32. The second operand is always updated with the original value from m32.
VEX.128.66.0F38.W1 E4 !(11):rrr:bbb CMPZXADD m64, r64, r64  A  V/N.E.  CMPCCXADD
    Compare value in r64 (second operand) with value in m64. If zero (ZF=1), add value from r64 (third operand) to m64 and write new value in m64. The second operand is always updated with the original value from m64.
Description
This instruction compares the value from memory with the value of the second operand. If the specified condition
is met, then the processor will add the third operand to the memory operand and write it into memory, else the
memory is unchanged by this instruction.
This instruction must have MODRM.MOD equal to 0, 1, or 2. The value 3 for MODRM.MOD is reserved and will cause
an invalid opcode exception (#UD).
The second operand is always updated with the original value of the memory operand. The EFLAGS condition flags are updated from the result of the comparison. The instruction uses an implicit lock; it does not permit the use of an explicit LOCK prefix.
Operation
CMPCCXADD srcdest1, srcdest2, src3
tmp1 := load lock srcdest1
tmp2 := tmp1 + src3
EFLAGS.CF,OF,SF,ZF,AF,PF := CMP tmp1, srcdest2
IF <condition>:
    srcdest1 := store unlock tmp2
ELSE
    srcdest1 := store unlock tmp1
srcdest2 := tmp1
1. ModRM.MOD != 011B
Exceptions
Exceptions Type 14; see Table 2-31.
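To make the flow above concrete, a plain-C model of the semantics follows. It is illustrative only: it is not atomic (unlike the instruction's implicit lock), uses no intrinsic, and the condition shown, below or equal, corresponds to CMPBEXADD.

#include <stdint.h>
#include <stdbool.h>

/* Models CMPBEXADD m32, r32, r32 (non-atomic sketch):
   compare *mem with *reg2; if *mem <= *reg2 (unsigned), add reg3 into memory.
   The second operand always receives the original memory value. */
static void cmpbexadd_model(uint32_t *mem, uint32_t *reg2, uint32_t reg3)
{
    uint32_t original = *mem;                   /* tmp1 := load srcdest1        */
    uint32_t sum      = original + reg3;        /* tmp2 := tmp1 + src3          */
    bool below_or_equal = (original <= *reg2);  /* CF=1 or ZF=1 after CMP       */
    *mem  = below_or_equal ? sum : original;    /* conditional store            */
    *reg2 = original;                           /* srcdest2 := tmp1 (always)    */
}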
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed double precision floating-point values in the second source operand and
the first source operand and returns the result of the comparison to the destination operand. The comparison pred-
icate operand (immediate byte) specifies the type of comparison performed on each pair of packed values in the
two source operands.
EVEX encoded versions: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand (first operand) is an opmask register.
Comparison results are written to the destination operand under the writemask k2. Each comparison result is a
single mask bit of 1 (comparison true) or 0 (comparison false).
VEX.256 encoded version: The first source operand (second operand) is a YMM register. The second source operand (third operand) can be a YMM register or a 256-bit memory location. The destination operand (first operand) is a YMM register.
Predicate   imm8   Description                                           A>B     A<B     A=B     Unordered1   QNaN Operand Signals Invalid
EQ_US       18H    Equal (unordered, signaling)                          False   False   True    True         Yes
NGE_UQ      19H    Not-greater-than-or-equal (unordered, nonsignaling)   False   True    False   True         No
NGT_UQ      1AH    Not-greater-than (unordered, nonsignaling)            False   True    True    True         No
FALSE_OS    1BH    False (ordered, signaling)                            False   False   False   False        Yes
NEQ_OS      1CH    Not-equal (ordered, signaling)                        True    True    False   False        Yes
GE_OQ       1DH    Greater-than-or-equal (ordered, nonsignaling)         True    False   True    False        No
GT_OQ       1EH    Greater-than (ordered, nonsignaling)                  True    False   False   False        No
TRUE_US     1FH    True (unordered, signaling)                           True    True    True    True         Yes
NOTES:
1. If either operand A or B is a NaN.
The unordered relationship is true when at least one of the two source operands being compared is a NaN; the
ordered relationship is true when neither source operand is a NaN.
A subsequent computational instruction that uses the mask result in the destination operand as an input operand
will not generate an exception, because a mask of all 0s corresponds to a floating-point value of +0.0 and a mask
of all 1s corresponds to a QNaN.
Note that processors with “CPUID.1H:ECX.AVX =0” do not implement the “greater-than”, “greater-than-or-equal”, “not-greater-than”, and “not-greater-than-or-equal” predicates. These comparisons can be made either
by using the inverse relationship (that is, use the “not-less-than-or-equal” to make a “greater-than” comparison)
or by using software emulation. When using software emulation, the program must swap the operands (copying
registers when necessary to protect the data that will now be in the destination), and then perform the compare
using a different predicate. The predicate to be used for these emulations is listed in the first 8 rows of Table 3-7
(Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A) under the heading Emulation.
Compilers and assemblers may implement the following two-operand pseudo-ops in addition to the three-operand
CMPPD instruction, for processors with “CPUID.1H:ECX.AVX =0”. See Table 3-9. The compiler should treat
reserved imm8 values as illegal syntax.
Table 3-9. Pseudo-Op and CMPPD Implementation
:
The greater-than relations that the processor does not implement require more than one instruction to emulate in software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the operands of the corresponding less than relations and use move instructions to ensure that the mask is moved to the correct destination register and that the source operand is left intact.)
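As an illustrative sketch of the emulation just described (the function names are assumptions, not part of this excerpt): with the VEX-encoded form the greater-than predicate can be used directly through _mm_cmp_pd, while the legacy two-operand CMPPD obtains greater-than by swapping the operands of a less-than compare.

#include <immintrin.h>

/* AVX (VEX-encoded VCMPPD): use the greater-than predicate directly. */
__m128d cmp_gt_avx(__m128d a, __m128d b)
{
    return _mm_cmp_pd(a, b, _CMP_GT_OQ);
}

/* Legacy SSE2 CMPPD: only the first eight predicates exist, so emulate
   a > b by computing b < a with the operands swapped. */
__m128d cmp_gt_sse2(__m128d a, __m128d b)
{
    return _mm_cmplt_pd(b, a);
}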
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed single precision floating-point values in the second source operand and
the first source operand and returns the result of the comparison to the destination operand. The comparison pred-
icate operand (immediate byte) specifies the type of comparison performed on each of the pairs of packed values.
EVEX encoded versions: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand (first operand) is an opmask register.
Comparison results are written to the destination operand under the writemask k2. Each comparison result is a
single mask bit of 1 (comparison true) or 0 (comparison false).
VEX.256 encoded version: The first source operand (second operand) is a YMM register. The second source
operand (third operand) can be a YMM register or a 256-bit memory location. The destination operand (first
operand) is a YMM register. Eight comparisons are performed with results written to the destination operand. The
result of each comparison is a doubleword mask of all 1s (comparison true) or all 0s (comparison false).
128-bit Legacy SSE version: The first source and destination operand (first operand) is an XMM register. The
second source operand (second operand) can be an XMM register or 128-bit memory location. Bits (MAXVL-1:128) of the corresponding ZMM destination register remain unchanged.
The greater-than relations that the processor does not implement require more than one instruction to emulate in
software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the
operands of the corresponding less than relations and use move instructions to ensure that the mask is moved to
the correct destination register and that the source operand is left intact.)
Processors with “CPUID.1H:ECX.AVX =1” implement the full complement of 32 predicates shown in Table 3-12; software emulation is no longer needed. Compilers and assemblers may implement the following three-operand pseudo-ops in addition to the four-operand VCMPPS instruction. See Table 3-12, where the notations reg1 and reg2 represent either XMM registers or YMM registers. The compiler should treat reserved imm8 values as illegal
syntax. Alternately, intrinsics can map the pseudo-ops to pre-defined constants to support a simpler intrinsic inter-
face. Compilers and assemblers may implement three-operand pseudo-ops for EVEX encoded VCMPPS instructions
in a similar fashion by extending the syntax listed in Table 3-12.
:
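For illustration only (names assumed, not from this excerpt), a typical use of the per-lane all-1s/all-0s mask produced by CMPPS/VCMPPS is to drive a branchless select, as in the sketch below, which clamps negative lanes to zero:

#include <immintrin.h>

/* Replace negative lanes of x with 0.0f using a compare mask and ANDPS. */
__m128 clamp_negative_to_zero(__m128 x)
{
    __m128 mask = _mm_cmpge_ps(x, _mm_setzero_ps());   /* all-1s where x >= 0 */
    return _mm_and_ps(mask, x);                        /* zero out negatives  */
}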
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compares the low double precision floating-point values in the second source operand and the first source operand
and returns the result of the comparison to the destination operand. The comparison predicate operand (imme-
diate operand) specifies the type of comparison performed.
128-bit Legacy SSE version: The first source and destination operand (first operand) is an XMM register. The
second source operand (second operand) can be an XMM register or 64-bit memory location. Bits (MAXVL-1:64) of
the corresponding YMM destination register remain unchanged. The comparison result is a quadword mask of all 1s
(comparison true) or all 0s (comparison false).
VEX.128 encoded version: The first source operand (second operand) is an XMM register. The second source
operand (third operand) can be an XMM register or a 64-bit memory location. The result is stored in the low quad-
word of the destination operand; the high quadword is filled with the contents of the high quadword of the first
source operand. Bits (MAXVL-1:128) of the destination ZMM register are zeroed. The comparison result is a quad-
word mask of all 1s (comparison true) or all 0s (comparison false).
EVEX encoded version: The first source operand (second operand) is an XMM register. The second source operand
can be a XMM register or a 64-bit memory location. The destination operand (first operand) is an opmask register.
The comparison result is a single mask bit of 1 (comparison true) or 0 (comparison false), written to the destination
starting from the LSB according to the writemask k2. Bits (MAX_KL-1:128) of the destination register are cleared.
The comparison predicate operand is an 8-bit immediate:
• For instructions encoded using the VEX prefix, bits 4:0 define the type of comparison to be performed (see
Table 3-8). Bits 5 through 7 of the immediate are reserved.
• For instruction encodings that do not use VEX prefix, bits 2:0 define the type of comparison to be made (see
the first 8 rows of Table 3-8). Bits 3 through 7 of the immediate are reserved.
The unordered relationship is true when at least one of the two source operands being compared is a NaN; the
ordered relationship is true when neither source operand is a NaN.
The greater-than relations that the processor does not implement require more than one instruction to emulate in
software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the
operands of the corresponding less than relations and use move instructions to ensure that the mask is moved to
the correct destination register and that the source operand is left intact.)
Processors with “CPUID.1H:ECX.AVX =1” implement the full complement of 32 predicates shown in Table 3-14; software emulation is no longer needed. Compilers and assemblers may implement the following three-operand pseudo-ops in addition to the four-operand VCMPSD instruction. See Table 3-14, where the notations reg1, reg2,
and reg3 represent either XMM registers or YMM registers. The compiler should treat reserved imm8 values as
illegal syntax. Alternately, intrinsics can map the pseudo-ops to pre-defined constants to support a simpler intrinsic
interface. Compilers and assemblers may implement three-operand pseudo-ops for EVEX encoded VCMPSD
instructions in a similar fashion by extending the syntax listed in Table 3-14.
Table 3-14. Pseudo-Op and VCMPSD Implementation
:
Software should ensure VCMPSD is encoded with VEX.L=0. Encoding VCMPSD with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.
Operation
CASE (COMPARISON PREDICATE) OF
0: OP3 := EQ_OQ; OP5 := EQ_OQ;
1: OP3 := LT_OS; OP5 := LT_OS;
2: OP3 := LE_OS; OP5 := LE_OS;
3: OP3 := UNORD_Q; OP5 := UNORD_Q;
4: OP3 := NEQ_UQ; OP5 := NEQ_UQ;
5: OP3 := NLT_US; OP5 := NLT_US;
6: OP3 := NLE_US; OP5 := NLE_US;
7: OP3 := ORD_Q; OP5 := ORD_Q;
8: OP5 := EQ_UQ;
9: OP5 := NGE_US;
10: OP5 := NGT_US;
11: OP5 := FALSE_OQ;
12: OP5 := NEQ_OQ;
13: OP5 := GE_OS;
14: OP5 := GT_OS;
15: OP5 := TRUE_UQ;
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compares the low single precision floating-point values in the second source operand and the first source operand
and returns the result of the comparison to the destination operand. The comparison predicate operand (imme-
diate operand) specifies the type of comparison performed.
128-bit Legacy SSE version: The first source and destination operand (first operand) is an XMM register. The
second source operand (second operand) can be an XMM register or 32-bit memory location. Bits (MAXVL-1:32) of
the corresponding YMM destination register remain unchanged. The comparison result is a doubleword mask of all
1s (comparison true) or all 0s (comparison false).
VEX.128 encoded version: The first source operand (second operand) is an XMM register. The second source
operand (third operand) can be an XMM register or a 32-bit memory location. The result is stored in the low 32 bits
of the destination operand; bits 127:32 of the destination operand are copied from the first source operand. Bits
(MAXVL-1:128) of the destination ZMM register are zeroed. The comparison result is a doubleword mask of all 1s
(comparison true) or all 0s (comparison false).
EVEX encoded version: The first source operand (second operand) is an XMM register. The second source operand
can be a XMM register or a 32-bit memory location. The destination operand (first operand) is an opmask register.
The comparison result is a single mask bit of 1 (comparison true) or 0 (comparison false), written to the destination
starting from the LSB according to the writemask k2. Bits (MAX_KL-1:128) of the destination register are cleared.
The comparison predicate operand is an 8-bit immediate:
• For instructions encoded using the VEX prefix, bits 4:0 define the type of comparison to be performed (see
Table 3-8). Bits 5 through 7 of the immediate are reserved.
• For instruction encodings that do not use VEX prefix, bits 2:0 define the type of comparison to be made (see
the first 8 rows of Table 3-8). Bits 3 through 7 of the immediate are reserved.
The unordered relationship is true when at least one of the two source operands being compared is a NaN; the
ordered relationship is true when neither source operand is a NaN.
The greater-than relations that the processor does not implement require more than one instruction to emulate in
software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the
operands of the corresponding less than relations and use move instructions to ensure that the mask is moved to
the correct destination register and that the source operand is left intact.)
Processors with “CPUID.1H:ECX.AVX =1” implement the full complement of 32 predicates shown in Table 3-14; software emulation is no longer needed. Compilers and assemblers may implement the following three-operand pseudo-ops in addition to the four-operand VCMPSS instruction. See Table 3-16, where the notations reg1, reg2,
and reg3 represent either XMM registers or YMM registers. The compiler should treat reserved imm8 values as
illegal syntax. Alternately, intrinsics can map the pseudo-ops to pre-defined constants to support a simpler intrinsic
interface. Compilers and assemblers may implement three-operand pseudo-ops for EVEX encoded VCMPSS
instructions in a similar fashion by extending the syntax listed in Table 3-16.
Table 3-16. Pseudo-Op and VCMPSS Implementation
:
Software should ensure VCMPSS is encoded with VEX.L=0. Encoding VCMPSS with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.
Operation
CASE (COMPARISON PREDICATE) OF
0: OP3 := EQ_OQ; OP5 := EQ_OQ;
1: OP3 := LT_OS; OP5 := LT_OS;
2: OP3 := LE_OS; OP5 := LE_OS;
3: OP3 := UNORD_Q; OP5 := UNORD_Q;
4: OP3 := NEQ_UQ; OP5 := NEQ_UQ;
5: OP3 := NLT_US; OP5 := NLT_US;
6: OP3 := NLE_US; OP5 := NLE_US;
7: OP3 := ORD_Q; OP5 := ORD_Q;
8: OP5 := EQ_UQ;
9: OP5 := NGE_US;
10: OP5 := NGT_US;
11: OP5 := FALSE_OQ;
12: OP5 := NEQ_OQ;
13: OP5 := GE_OS;
14: OP5 := GT_OS;
15: OP5 := TRUE_UQ;
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compares the double precision floating-point values in the low quadwords of operand 1 (first operand) and operand
2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered,
greater than, less than, or equal). The OF, SF, and AF flags in the EFLAGS register are set to 0. The unordered result
is returned if either source operand is a NaN (QNaN or SNaN).
Operand 1 is an XMM register; operand 2 can be an XMM register or a 64-bit memory location. The COMISD instruc-
tion differs from the UCOMISD instruction in that it signals a SIMD floating-point invalid operation exception (#I)
when a source operand is either a QNaN or SNaN. The UCOMISD instruction signals an invalid operation exception
only if a source operand is an SNaN.
The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VCOMISD is encoded with VEX.L=0. Encoding VCOMISD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
Operation
COMISD (All Versions)
RESULT := OrderedCompare(DEST[63:0] <> SRC[63:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
UNORDERED: ZF,PF,CF := 111;
GREATER_THAN: ZF,PF,CF := 000;
LESS_THAN: ZF,PF,CF := 001;
EQUAL: ZF,PF,CF := 100;
ESAC;
OF, AF, SF := 0; }
Intel C/C++ Compiler Intrinsic Equivalent
VCOMISD int _mm_comi_round_sd(__m128d a, __m128d b, int imm, int sae);
VCOMISD int _mm_comieq_sd (__m128d a, __m128d b)
VCOMISD int _mm_comilt_sd (__m128d a, __m128d b)
VCOMISD int _mm_comile_sd (__m128d a, __m128d b)
VCOMISD int _mm_comigt_sd (__m128d a, __m128d b)
VCOMISD int _mm_comige_sd (__m128d a, __m128d b)
VCOMISD int _mm_comineq_sd (__m128d a, __m128d b)
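An illustrative sketch (names assumed, not from this excerpt) of how the EFLAGS-based result is typically consumed from C, using the intrinsics listed above; compilers generally implement _mm_comilt_sd with a COMISD/VCOMISD followed by a flag test.

#include <immintrin.h>

/* Returns the smaller of the low doubles of a and b, branching on the
   scalar ordered compare. */
double min_low_double(__m128d a, __m128d b)
{
    if (_mm_comilt_sd(a, b))
        return _mm_cvtsd_f64(a);
    return _mm_cvtsd_f64(b);
}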
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
COMISS—Compare Scalar Ordered Single Precision Floating-Point Values and Set EFLAGS
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
NP 0F 2F /r A V/V SSE Compare low single precision floating-point values in
COMISS xmm1, xmm2/m32 xmm1 and xmm2/mem32 and set the EFLAGS flags
accordingly.
VEX.LIG.0F.WIG 2F /r A V/V AVX Compare low single precision floating-point values in
VCOMISS xmm1, xmm2/m32 xmm1 and xmm2/mem32 and set the EFLAGS flags
accordingly.
EVEX.LLIG.0F.W0 2F /r B V/V AVX512F Compare low single precision floating-point values in
VCOMISS xmm1, xmm2/m32{sae} OR AVX10.11 xmm1 and xmm2/mem32 and set the EFLAGS flags
accordingly.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compares the single precision floating-point values in the low doublewords of operand 1 (first operand) and operand
2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered,
greater than, less than, or equal). The OF, SF, and AF flags in the EFLAGS register are set to 0. The unordered result
is returned if either source operand is a NaN (QNaN or SNaN).
Operand 1 is an XMM register; operand 2 can be an XMM register or a 32-bit memory location.
The COMISS instruction differs from the UCOMISS instruction in that it signals a SIMD floating-point invalid opera-
tion exception (#I) when a source operand is either a QNaN or SNaN. The UCOMISS instruction signals an invalid
operation exception only if a source operand is an SNaN.
The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VCOMISS is encoded with VEX.L=0. Encoding VCOMISS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
Operation
COMISS (All Versions)
RESULT := OrderedCompare(DEST[31:0] <> SRC[31:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
UNORDERED: ZF,PF,CF := 111;
GREATER_THAN: ZF,PF,CF := 000;
LESS_THAN: ZF,PF,CF := 001;
EQUAL: ZF,PF,CF := 100;
ESAC;
OF, AF, SF := 0; }
Intel C/C++ Compiler Intrinsic Equivalent
VCOMISS int _mm_comi_round_ss(__m128 a, __m128 b, int imm, int sae);
VCOMISS int _mm_comieq_ss (__m128 a, __m128 b)
VCOMISS int _mm_comilt_ss (__m128 a, __m128 b)
VCOMISS int _mm_comile_ss (__m128 a, __m128 b)
VCOMISS int _mm_comigt_ss (__m128 a, __m128 b)
VCOMISS int _mm_comige_ss (__m128 a, __m128 b)
VCOMISS int _mm_comineq_ss (__m128 a, __m128 b)
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
CPUID—CPU Identification
Opcode, Instruction (Op/En, 64-Bit Mode, Compat/Leg Mode): Description
• 0F A2, CPUID (ZO, Valid, Valid): Returns processor identification and feature information to the EAX, EBX, ECX, and EDX registers, as determined by input entered in EAX (in some cases, ECX as well).
Description
The ID flag (bit 21) in the EFLAGS register indicates support for the CPUID instruction. If a software procedure can
set and clear this flag, the processor executing the procedure supports the CPUID instruction. This instruction
operates the same in non-64-bit modes and 64-bit mode.
CPUID returns processor identification and feature information in the EAX, EBX, ECX, and EDX registers.1 The
instruction’s output is dependent on the contents of the EAX register upon execution (in some cases, ECX as well).
For example, the following pseudocode loads EAX with 00H and causes CPUID to return a Maximum Return Value
and the Vendor Identification String in the appropriate registers:
MOV EAX, 00H
CPUID
Table 3-17 shows information returned, depending on the initial value loaded into the EAX register.
Two types of information are returned: basic and extended function information. If a value entered for CPUID.EAX
is higher than the maximum input value for basic or extended function for that processor then the data for the
highest basic information leaf is returned. For example, using some Intel processors, the following is true:
CPUID.EAX = 05H (* Returns MONITOR/MWAIT leaf. *)
CPUID.EAX = 0AH (* Returns Architectural Performance Monitoring leaf. *)
CPUID.EAX = 0BH (* Returns Extended Topology Enumeration leaf. *)2
CPUID.EAX = 1FH (* Returns V2 Extended Topology Enumeration leaf. *)2
CPUID.EAX = 80000008H (* Returns linear/physical address size data. *)
CPUID.EAX = 8000000AH (* INVALID: Returns same information as CPUID.EAX = 0BH. *)
If a value entered for CPUID.EAX is less than or equal to the maximum input value and the leaf is not supported on
that processor then 0 is returned in all the registers.
When CPUID returns the highest basic leaf information as a result of an invalid input EAX value, any dependence
on input ECX value in the basic leaf is honored.
CPUID can be executed at any privilege level to serialize instruction execution. Serializing instruction execution
guarantees that any modifications to flags, registers, and memory for previous instructions are completed before
the next instruction is fetched and executed.
See also:
“Serializing Instructions” in Chapter 10, “Multiple-Processor Management,” in the Intel® 64 and IA-32 Architec-
tures Software Developer’s Manual, Volume 3A.
“Caching Translation Information” in Chapter 4, “Linear-Address Pre-Processing,” in the Intel® 64 and IA-32 Archi-
tectures Software Developer’s Manual, Volume 3A.
1. On Intel 64 processors, CPUID clears the high 32 bits of the RAX/RBX/RCX/RDX registers in all modes.
2. CPUID leaf 1FH is a preferred superset to leaf 0BH. Intel recommends first checking for the existence of CPUID leaf 1FH before
using leaf 0BH.
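As a minimal sketch (assuming GCC or Clang, which provide __get_cpuid in <cpuid.h>), the following C program executes CPUID with EAX = 0 and reassembles the vendor identification string from EBX, EDX, and ECX:
#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    char vendor[13];

    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 1;

    /* The string is laid out EBX:EDX:ECX, e.g., "GenuineIntel". */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';

    printf("Maximum basic leaf: %08XH\n", eax);
    printf("Vendor ID: %s\n", vendor);
    return 0;
}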
EBX[19:00]: Bits 51:32 of the physical address of the base of the EPC section.
EBX[31:20]: Reserved.
EDX[19:00]: Bits 51:32 of the size of the corresponding EPC section within the Processor Reserved
Memory.
EDX[31:20]: Reserved.
Intel® Processor Trace Enumeration Main Leaf (Initial EAX Value = 14H, ECX = 0)
14H NOTES:
Leaf 14H main leaf (ECX = 0).
EAX Bits 31-00: Reports the maximum sub-leaf supported in leaf 14H.
EBX Bit 00: If 1, indicates that IA32_RTIT_CTL.CR3Filter can be set to 1, and that IA32_RTIT_CR3_MATCH MSR
can be accessed.
Bit 01: If 1, indicates support of Configurable PSB and Cycle-Accurate Mode.
Bit 02: If 1, indicates support of IP Filtering, TraceStop filtering, and preservation of Intel PT MSRs across
warm reset.
Bit 03: If 1, indicates support of MTC timing packet and suppression of COFI-based packets.
Bit 04: If 1, indicates support of PTWRITE. Writes can set IA32_RTIT_CTL[12] (PTWEn) and
IA32_RTIT_CTL[5] (FUPonPTW), and PTWRITE can generate packets.
Bit 05: If 1, indicates support of Power Event Trace. Writes can set IA32_RTIT_CTL[4] (PwrEvtEn),
enabling Power Event Trace packet generation.
Bit 06: If 1, indicates support for PSB and PMI preservation. Writes can set IA32_RTIT_CTL[56] (InjectPsbPmiOnEnable), enabling the processor to set IA32_RTIT_STATUS[7] (PendTopaPMI) and/or IA32_RTIT_STATUS[6] (PendPSB) in order to preserve ToPA PMIs and/or PSBs otherwise lost due to Intel PT disable. Writes can also set PendToPAPMI and PendPSB.
Bit 07: If 1, writes can set IA32_RTIT_CTL[31] (EventEn), enabling Event Trace packet generation.
Bit 08: If 1, writes can set IA32_RTIT_CTL[55] (DisTNT), disabling TNT packet generation.
Bits 31-09: Reserved.
While a processor may support the Processor Frequency Information leaf, fields that return a value of zero
are not supported.
System-On-Chip Vendor Attribute Enumeration Main Leaf (Initial EAX Value = 17H, ECX = 0)
17H NOTES:
Leaf 17H main leaf (ECX = 0).
Leaf 17H output depends on the initial value in ECX.
Leaf 17H sub-leaves 1 through 3 report the SOC Vendor Brand String.
Leaf 17H is valid if MaxSOCID_Index >= 3.
Leaf 17H sub-leaves 4 and above are reserved.
EAX Bits 31-00: MaxSOCID_Index. Reports the maximum input value of supported sub-leaf in leaf 17H.
EBX Bits 15-00: SOC Vendor ID.
Bit 16: IsVendorScheme. If 1, the SOC Vendor ID field is assigned via an industry standard enumeration
scheme. Otherwise, the SOC Vendor ID field is assigned by Intel.
Bits 31-17: Reserved = 0.
ECX Bits 31-00: Project ID. A unique number an SOC vendor assigns to its SOC projects.
EDX Bits 31-00: Stepping ID. A unique number within an SOC project that an SOC vendor assigns.
System-On-Chip Vendor Attribute Enumeration Sub-leaf (Initial EAX Value = 17H, ECX = 1..3)
17H EAX Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.
EBX Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.
ECX Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.
EDX Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.
NOTES:
Leaf 17H output depends on the initial value in ECX.
SOC Vendor Brand String is a UTF-8 encoded string padded with trailing bytes of 00H.
The complete SOC Vendor Brand String is constructed by concatenating in ascending order of
EAX:EBX:ECX:EDX and from the sub-leaf 1 fragment towards sub-leaf 3.
System-On-Chip Vendor Attribute Enumeration Sub-leaves (Initial EAX Value = 17H, ECX > MaxSOCID_Index)
17H NOTES:
Leaf 17H output depends on the initial value in ECX.
EAX Bits 31-00: Reserved = 0.
EBX Bits 31-00: Reserved = 0.
ECX Bits 31-00: Reserved = 0.
EDX Bits 31-00: Reserved = 0.
Deterministic Address Translation Parameters Main Leaf (Initial EAX Value = 18H, ECX = 0)
18H EAX Bits 31-00: Reports the maximum input value of supported sub-leaf in leaf 18H.
EBX Bit 00: 4K page size entries supported by this structure.
Bit 01: 2MB page size entries supported by this structure.
Bit 02: 4MB page size entries supported by this structure.
Bit 03: 1 GB page size entries supported by this structure.
Bits 07-04: Reserved.
Bits 10-08: Partitioning (0: Soft partitioning between the logical processors sharing this structure).
Bits 15-11: Reserved.
Bits 31-16: W = Ways of associativity.
ECX Bits 31-00: S = Number of Sets.
EDX Bits 04-00: Translation cache type field.
00000b: Null (indicates this sub-leaf is not valid).
00001b: Data TLB.
00010b: Instruction TLB.
00011b: Unified TLB*.
00100b: Load Only TLB. Hit on loads; fills on both loads and stores.
00101b: Store Only TLB. Hit on stores; fills on stores.
All other encodings are reserved.
Bits 07-05: Translation cache level (starts at 1).
Bit 08: Fully associative structure.
Bits 13-09: Reserved.
Bits 25-14: Maximum number of addressable IDs for logical processors sharing this translation cache.**
Bits 31-26: Reserved.
Deterministic Address Translation Parameters Sub-leaf (Initial EAX Value = 18H, ECX ≥ 1)
18H NOTES:
Each sub-leaf enumerates a different address translation structure.
If ECX contains an invalid sub-leaf index, EAX/EBX/ECX/EDX return 0. Sub-leaf index n is invalid if n
exceeds the value that sub-leaf 0 returns in EAX. A sub-leaf index is also invalid if EDX[4:0] returns 0.
Valid sub-leaves do not need to be contiguous or in any particular order. A valid sub-leaf may be in a
higher input ECX value than an invalid sub-leaf or than a valid sub-leaf of a higher or lower-level struc-
ture.
* Some unified TLBs will allow a single TLB entry to satisfy data read/write and instruction fetches. Others will require separate entries (e.g., one loaded on data read/write and another loaded on an instruction fetch). See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for details of a particular product.
** Add one to the return value to get the result.
NOTES:
* LAHF and SAHF are always available in other modes, regardless of the enumeration of this feature flag.
** Intel processors support SYSCALL and SYSRET only in 64-bit mode. This feature flag is always enumer-
ated as 0 outside 64-bit mode.
80000002H EAX Processor Brand String.
EBX Processor Brand String Continued.
ECX Processor Brand String Continued.
EDX Processor Brand String Continued.
80000003H EAX Processor Brand String Continued.
EBX Processor Brand String Continued.
ECX Processor Brand String Continued.
EDX Processor Brand String Continued.
80000004H EAX Processor Brand String Continued.
EBX Processor Brand String Continued.
ECX Processor Brand String Continued.
EDX Processor Brand String Continued.
80000005H EAX Reserved = 0.
EBX Reserved = 0.
ECX Reserved = 0.
EDX Reserved = 0.
80000006H EAX Reserved = 0.
EBX Reserved = 0.
ECX Bits 07-00: Cache Line size in bytes.
Bits 11-08: Reserved.
Bits 15-12: L2 Associativity field *.
Bits 31-16: Cache size in 1K units.
EDX Reserved = 0.
NOTES:
* L2 associativity field encodings:
00H - Disabled
01H - 1 way (direct mapped)
02H - 2 ways
03H - Reserved
04H - 4 ways
05H - Reserved
06H - 8 ways
07H - See CPUID leaf 04H, sub-leaf 2**
08H - 16 ways
09H - Reserved
0AH - 32 ways
0BH - 48 ways
0CH - 64 ways
0DH - 96 ways
0EH - 128 ways
0FH - Fully associative
** CPUID leaf 04H provides details of deterministic cache parameters, including the L2 cache in sub-leaf 2.
INPUT EAX = 0: Returns CPUID’s Highest Value for Basic Processor Information and the Vendor Identification String
When CPUID executes with EAX set to 0, the processor returns the highest value the CPUID recognizes for
returning basic processor information. The value is returned in the EAX register and is processor specific.
A vendor identification string is also returned in EBX, EDX, and ECX. For Intel processors, the string is “GenuineIntel” and is expressed:
EBX := 756e6547h (* “Genu”, with G in the low eight bits of BL *)
EDX := 49656e69h (* “ineI”, with i in the low eight bits of DL *)
ECX := 6c65746eh (* “ntel”, with n in the low eight bits of CL *)
INPUT EAX = 80000000H: Returns CPUID’s Highest Value for Extended Processor Information
When CPUID executes with EAX set to 80000000H, the processor returns the highest value the processor recog-
nizes for returning extended processor information. The value is returned in the EAX register and is processor
specific.
NOTE
See Chapter 20 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for information on identifying earlier IA-32 processors.
The Extended Family ID needs to be examined only when the Family ID is 0FH. Integrate the fields into a display
using the following rule:
IF Family_ID ≠ 0FH
THEN DisplayFamily = Family_ID;
ELSE DisplayFamily = Extended_Family_ID + Family_ID;
FI;
(* Show DisplayFamily as HEX field. *)
The Extended Model ID needs to be examined only when the Family ID is 06H or 0FH. Integrate the field into a
display using the following rule:
IF (Family_ID = 06H or Family_ID = 0FH)
THEN DisplayModel = (Extended_Model_ID << 4) + Model_ID;
(* Right justify and zero-extend 4-bit field; display Model_ID as HEX field.*)
ELSE DisplayModel = Model_ID;
FI;
(* Show DisplayModel as HEX field. *)
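A minimal C sketch of the two display rules above (assuming GCC or Clang and <cpuid.h>; the field extraction follows the CPUID leaf 01H EAX layout):
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;

    unsigned int stepping        = eax & 0xF;
    unsigned int model           = (eax >> 4) & 0xF;
    unsigned int family          = (eax >> 8) & 0xF;
    unsigned int extended_model  = (eax >> 16) & 0xF;
    unsigned int extended_family = (eax >> 20) & 0xFF;

    unsigned int display_family = (family != 0xF) ? family
                                                  : extended_family + family;
    unsigned int display_model  = (family == 0x6 || family == 0xF)
                                      ? (extended_model << 4) + model
                                      : model;

    printf("DisplayFamily_DisplayModel = %02X_%02XH (stepping %X)\n",
           display_family, display_model, stepping);
    return 0;
}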
NOTE
Software must confirm that a processor feature is present using feature flags returned by CPUID
prior to using the feature. Software should not depend on future offerings retaining all features.
Figure (OM16524b): Feature Information Returned in the ECX Register. Bit 31 and unmarked bits are reserved; the flags shown are, from most significant to least significant: RDRAND, F16C, AVX, OSXSAVE, XSAVE, AES, TSC-Deadline, POPCNT, MOVBE, x2APIC, SSE4_2 (SSE4.2), SSE4_1 (SSE4.1), DCA (Direct Cache Access), PCID (Process-context Identifiers), PDCM (Perf/Debug Capability MSR), xTPR Update Control, CMPXCHG16B, FMA (Fused Multiply Add), SDBG, CNXT-ID (L1 Context ID), SSSE3 (SSSE3 Extensions), TM2 (Thermal Monitor 2), EIST (Enhanced Intel SpeedStep® Technology), SMX (Safer Mode Extensions), VMX (Virtual Machine Extensions), DS-CPL (CPL Qualified Debug Store), MONITOR (MONITOR/MWAIT), DTES64 (64-bit DS Area), PCLMULQDQ (Carryless Multiplication), and SSE3 (SSE3 Extensions).
Figure (OM16523): Feature Information Returned in the EDX Register.
INPUT EAX = 02H: TLB/Cache/Prefetch Information Returned in EAX, EBX, ECX, EDX
When CPUID executes with EAX set to 02H, the processor returns information about the processor’s internal TLBs,
cache, and prefetch hardware in the EAX, EBX, ECX, and EDX registers. The information is reported in encoded
form and falls into the following categories:
• The least-significant byte in register EAX (register AL) will always return 01H. Software should ignore this value
and not interpret it as an informational descriptor.
• The most significant bit (bit 31) of each register indicates whether the register contains valid information (set
to 0) or is reserved (set to 1).
• If a register contains valid information, the information is contained in 1-byte descriptors. There are four types of encoding values for the byte descriptor; the encoding type is noted in the second column of Table 3-21, which lists the encoding of these descriptors. Note that the order of descriptors in the EAX, EBX, ECX, and EDX registers is not defined; that is, specific bytes are not designated to contain descriptors for specific cache, prefetch, or TLB types. The descriptors may appear in any order. Note also that a processor may report a general descriptor type (FFH) and not report any byte descriptor of “cache type” via CPUID leaf 2. A minimal enumeration sketch follows this list.
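A minimal enumeration sketch (assuming GCC or Clang and <cpuid.h>), which ignores AL, skips any register whose bit 31 is set, and prints the remaining non-zero descriptor bytes:
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int regs[4];
    if (!__get_cpuid(2, &regs[0], &regs[1], &regs[2], &regs[3]))
        return 1;

    for (int r = 0; r < 4; r++) {
        if (regs[r] & 0x80000000u)      /* bit 31 set: register is reserved */
            continue;
        for (int b = 0; b < 4; b++) {
            unsigned int desc = (regs[r] >> (8 * b)) & 0xFF;
            if (r == 0 && b == 0)       /* AL always returns 01H: ignore it */
                continue;
            if (desc != 0)
                printf("descriptor byte: %02XH\n", desc);
        }
    }
    return 0;
}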
INPUT EAX = 04H: Returns Deterministic Cache Parameters for Each Level
When CPUID executes with EAX set to 04H and ECX contains an index value, the processor returns encoded data
that describe a set of deterministic cache parameters (for the cache level associated with the input in ECX). Valid
index values start from 0.
Software can enumerate the deterministic cache parameters for each level of the cache hierarchy starting with an index value of 0, until the parameters report a cache type field value of 0. The architecturally defined fields reported by deterministic cache parameters are documented in Table 3-17.
This Cache Size in Bytes
= (Ways + 1) * (Partitions + 1) * (Line_Size + 1) * (Sets + 1)
= (EBX[31:22] + 1) * (EBX[21:12] + 1) * (EBX[11:0] + 1) * (ECX + 1)
The CPUID leaf 04H also reports data that can be used to derive the topology of processor cores in a physical
package. This information is constant for all valid index values. Software can query the raw data reported by
executing CPUID with EAX=04H and ECX=0 and use it as part of the topology enumeration algorithm described in
Chapter 10, “Multiple-Processor Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3A.
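A minimal C sketch applying the cache-size formula above while walking the leaf 04H sub-leaves (assuming GCC or Clang, which provide __get_cpuid_count in <cpuid.h>):
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    for (unsigned int index = 0; ; index++) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid_count(4, index, &eax, &ebx, &ecx, &edx))
            return 1;
        unsigned int cache_type = eax & 0x1F;
        if (cache_type == 0)            /* cache type 0: no more caches */
            break;

        unsigned int level = (eax >> 5) & 0x7;
        unsigned long long ways       = ((ebx >> 22) & 0x3FF) + 1;
        unsigned long long partitions = ((ebx >> 12) & 0x3FF) + 1;
        unsigned long long line_size  = (ebx & 0xFFF) + 1;
        unsigned long long sets       = (unsigned long long)ecx + 1;

        printf("L%u %s cache: %llu bytes\n", level,
               cache_type == 1 ? "data" :
               cache_type == 2 ? "instruction" : "unified",
               ways * partitions * line_size * sets);
    }
    return 0;
}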
INPUT EAX = 0FH: Returns Intel Resource Director Technology (Intel RDT) Monitoring Enumeration Information
When CPUID executes with EAX set to 0FH and ECX = 0, the processor returns information about the bit-vector
representation of QoS monitoring resource types that are supported in the processor and maximum range of RMID
values the processor can use to monitor any supported resource type. Each bit, starting from bit 1, corresponds
to a specific resource type if the bit is set. The bit position corresponds to the sub-leaf index (or ResID) that soft-
ware must use to query QoS monitoring capability available for that type. See Table 3-17.
When CPUID executes with EAX set to 0FH and ECX = n (n >= 1, and is a valid ResID), the processor returns infor-
mation software can use to program IA32_PQR_ASSOC, IA32_QM_EVTSEL MSRs before reading QoS data from the
IA32_QM_CTR MSR.
INPUT EAX = 15H: Returns Time Stamp Counter and Nominal Core Crystal Clock Information
When CPUID executes with EAX set to 15H and ECX = 0H, the processor returns information about Time Stamp
Counter and Core Crystal Clock. See Table 3-17.
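A minimal C sketch (assuming GCC or Clang and <cpuid.h>) that derives the nominal TSC frequency from leaf 15H as ECX * EBX / EAX when all three fields are enumerated:
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int denom, numer, crystal_hz, edx;
    if (!__get_cpuid_count(0x15, 0, &denom, &numer, &crystal_hz, &edx))
        return 1;
    (void)edx;
    if (denom == 0 || numer == 0 || crystal_hz == 0) {
        printf("Leaf 15H does not fully enumerate the TSC frequency here.\n");
        return 0;
    }
    unsigned long long tsc_hz =
        (unsigned long long)crystal_hz * numer / denom;
    printf("Nominal core crystal clock: %u Hz\n", crystal_hz);
    printf("TSC frequency: %llu Hz\n", tsc_hz);
    return 0;
}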
INPUT EAX = 24H: Returns Intel AVX10 Converged Vector ISA Information
When CPUID executes with EAX set to 24H, the processor returns Intel AVX10 converged vector ISA information.
See Table 3-17.
Figure (OM15194): Determination of support for the processor brand string. Execute CPUID with EAX = 80000000H; the brand string method is supported if the value returned in EAX is greater than or equal to 80000004H (extended functions supported).
Figure: Algorithm for extracting the processor base frequency from the brand string. Scan the reversed brand string for the substring “zHM”, “zHG”, or “zHT” (if no substring matches, report an error); determine the “Multiplier” from the match (“zHM” selects 1 x 10^6, “zHG” selects 1 x 10^9, “zHT” selects 1 x 10^12); scan the digits preceding the blank in reverse order and reverse them to obtain the decimal value “Freq” (e.g., “Freq” = X.YZ if the scanned digits are “ZY.X”); then Processor Base Frequency = “Freq” x “Multiplier”.
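A minimal C sketch of the extraction algorithm summarized above (assuming GCC or Clang and <cpuid.h>; it searches the brand string forward for “MHz”/“GHz”/“THz” rather than scanning the reversed string, which is equivalent):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cpuid.h>

int main(void)
{
    unsigned int regs[12] = {0};
    /* Brand string is returned in EAX..EDX of leaves 80000002H-80000004H. */
    for (unsigned int i = 0; i < 3; i++) {
        if (!__get_cpuid(0x80000002 + i,
                         &regs[i * 4], &regs[i * 4 + 1],
                         &regs[i * 4 + 2], &regs[i * 4 + 3]))
            return 1;
    }
    char brand[49];
    memcpy(brand, regs, 48);
    brand[48] = '\0';

    /* Find the frequency designator and its multiplier. */
    const char *unit;
    double multiplier;
    if ((unit = strstr(brand, "MHz")) != NULL)      multiplier = 1e6;
    else if ((unit = strstr(brand, "GHz")) != NULL) multiplier = 1e9;
    else if ((unit = strstr(brand, "THz")) != NULL) multiplier = 1e12;
    else { fprintf(stderr, "no frequency designator found\n"); return 1; }

    /* Back up over the numeric field (digits and '.') preceding the unit. */
    const char *p = unit;
    while (p > brand && (p[-1] == '.' || (p[-1] >= '0' && p[-1] <= '9')))
        p--;
    double freq = strtod(p, NULL);

    printf("Brand string: %s\n", brand);
    printf("Processor base frequency: %.0f Hz\n", freq * multiplier);
    return 0;
}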
Operation
IA32_BIOS_SIGN_ID MSR := Update with installed microcode revision number;
CASE (EAX) OF
EAX = 0:
EAX := Highest basic function input value understood by CPUID;
EBX := Vendor identification string;
EDX := Vendor identification string;
ECX := Vendor identification string;
BREAK;
EAX = 1H:
EAX[3:0] := Stepping ID;
EAX[7:4] := Model;
EAX[11:8] := Family;
Flags Affected
None.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts two, four or eight packed signed doubleword integers in the source operand (the second operand) to two,
four or eight packed double precision floating-point values in the destination operand (the first operand).
EVEX encoded versions: The source operand can be a YMM/XMM/XMM (low 64 bits) register, a 256/128/64-bit
memory location or a 256/128/64-bit vector broadcasted from a 32-bit memory location. The destination operand
is a ZMM/YMM/XMM register conditionally updated with writemask k1. An attempt to encode this instruction with EVEX embedded rounding is ignored.
VEX.256 encoded version: The source operand is an XMM register or 128-bit memory location. The destination operand is a YMM register.
VEX.128 encoded version: The source operand is an XMM register or 64-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 64-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Figure: CVTDQ2PD operation. The packed doubleword integers X3..X0 in SRC are each converted to a double precision floating-point value in DEST.
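A minimal intrinsic-level sketch (assuming an SSE2-capable target and <emmintrin.h>); _mm_cvtepi32_pd is the 128-bit intrinsic equivalent of this conversion:
#include <stdio.h>
#include <emmintrin.h>

int main(void)
{
    __m128i src = _mm_setr_epi32(-7, 42, 0, 0); /* only the low two are used */
    __m128d dst = _mm_cvtepi32_pd(src);         /* compiles to CVTDQ2PD */

    double out[2];
    _mm_storeu_pd(out, dst);
    printf("%f %f\n", out[0], out[1]);          /* -7.000000 42.000000 */
    return 0;
}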
Operation
VCVTDQ2PD (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Integer_To_Double_Precision_Floating_Point(SRC[k+31:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTDQ2PD (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
Convert_Integer_To_Double_Precision_Floating_Point(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_Integer_To_Double_Precision_Floating_Point(SRC[k+31:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-53, “Type E5 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
CVTDQ2PS—Convert Packed Doubleword Integers to Packed Single Precision Floating-Point
Values
Opcode/Instruction (Op/En, 64/32-bit Mode Support, CPUID Feature Flag): Description
• NP 0F 5B /r CVTDQ2PS xmm1, xmm2/m128 (A, V/V, SSE2): Convert four packed signed doubleword integers from xmm2/mem to four packed single precision floating-point values in xmm1.
• VEX.128.0F.WIG 5B /r VCVTDQ2PS xmm1, xmm2/m128 (A, V/V, AVX): Convert four packed signed doubleword integers from xmm2/mem to four packed single precision floating-point values in xmm1.
• VEX.256.0F.WIG 5B /r VCVTDQ2PS ymm1, ymm2/m256 (A, V/V, AVX): Convert eight packed signed doubleword integers from ymm2/mem to eight packed single precision floating-point values in ymm1.
• EVEX.128.0F.W0 5B /r VCVTDQ2PS xmm1 {k1}{z}, xmm2/m128/m32bcst (B, V/V, (AVX512VL AND AVX512F) OR AVX10.1, see Note 1): Convert four packed signed doubleword integers from xmm2/m128/m32bcst to four packed single precision floating-point values in xmm1 with writemask k1.
• EVEX.256.0F.W0 5B /r VCVTDQ2PS ymm1 {k1}{z}, ymm2/m256/m32bcst (B, V/V, (AVX512VL AND AVX512F) OR AVX10.1, see Note 1): Convert eight packed signed doubleword integers from ymm2/m256/m32bcst to eight packed single precision floating-point values in ymm1 with writemask k1.
• EVEX.512.0F.W0 5B /r VCVTDQ2PS zmm1 {k1}{z}, zmm2/m512/m32bcst {er} (B, V/V, AVX512F OR AVX10.1, see Note 1): Convert sixteen packed signed doubleword integers from zmm2/m512/m32bcst to sixteen packed single precision floating-point values in zmm1 with writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts four, eight or sixteen packed signed doubleword integers in the source operand to four, eight or sixteen
packed single precision floating-point values in the destination operand.
EVEX encoded versions: The source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a
ZMM/YMM/XMM register conditionally updated with writemask k1.
VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination operand is a YMM register. Bits (MAXVL-1:256) of the corresponding register destination are zeroed.
VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding register destination are zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTDQ2PS (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC); ; refer to Table 15-4 in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 1
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC); ; refer to Table 15-4 in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 1
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Integer_To_Single_Precision_Floating_Point(SRC[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTDQ2PS (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Integer_To_Single_Precision_Floating_Point(SRC[31:0])
ELSE
DEST[i+31:i] :=
Convert_Integer_To_Single_Precision_Floating_Point(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTDQ2PS (VEX.256 Encoded Version)
DEST[31:0] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[31:0])
DEST[63:32] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[63:32])
DEST[95:64] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[95:64])
DEST[127:96] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[127:96])
DEST[159:128] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[159:128])
DEST[191:160] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[191:160])
DEST[223:192] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[223:192])
DEST[255:224] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[255:224])
DEST[MAXVL-1:256] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
CVTPD2DQ—Convert Packed Double Precision Floating-Point Values to Packed Doubleword
Integers
Opcode/Instruction (Op/En, 64/32-bit Mode Support, CPUID Feature Flag): Description
• F2 0F E6 /r CVTPD2DQ xmm1, xmm2/m128 (A, V/V, SSE2): Convert two packed double precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1.
• VEX.128.F2.0F.WIG E6 /r VCVTPD2DQ xmm1, xmm2/m128 (A, V/V, AVX): Convert two packed double precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1.
• VEX.256.F2.0F.WIG E6 /r VCVTPD2DQ xmm1, ymm2/m256 (A, V/V, AVX): Convert four packed double precision floating-point values in ymm2/mem to four signed doubleword integers in xmm1.
• EVEX.128.F2.0F.W1 E6 /r VCVTPD2DQ xmm1 {k1}{z}, xmm2/m128/m64bcst (B, V/V, (AVX512VL AND AVX512F) OR AVX10.1, see Note 1): Convert two packed double precision floating-point values in xmm2/m128/m64bcst to two signed doubleword integers in xmm1 subject to writemask k1.
• EVEX.256.F2.0F.W1 E6 /r VCVTPD2DQ xmm1 {k1}{z}, ymm2/m256/m64bcst (B, V/V, (AVX512VL AND AVX512F) OR AVX10.1, see Note 1): Convert four packed double precision floating-point values in ymm2/m256/m64bcst to four signed doubleword integers in xmm1 subject to writemask k1.
• EVEX.512.F2.0F.W1 E6 /r VCVTPD2DQ ymm1 {k1}{z}, zmm2/m512/m64bcst {er} (B, V/V, AVX512F OR AVX10.1, see Note 1): Convert eight packed double precision floating-point values in zmm2/m512/m64bcst to eight signed doubleword integers in ymm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts packed double precision floating-point values in the source operand (second operand) to packed signed
doubleword integers in the destination operand (first operand).
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value
(2^(w-1), where w represents the number of bits in the destination format) is returned.
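A minimal C sketch of this behavior (assuming an SSE2-capable target, <emmintrin.h>, and the default MXCSR state, in which the invalid exception is masked):
#include <stdio.h>
#include <emmintrin.h>

int main(void)
{
    /* 1e30 cannot be represented as a signed doubleword integer. */
    __m128d src = _mm_setr_pd(1e30, -3.7);
    __m128i dst = _mm_cvtpd_epi32(src);          /* compiles to CVTPD2DQ */

    int out[4];
    _mm_storeu_si128((__m128i *)out, dst);
    /* out[0] is the indefinite integer 80000000H; out[1] is -4 (rounded to
       nearest under the default MXCSR rounding control); out[2..3] are 0. */
    printf("%08X %d %d %d\n", (unsigned)out[0], out[1], out[2], out[3]);
    return 0;
}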
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a YMM/XMM/XMM (low 64 bits) register conditionally updated with writemask k1. The upper bits (MAXVL-1:256/128/64) of the corresponding destination are zeroed.
VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:64) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. Bits[127:64] of the destination XMM register are zeroed. However, the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Figure: VCVTPD2DQ (VEX.256 encoded version). The four packed double precision values X3..X0 in SRC are converted to four signed doubleword integers in the lower half of DEST; the upper half of DEST is zeroed.
Operation
VCVTPD2DQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_Integer(SRC[k+63:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
VCVTPD2DQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0])
ELSE
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_Integer(SRC[k+63:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VCVTPD2DQ __m256i _mm512_cvtpd_epi32( __m512d a);
VCVTPD2DQ __m256i _mm512_mask_cvtpd_epi32( __m256i s, __mmask8 k, __m512d a);
VCVTPD2DQ __m256i _mm512_maskz_cvtpd_epi32( __mmask8 k, __m512d a);
VCVTPD2DQ __m256i _mm512_cvt_roundpd_epi32( __m512d a, int r);
VCVTPD2DQ __m256i _mm512_mask_cvt_roundpd_epi32( __m256i s, __mmask8 k, __m512d a, int r);
VCVTPD2DQ __m256i _mm512_maskz_cvt_roundpd_epi32( __mmask8 k, __m512d a, int r);
VCVTPD2DQ __m128i _mm256_mask_cvtpd_epi32( __m128i s, __mmask8 k, __m256d a);
VCVTPD2DQ __m128i _mm256_maskz_cvtpd_epi32( __mmask8 k, __m256d a);
VCVTPD2DQ __m128i _mm_mask_cvtpd_epi32( __m128i s, __mmask8 k, __m128d a);
VCVTPD2DQ __m128i _mm_maskz_cvtpd_epi32( __mmask8 k, __m128d a);
VCVTPD2DQ __m128i _mm256_cvtpd_epi32 (__m256d src)
CVTPD2DQ __m128i _mm_cvtpd_epi32 (__m128d src)
Other Exceptions
See Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
CVTPD2PS—Convert Packed Double Precision Floating-Point Values to Packed Single Precision
Floating-Point Values
Opcode/Instruction (Op/En, 64/32-bit Mode Support, CPUID Feature Flag): Description
• 66 0F 5A /r CVTPD2PS xmm1, xmm2/m128 (A, V/V, SSE2): Convert two packed double precision floating-point values in xmm2/mem to two single precision floating-point values in xmm1.
• VEX.128.66.0F.WIG 5A /r VCVTPD2PS xmm1, xmm2/m128 (A, V/V, AVX): Convert two packed double precision floating-point values in xmm2/mem to two single precision floating-point values in xmm1.
• VEX.256.66.0F.WIG 5A /r VCVTPD2PS xmm1, ymm2/m256 (A, V/V, AVX): Convert four packed double precision floating-point values in ymm2/mem to four single precision floating-point values in xmm1.
• EVEX.128.66.0F.W1 5A /r VCVTPD2PS xmm1 {k1}{z}, xmm2/m128/m64bcst (B, V/V, (AVX512VL AND AVX512F) OR AVX10.1, see Note 1): Convert two packed double precision floating-point values in xmm2/m128/m64bcst to two single precision floating-point values in xmm1 with writemask k1.
• EVEX.256.66.0F.W1 5A /r VCVTPD2PS xmm1 {k1}{z}, ymm2/m256/m64bcst (B, V/V, (AVX512VL AND AVX512F) OR AVX10.1, see Note 1): Convert four packed double precision floating-point values in ymm2/m256/m64bcst to four single precision floating-point values in xmm1 with writemask k1.
• EVEX.512.66.0F.W1 5A /r VCVTPD2PS ymm1 {k1}{z}, zmm2/m512/m64bcst {er} (B, V/V, AVX512F OR AVX10.1, see Note 1): Convert eight packed double precision floating-point values in zmm2/m512/m64bcst to eight single precision floating-point values in ymm1 with writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts two, four or eight packed double precision floating-point values in the source operand (second operand)
to two, four or eight packed single precision floating-point values in the destination operand (first operand).
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or
a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a
YMM/XMM/XMM (low 64-bits) register conditionally updated with writemask k1. The upper bits (MAXVL-
1:256/128/64) of the corresponding destination are zeroed.
VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:64) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. Bits[127:64] of the destination XMM register are zeroed. However, the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
Figure: VCVTPD2PS (VEX.256 encoded version). The four packed double precision values X3..X0 in SRC are converted to four single precision values in the lower half of DEST; the upper half of DEST is zeroed.
Operation
VCVTPD2PS (EVEX Encoded Version) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
DEST[i+31:i] := Convert_Double_Precision_Floating_Point_To_Single_Precision_Floating_Point(SRC[k+63:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
VCVTPD2PS (EVEX Encoded Version) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=Convert_Double_Precision_Floating_Point_To_Single_Precision_Floating_Point(SRC[63:0])
ELSE
DEST[i+31:i] := Convert_Double_Precision_Floating_Point_To_Single_Precision_Floating_Point(SRC[k+63:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VCVTPD2PS __m256 _mm512_cvtpd_ps( __m512d a);
VCVTPD2PS __m256 _mm512_mask_cvtpd_ps( __m256 s, __mmask8 k, __m512d a);
VCVTPD2PS __m256 _mm512_maskz_cvtpd_ps( __mmask8 k, __m512d a);
VCVTPD2PS __m256 _mm512_cvt_roundpd_ps( __m512d a, int r);
VCVTPD2PS __m256 _mm512_mask_cvt_roundpd_ps( __m256 s, __mmask8 k, __m512d a, int r);
VCVTPD2PS __m256 _mm512_maskz_cvt_roundpd_ps( __mmask8 k, __m512d a, int r);
VCVTPD2PS __m128 _mm256_mask_cvtpd_ps( __m128 s, __mmask8 k, __m256d a);
VCVTPD2PS __m128 _mm256_maskz_cvtpd_ps( __mmask8 k, __m256d a);
VCVTPD2PS __m128 _mm_mask_cvtpd_ps( __m128 s, __mmask8 k, __m128d a);
VCVTPD2PS __m128 _mm_maskz_cvtpd_ps( __mmask8 k, __m128d a);
VCVTPD2PS __m128 _mm256_cvtpd_ps (__m256d a)
CVTPD2PS __m128 _mm_cvtpd_ps (__m128d a)
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
CVTPS2DQ—Convert Packed Single Precision Floating-Point Values to Packed Signed
Doubleword Integer Values
Opcode/Instruction (Op/En, 64/32-bit Mode Support, CPUID Feature Flag): Description
• 66 0F 5B /r CVTPS2DQ xmm1, xmm2/m128 (A, V/V, SSE2): Convert four packed single precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1.
• VEX.128.66.0F.WIG 5B /r VCVTPS2DQ xmm1, xmm2/m128 (A, V/V, AVX): Convert four packed single precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1.
• VEX.256.66.0F.WIG 5B /r VCVTPS2DQ ymm1, ymm2/m256 (A, V/V, AVX): Convert eight packed single precision floating-point values from ymm2/mem to eight packed signed doubleword values in ymm1.
• EVEX.128.66.0F.W0 5B /r VCVTPS2DQ xmm1 {k1}{z}, xmm2/m128/m32bcst (B, V/V, (AVX512VL AND AVX512F) OR AVX10.1, see Note 1): Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed signed doubleword values in xmm1 subject to writemask k1.
• EVEX.256.66.0F.W0 5B /r VCVTPS2DQ ymm1 {k1}{z}, ymm2/m256/m32bcst (B, V/V, (AVX512VL AND AVX512F) OR AVX10.1, see Note 1): Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed signed doubleword values in ymm1 subject to writemask k1.
• EVEX.512.66.0F.W0 5B /r VCVTPS2DQ zmm1 {k1}{z}, zmm2/m512/m32bcst {er} (B, V/V, AVX512F OR AVX10.1, see Note 1): Convert sixteen packed single precision floating-point values from zmm2/m512/m32bcst to sixteen packed signed doubleword values in zmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts four, eight or sixteen packed single precision floating-point values in the source operand to four, eight or
sixteen signed doubleword integers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value
(2^(w-1), where w represents the number of bits in the destination format) is returned.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.
VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
Operation
VCVTPS2DQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_Integer(SRC[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTPS2DQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0])
ELSE
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_Integer(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
CVTPS2PD—Convert Packed Single Precision Floating-Point Values to Packed Double Precision
Floating-Point Values
Opcode/Instruction (Op/En, 64/32-bit Mode Support, CPUID Feature Flag): Description
• NP 0F 5A /r CVTPS2PD xmm1, xmm2/m64 (A, V/V, SSE2): Convert two packed single precision floating-point values in xmm2/m64 to two packed double precision floating-point values in xmm1.
• VEX.128.0F.WIG 5A /r VCVTPS2PD xmm1, xmm2/m64 (A, V/V, AVX): Convert two packed single precision floating-point values in xmm2/m64 to two packed double precision floating-point values in xmm1.
• VEX.256.0F.WIG 5A /r VCVTPS2PD ymm1, xmm2/m128 (A, V/V, AVX): Convert four packed single precision floating-point values in xmm2/m128 to four packed double precision floating-point values in ymm1.
• EVEX.128.0F.W0 5A /r VCVTPS2PD xmm1 {k1}{z}, xmm2/m64/m32bcst (B, V/V, (AVX512VL AND AVX512F) OR AVX10.1, see Note 1): Convert two packed single precision floating-point values in xmm2/m64/m32bcst to packed double precision floating-point values in xmm1 with writemask k1.
• EVEX.256.0F.W0 5A /r VCVTPS2PD ymm1 {k1}{z}, xmm2/m128/m32bcst (B, V/V, (AVX512VL AND AVX512F) OR AVX10.1, see Note 1): Convert four packed single precision floating-point values in xmm2/m128/m32bcst to packed double precision floating-point values in ymm1 with writemask k1.
• EVEX.512.0F.W0 5A /r VCVTPS2PD zmm1 {k1}{z}, ymm2/m256/m32bcst {sae} (B, V/V, AVX512F OR AVX10.1, see Note 1): Convert eight packed single precision floating-point values in ymm2/m256/m32bcst to eight packed double precision floating-point values in zmm1 with writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts two, four or eight packed single precision floating-point values in the source operand (second operand) to
two, four or eight packed double precision floating-point values in the destination operand (first operand).
EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64-bits) register, a 256/128/64-bit memory
location or a 256/128/64-bit vector broadcasted from a 32-bit memory location. The destination operand is a
ZMM/YMM/XMM register conditionally updated with writemask k1.
VEX.256 encoded version: The source operand is an XMM register or 128-bit memory location. The destination operand is a YMM register. Bits (MAXVL-1:256) of the corresponding destination ZMM register are zeroed.
VEX.128 encoded version: The source operand is an XMM register or 64-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 64-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
Figure: CVTPS2PD operation. The packed single precision values X3..X0 in SRC are each converted to a double precision floating-point value in DEST.
Operation
VCVTPS2PD (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[k+31:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTPS2PD (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[k+31:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
CVTSD2SI—Convert Scalar Double Precision Floating-Point Value to Doubleword Integer
Opcode/Instruction (Op/En, 64/32-bit Mode Support, CPUID Feature Flag): Description
• F2 0F 2D /r CVTSD2SI r32, xmm1/m64 (A, V/V, SSE2): Convert one double precision floating-point value from xmm1/m64 to one signed doubleword integer r32.
• F2 REX.W 0F 2D /r CVTSD2SI r64, xmm1/m64 (A, V/N.E., SSE2): Convert one double precision floating-point value from xmm1/m64 to one signed quadword integer sign-extended into r64.
• VEX.LIG.F2.0F.W0 2D /r VCVTSD2SI r32, xmm1/m64 (A, V/V, AVX; see Note 1): Convert one double precision floating-point value from xmm1/m64 to one signed doubleword integer r32.
• VEX.LIG.F2.0F.W1 2D /r VCVTSD2SI r64, xmm1/m64 (A, V/N.E. (see Note 2), AVX; see Note 1): Convert one double precision floating-point value from xmm1/m64 to one signed quadword integer sign-extended into r64.
• EVEX.LLIG.F2.0F.W0 2D /r VCVTSD2SI r32, xmm1/m64{er} (B, V/V, AVX512F OR AVX10.1, see Note 3): Convert one double precision floating-point value from xmm1/m64 to one signed doubleword integer r32.
• EVEX.LLIG.F2.0F.W1 2D /r VCVTSD2SI r64, xmm1/m64{er} (B, V/N.E. (see Note 2), AVX512F OR AVX10.1, see Note 3): Convert one double precision floating-point value from xmm1/m64 to one signed quadword integer sign-extended into r64.
NOTES:
1. Software should ensure VCVTSD2SI is encoded with VEX.L=0. Encoding VCVTSD2SI with VEX.L=1 may encounter unpredictable
behavior across different processor generations.
2. VEX.W1/EVEX.W1 in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts a double precision floating-point value in the source operand (the second operand) to a signed double-
word integer in the destination operand (first operand). The source operand can be an XMM register or a 64-bit
memory location. The destination operand is a general-purpose register. When the source operand is an XMM
register, the double precision floating-point value is contained in the low quadword of the register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register.
If a converted result exceeds the range limits of signed doubleword integer (in non-64-bit modes or 64-bit mode
with REX.W/VEX.W/EVEX.W=0), the floating-point invalid exception is raised, and if this exception is masked, the
indefinite integer value (80000000H) is returned.
If a converted result exceeds the range limits of signed quadword integer (in 64-bit mode and
REX.W/VEX.W/EVEX.W = 1), the floating-point invalid exception is raised, and if this exception is masked, the
indefinite integer value (80000000_00000000H) is returned.
Legacy SSE instruction: Use of the REX.W prefix promotes the instruction to produce 64-bit data in 64-bit mode.
See the summary chart at the beginning of this section for encoding data and limits.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VCVTSD2SI is encoded with VEX.L=0. Encoding VCVTSD2SI with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
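A minimal intrinsic-level sketch (assuming an SSE2-capable target and <emmintrin.h>; the 64-bit form requires 64-bit mode, as noted above):
#include <stdio.h>
#include <emmintrin.h>

int main(void)
{
    __m128d a = _mm_set_sd(2.5);

    /* Default MXCSR rounding is round-to-nearest-even, so 2.5 -> 2. */
    int r32 = _mm_cvtsd_si32(a);
    printf("CVTSD2SI r32: %d\n", r32);

#if defined(__x86_64__) || defined(_M_X64)
    long long r64 = _mm_cvtsd_si64(a);   /* REX.W form, 64-bit mode only */
    printf("CVTSD2SI r64: %lld\n", r64);
#endif
    return 0;
}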
Operation
VCVTSD2SI (EVEX Encoded Version)
IF SRC *is register* AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-Bit Mode and OperandSize = 64
THEN DEST[63:0] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0]);
ELSE DEST[31:0] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0]);
FI
(V)CVTSD2SI
IF 64-Bit Mode and OperandSize = 64
THEN
DEST[63:0] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0]);
ELSE
DEST[31:0] := Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0]);
FI;
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
CVTSD2SS—Convert Scalar Double Precision Floating-Point Value to Scalar Single Precision
Floating-Point Value
Opcode/Instruction (Op/En, 64/32-bit Mode Support, CPUID Feature Flag): Description
• F2 0F 5A /r CVTSD2SS xmm1, xmm2/m64 (A, V/V, SSE2): Convert one double precision floating-point value in xmm2/m64 to one single precision floating-point value in xmm1.
• VEX.LIG.F2.0F.WIG 5A /r VCVTSD2SS xmm1, xmm2, xmm3/m64 (B, V/V, AVX): Convert one double precision floating-point value in xmm3/m64 to one single precision floating-point value and merge with high bits in xmm2.
• EVEX.LLIG.F2.0F.W1 5A /r VCVTSD2SS xmm1 {k1}{z}, xmm2, xmm3/m64{er} (C, V/V, AVX512F OR AVX10.1, see Note 1): Convert one double precision floating-point value in xmm3/m64 to one single precision floating-point value and merge with high bits in xmm2 under writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts a double precision floating-point value in the “convert-from” source operand (the second operand in SSE2
version, otherwise the third operand) to a single precision floating-point value in the destination operand.
When the “convert-from” operand is an XMM register, the double precision floating-point value is contained in the
low quadword of the register. The result is stored in the low doubleword of the destination operand. When the
conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register.
128-bit Legacy SSE version: The “convert-from” source operand (the second operand) is an XMM register or
memory location. Bits (MAXVL-1:32) of the corresponding destination register remain unchanged. The destination
operand is an XMM register.
VEX.128 and EVEX encoded versions: The “convert-from” source operand (the third operand) can be an XMM
register or a 64-bit memory location. The first source and destination operands are XMM registers. Bits (127:32) of
the XMM register destination are copied from the corresponding bits in the first source operand. Bits (MAXVL-
1:128) of the destination register are zeroed.
EVEX encoded version: The converted result is written to the low doubleword element of the destination under the writemask.
Software should ensure VCVTSD2SS is encoded with VEX.L=0. Encoding VCVTSD2SS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
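A minimal intrinsic-level sketch (assuming an SSE2-capable target and <emmintrin.h>); _mm_cvtsd_ss performs the narrowing conversion and merges the upper elements from its first source, as described above:
#include <stdio.h>
#include <emmintrin.h>

int main(void)
{
    __m128  upper = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);
    __m128d src   = _mm_set_sd(1.0 / 3.0);

    /* Low element becomes the rounded single precision value; elements
       1..3 are taken from the first source operand. */
    __m128 dst = _mm_cvtsd_ss(upper, src);

    float out[4];
    _mm_storeu_ps(out, dst);
    printf("%.8f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}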
Operation
VCVTSD2SS (EVEX Encoded Version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC2[63:0]);
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
CVTSI2SD—Convert Doubleword Integer to Scalar Double Precision Floating-Point Value
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature
Support Flag
F2 0F 2A /r A V/V SSE2 Convert one signed doubleword integer from
CVTSI2SD xmm1, r32/m32 r32/m32 to one double precision floating-point
value in xmm1.
F2 REX.W 0F 2A /r A V/N.E. SSE2 Convert one signed quadword integer from r/m64
CVTSI2SD xmm1, r/m64 to one double precision floating-point value in
xmm1.
VEX.LIG.F2.0F.W0 2A /r B V/V AVX Convert one signed doubleword integer from
VCVTSI2SD xmm1, xmm2, r/m32 r/m32 to one double precision floating-point value
in xmm1.
VEX.LIG.F2.0F.W1 2A /r B V/N.E.1 AVX Convert one signed quadword integer from r/m64
VCVTSI2SD xmm1, xmm2, r/m64 to one double precision floating-point value in
xmm1.
EVEX.LLIG.F2.0F.W0 2A /r C V/V AVX512F Convert one signed doubleword integer from
VCVTSI2SD xmm1, xmm2, r/m32 OR r/m32 to one double precision floating-point value
AVX10.12 in xmm1.
EVEX.LLIG.F2.0F.W1 2A /r C V/N.E.1 AVX512F Convert one signed quadword integer from r/m64
VCVTSI2SD xmm1, xmm2, r/m64{er} OR to one double precision floating-point value in
AVX10.12 xmm1.
NOTES:
1. VEX.W1/EVEX.W1 in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts a signed doubleword integer (or signed quadword integer if operand size is 64 bits) in the “convert-from”
source operand to a double precision floating-point value in the destination operand. The result is stored in the low
quadword of the destination operand, and the high quadword is left unchanged. When the conversion is inexact, the
value returned is rounded according to the rounding control bits in the MXCSR register.
The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and
destination operands are XMM registers.
128-bit Legacy SSE version: Use of the REX.W prefix promotes the instruction to 64-bit operands. The “convert-
from” source operand (the second operand) is a general-purpose register or memory location. The destination is
an XMM register. Bits (MAXVL-1:64) of the corresponding destination register remain unchanged.
VEX.128 and EVEX encoded versions: The “convert-from” source operand (the third operand) can be a general-
purpose register or a memory location. The first source and destination operands are XMM registers. Bits (127:64)
of the XMM register destination are copied from the corresponding bits in the first source operand. Bits (MAXVL-
1:128) of the destination register are zeroed.
EVEX.W0 version: An attempt to encode this instruction with EVEX embedded rounding is ignored.
VEX.W1 and EVEX.W1 versions: promotes the instruction to use 64-bit input value in 64-bit mode.
Software should ensure VCVTSI2SD is encoded with VEX.L=0. Encoding VCVTSI2SD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
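A minimal C sketch of this conversion via the SSE2 intrinsic _mm_cvtsi32_sd (the quadword form is _mm_cvtsi64_sd when compiling for 64-bit mode); illustrative only.

#include <immintrin.h>

/* CVTSI2SD: the low quadword of the result receives (double)value; the high quadword
   is taken from the first operand, matching the merge behavior described above. */
__m128d int32_to_double_low(__m128d high_src, int value) {
    return _mm_cvtsi32_sd(high_src, value);
}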
Operation
VCVTSI2SD (EVEX Encoded Version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-Bit Mode And OperandSize = 64
THEN
DEST[63:0] := Convert_Integer_To_Double_Precision_Floating_Point(SRC2[63:0]);
ELSE
DEST[63:0] := Convert_Integer_To_Double_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0
CVTSI2SD
IF 64-Bit Mode And OperandSize = 64
THEN
DEST[63:0] := Convert_Integer_To_Double_Precision_Floating_Point(SRC[63:0]);
ELSE
DEST[63:0] := Convert_Integer_To_Double_Precision_Floating_Point(SRC[31:0]);
FI;
DEST[MAXVL-1:64] (Unmodified)
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions,” if W1; else see Table 2-22, “Type
5 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions,” if W1; else see Table 2-61,
“Type E10NF Class Exception Conditions.”
CVTSI2SS—Convert Doubleword Integer to Scalar Single Precision Floating-Point Value
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature
Support Flag
F3 0F 2A /r A V/V SSE Convert one signed doubleword integer from r/m32
CVTSI2SS xmm1, r/m32 to one single precision floating-point value in xmm1.
F3 REX.W 0F 2A /r A V/N.E. SSE Convert one signed quadword integer from r/m64 to
CVTSI2SS xmm1, r/m64 one single precision floating-point value in xmm1.
VEX.LIG.F3.0F.W0 2A /r B V/V AVX Convert one signed doubleword integer from r/m32
VCVTSI2SS xmm1, xmm2, r/m32 to one single precision floating-point value in xmm1.
VEX.LIG.F3.0F.W1 2A /r B V/N.E.1 AVX Convert one signed quadword integer from r/m64 to
VCVTSI2SS xmm1, xmm2, r/m64 one single precision floating-point value in xmm1.
EVEX.LLIG.F3.0F.W0 2A /r C V/V AVX512F Convert one signed doubleword integer from r/m32
VCVTSI2SS xmm1, xmm2, r/m32{er} OR to one single precision floating-point value in xmm1.
AVX10.12
EVEX.LLIG.F3.0F.W1 2A /r C V/N.E.1 AVX512F Convert one signed quadword integer from r/m64 to
VCVTSI2SS xmm1, xmm2, r/m64{er} OR one single precision floating-point value in xmm1.
AVX10.12
NOTES:
1. VEX.W1/EVEX.W1 in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts a signed doubleword integer (or signed quadword integer if operand size is 64 bits) in the “convert-from”
source operand to a single precision floating-point value in the destination operand (first operand). The “convert-
from” source operand can be a general-purpose register or a memory location. The destination operand is an XMM
register. The result is stored in the low doubleword of the destination operand, and the upper three doublewords
are left unchanged. When a conversion is inexact, the value returned is rounded according to the rounding control
bits in the MXCSR register or the embedded rounding control bits.
128-bit Legacy SSE version: In 64-bit mode, use of the REX.W prefix promotes the instruction to use a 64-bit input
value. The “convert-from” source operand (the second operand) is a general-purpose register or memory location.
Bits (MAXVL-1:32) of the corresponding destination register remain unchanged.
VEX.128 and EVEX encoded versions: The “convert-from” source operand (the third operand) can be a general-
purpose register or a memory location. The first source and destination operands are XMM registers. Bits (127:32)
of the XMM register destination are copied from corresponding bits in the first source operand. Bits (MAXVL-1:128)
of the destination register are zeroed.
EVEX encoded version: The converted result is written to the low doubleword element of the destination under the
writemask.
Software should ensure VCVTSI2SS is encoded with VEX.L=0. Encoding VCVTSI2SS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
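A corresponding C sketch using the SSE intrinsic _mm_cvtsi32_ss (illustrative only); integers with more than 24 significant bits are rounded according to MXCSR.RC, as described above.

#include <immintrin.h>

/* CVTSI2SS: low dword := (float)value; bits 127:32 are taken from the first operand. */
__m128 int32_to_float_low(__m128 high_src, int value) {
    return _mm_cvtsi32_ss(high_src, value);
}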
Operation
VCVTSI2SS (EVEX Encoded Version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-Bit Mode And OperandSize = 64
THEN
DEST[31:0] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[63:0]);
ELSE
DEST[31:0] := Convert_Integer_To_Single_Precision_Floating_Point(SRC[31:0]);
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
CVTSS2SD—Convert Scalar Single Precision Floating-Point Value to Scalar Double Precision
Floating-Point Value
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F3 0F 5A /r A V/V SSE2 Convert one single precision floating-point value in
CVTSS2SD xmm1, xmm2/m32 xmm2/m32 to one double precision floating-point value
in xmm1.
VEX.LIG.F3.0F.WIG 5A /r B V/V AVX Convert one single precision floating-point value in
VCVTSS2SD xmm1, xmm2, xmm3/m32 to one double precision floating-point value
xmm3/m32 and merge with high bits of xmm2.
EVEX.LLIG.F3.0F.W0 5A /r C V/V AVX512F Convert one single precision floating-point value in
VCVTSS2SD xmm1 {k1}{z}, xmm2, OR AVX10.11 xmm3/m32 to one double precision floating-point value
xmm3/m32{sae} and merge with high bits of xmm2 under writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts a single precision floating-point value in the “convert-from” source operand to a double precision floating-
point value in the destination operand. When the “convert-from” source operand is an XMM register, the single
precision floating-point value is contained in the low doubleword of the register. The result is stored in the low
quadword of the destination operand.
128-bit Legacy SSE version: The “convert-from” source operand (the second operand) is an XMM register or
memory location. Bits (MAXVL-1:64) of the corresponding destination register remain unchanged. The destination
operand is an XMM register.
VEX.128 and EVEX encoded versions: The “convert-from” source operand (the third operand) can be an XMM
register or a 32-bit memory location. The first source and destination operands are XMM registers. Bits (127:64) of
the XMM register destination are copied from the corresponding bits in the first source operand. Bits (MAXVL-
1:128) of the destination register are zeroed.
Software should ensure VCVTSS2SD is encoded with VEX.L=0. Encoding VCVTSS2SD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
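For reference, the conversion is available in C as the SSE2 intrinsic _mm_cvtss_sd; since every single precision value is exactly representable in double precision, the conversion never rounds. Illustrative sketch only.

#include <immintrin.h>

/* CVTSS2SD: low quadword := (double) of the low single in f; bits 127:64 come from high_src. */
__m128d widen_low_float(__m128d high_src, __m128 f) {
    return _mm_cvtss_sd(high_src, f);
}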
Operation
VCVTSS2SD (EVEX Encoded Version)
IF k1[0] or *no writemask*
THEN DEST[63:0] := Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC2[31:0]);
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
CVTSS2SI—Convert Scalar Single Precision Floating-Point Value to Doubleword Integer
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F3 0F 2D /r A V/V SSE Convert one single precision floating-point value from
CVTSS2SI r32, xmm1/m32 xmm1/m32 to one signed doubleword integer in r32.
F3 REX.W 0F 2D /r A V/N.E. SSE Convert one single precision floating-point value from
CVTSS2SI r64, xmm1/m32 xmm1/m32 to one signed quadword integer in r64.
VEX.LIG.F3.0F.W0 2D /r 1 A V/V AVX Convert one single precision floating-point value from
VCVTSS2SI r32, xmm1/m32 xmm1/m32 to one signed doubleword integer in r32.
VEX.LIG.F3.0F.W1 2D /r 1 A V/N.E.2 AVX Convert one single precision floating-point value from
VCVTSS2SI r64, xmm1/m32 xmm1/m32 to one signed quadword integer in r64.
EVEX.LLIG.F3.0F.W0 2D /r B V/V AVX512F Convert one single precision floating-point value from
VCVTSS2SI r32, xmm1/m32{er} OR AVX10.13 xmm1/m32 to one signed doubleword integer in r32.
EVEX.LLIG.F3.0F.W1 2D /r B V/N.E.2 AVX512F Convert one single precision floating-point value from
VCVTSS2SI r64, xmm1/m32{er} OR AVX10.13 xmm1/m32 to one signed quadword integer in r64.
NOTES:
1. Software should ensure VCVTSS2SI is encoded with VEX.L=0. Encoding VCVTSS2SI with VEX.L=1 may
encounter unpredictable behavior across different processor generations.
2. VEX.W1/EVEX.W1 in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts a single precision floating-point value in the source operand (the second operand) to a signed doubleword
integer (or signed quadword integer if operand size is 64 bits) in the destination operand (the first operand). The
source operand can be an XMM register or a memory location. The destination operand is a general-purpose
register. When the source operand is an XMM register, the single precision floating-point value is contained in the
low doubleword of the register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value
(2^(w-1), where w represents the number of bits in the destination format) is returned.
Legacy SSE instructions: In 64-bit mode, use of the REX.W prefix promotes the instruction to produce 64-bit data.
See the summary chart at the beginning of this section for encoding data and limits.
VEX.W1 and EVEX.W1 versions: promotes the instruction to produce 64-bit data in 64-bit mode.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VCVTSS2SI is encoded with VEX.L=0. Encoding VCVTSS2SI with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
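A small C sketch (illustrative only) using the SSE intrinsic _mm_cvtss_si32, which rounds according to MXCSR.RC rather than truncating:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Under the default round-to-nearest-even mode, 2.5f converts to 2. */
    printf("%d\n", _mm_cvtss_si32(_mm_set_ss(2.5f)));
    return 0;
}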
Operation
VCVTSS2SI (EVEX Encoded Version)
IF (SRC *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-bit Mode and OperandSize = 64
THEN
DEST[63:0] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0]);
ELSE
DEST[31:0] := Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0]);
FI;
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
CVTTPD2DQ—Convert with Truncation Packed Double Precision Floating-Point Values to
Packed Doubleword Integers
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F E6 /r A V/V SSE2 Convert two packed double precision floating-point
CVTTPD2DQ xmm1, xmm2/m128 values in xmm2/mem to two signed doubleword
integers in xmm1 using truncation.
VEX.128.66.0F.WIG E6 /r A V/V AVX Convert two packed double precision floating-point
VCVTTPD2DQ xmm1, xmm2/m128 values in xmm2/mem to two signed doubleword
integers in xmm1 using truncation.
VEX.256.66.0F.WIG E6 /r A V/V AVX Convert four packed double precision floating-point
VCVTTPD2DQ xmm1, ymm2/m256 values in ymm2/mem to four signed doubleword
integers in xmm1 using truncation.
EVEX.128.66.0F.W1 E6 /r B V/V (AVX512VL AND Convert two packed double precision floating-point
VCVTTPD2DQ xmm1 {k1}{z}, AVX512F) OR values in xmm2/m128/m64bcst to two signed
xmm2/m128/m64bcst AVX10.11 doubleword integers in xmm1 using truncation
subject to writemask k1.
EVEX.256.66.0F.W1 E6 /r B V/V (AVX512VL AND Convert four packed double precision floating-point
VCVTTPD2DQ xmm1 {k1}{z}, AVX512F) OR values in ymm2/m256/m64bcst to four signed
ymm2/m256/m64bcst AVX10.11 doubleword integers in xmm1 using truncation
subject to writemask k1.
EVEX.512.66.0F.W1 E6 /r B V/V AVX512F Convert eight packed double precision floating-point
VCVTTPD2DQ ymm1 {k1}{z}, OR AVX10.11 values in zmm2/m512/m64bcst to eight signed
zmm2/m512/m64bcst {sae} doubleword integers in ymm1 using truncation
subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts two, four or eight packed double precision floating-point values in the source operand (second operand)
to two, four or eight packed signed doubleword integers in the destination operand (first operand).
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result is larger than
the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is
masked, the indefinite integer value (80000000H) is returned.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or
a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a
YMM/XMM/XMM (low 64 bits) register conditionally updated with writemask k1. The upper bits (MAXVL-1:256) of
the corresponding destination are zeroed.
VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
zeroed.
VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:64) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
unmodified.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
[Figure: VCVTTPD2DQ (VEX.256 encoded version). The four double precision source elements X3..X0 are converted to
signed doubleword integers packed into the low half of the destination; the upper half of the destination is zeroed.]
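As an illustration of the truncating behavior, the sketch below uses the 128-bit intrinsic _mm_cvttpd_epi32 listed under the intrinsic equivalents further down; illustrative only, assuming an SSE2-capable compiler.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d d = _mm_set_pd(-2.9, 2.9);      /* element 1 = -2.9, element 0 = 2.9 */
    int out[4];
    _mm_storeu_si128((__m128i *)out, _mm_cvttpd_epi32(d));
    printf("%d %d\n", out[0], out[1]);      /* truncation toward zero: prints "2 -2" */
    return 0;
}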
Operation
VCVTTPD2DQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[k+63:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
VCVTTPD2DQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0])
ELSE
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[k+63:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPD2DQ __m256i _mm512_cvttpd_epi32( __m512d a);
VCVTTPD2DQ __m256i _mm512_mask_cvttpd_epi32( __m256i s, __mmask8 k, __m512d a);
VCVTTPD2DQ __m256i _mm512_maskz_cvttpd_epi32( __mmask8 k, __m512d a);
VCVTTPD2DQ __m256i _mm512_cvtt_roundpd_epi32( __m512d a, int sae);
VCVTTPD2DQ __m256i _mm512_mask_cvtt_roundpd_epi32( __m256i s, __mmask8 k, __m512d a, int sae);
VCVTTPD2DQ __m256i _mm512_maskz_cvtt_roundpd_epi32( __mmask8 k, __m512d a, int sae);
VCVTTPD2DQ __m128i _mm256_mask_cvttpd_epi32( __m128i s, __mmask8 k, __m256d a);
VCVTTPD2DQ __m128i _mm256_maskz_cvttpd_epi32( __mmask8 k, __m256d a);
VCVTTPD2DQ __m128i _mm_mask_cvttpd_epi32( __m128i s, __mmask8 k, __m128d a);
VCVTTPD2DQ __m128i _mm_maskz_cvttpd_epi32( __mmask8 k, __m128d a);
VCVTTPD2DQ __m128i _mm256_cvttpd_epi32 (__m256d src);
CVTTPD2DQ __m128i _mm_cvttpd_epi32 (__m128d src);
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
CVTTPS2DQ—Convert With Truncation Packed Single Precision Floating-Point Values to Packed
Signed Doubleword Integer Values
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F3 0F 5B /r A V/V SSE2 Convert four packed single precision floating-point
CVTTPS2DQ xmm1, xmm2/m128 values from xmm2/mem to four packed signed
doubleword values in xmm1 using truncation.
VEX.128.F3.0F.WIG 5B /r A V/V AVX Convert four packed single precision floating-point
VCVTTPS2DQ xmm1, xmm2/m128 values from xmm2/mem to four packed signed
doubleword values in xmm1 using truncation.
VEX.256.F3.0F.WIG 5B /r A V/V AVX Convert eight packed single precision floating-point
VCVTTPS2DQ ymm1, ymm2/m256 values from ymm2/mem to eight packed signed
doubleword values in ymm1 using truncation.
EVEX.128.F3.0F.W0 5B /r B V/V AVX512VL Convert four packed single precision floating-point
VCVTTPS2DQ xmm1 {k1}{z}, AVX512F values from xmm2/m128/m32bcst to four packed
xmm2/m128/m32bcst signed doubleword values in xmm1 using truncation
subject to writemask k1.
EVEX.256.F3.0F.W0 5B /r B V/V AVX512VL Convert eight packed single precision floating-point
VCVTTPS2DQ ymm1 {k1}{z}, AVX512F values from ymm2/m256/m32bcst to eight packed
ymm2/m256/m32bcst signed doubleword values in ymm1 using truncation
subject to writemask k1.
EVEX.512.F3.0F.W0 5B /r B V/V AVX512F Convert sixteen packed single precision floating-point
VCVTTPS2DQ zmm1 {k1}{z}, values from zmm2/m512/m32bcst to sixteen packed
zmm2/m512/m32bcst {sae} signed doubleword values in zmm1 using truncation
subject to writemask k1.
Description
Converts four, eight or sixteen packed single precision floating-point values in the source operand to four, eight or
sixteen signed doubleword integers in the destination operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result is larger than
the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is
masked, the indefinite integer value (80000000H) is returned.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or
a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a
ZMM/YMM/XMM register conditionally updated with writemask k1.
VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination
operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are
zeroed.
VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
zeroed.
128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are
unmodified.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
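A brief usage sketch (illustrative only, assumes SSE2) of the legacy form via _mm_cvttps_epi32, showing the round-toward-zero behavior:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float in[4] = { 1.7f, -1.7f, 0.5f, -0.5f };
    int out[4];
    _mm_storeu_si128((__m128i *)out, _mm_cvttps_epi32(_mm_loadu_ps(in)));
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);   /* prints "1 -1 0 0" */
    return 0;
}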
Operation
VCVTTPS2DQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTTPS2DQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0])
ELSE
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTTPS2DQ (VEX.128 Encoded Version)
DEST[31:0] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0])
DEST[63:32] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[63:32])
DEST[95:64] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[95:64])
DEST[127:96] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[127:96])
DEST[MAXVL-1:128] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
CVTTSD2SI—Convert With Truncation Scalar Double Precision Floating-Point Value to Signed
Integer
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F2 0F 2C /r A V/V SSE2 Convert one double precision floating-point value
CVTTSD2SI r32, xmm1/m64 from xmm1/m64 to one signed doubleword integer in
r32 using truncation.
F2 REX.W 0F 2C /r A V/N.E. SSE2 Convert one double precision floating-point value
CVTTSD2SI r64, xmm1/m64 from xmm1/m64 to one signed quadword integer in
r64 using truncation.
VEX.LIG.F2.0F.W0 2C /r 1 A V/V AVX Convert one double precision floating-point value
VCVTTSD2SI r32, xmm1/m64 from xmm1/m64 to one signed doubleword integer in
r32 using truncation.
VEX.LIG.F2.0F.W1 2C /r 1 B V/N.E.2 AVX Convert one double precision floating-point value
VCVTTSD2SI r64, xmm1/m64 from xmm1/m64 to one signed quadword integer in
r64 using truncation.
EVEX.LLIG.F2.0F.W0 2C /r B V/V AVX512F Convert one double precision floating-point value
VCVTTSD2SI r32, xmm1/m64{sae} OR AVX10.13 from xmm1/m64 to one signed doubleword integer in
r32 using truncation.
EVEX.LLIG.F2.0F.W1 2C /r B V/N.E.2 AVX512F Convert one double precision floating-point value
VCVTTSD2SI r64, xmm1/m64{sae} OR AVX10.13 from xmm1/m64 to one signed quadword integer in
r64 using truncation.
NOTES:
1. Software should ensure VCVTTSD2SI is encoded with VEX.L=0. Encoding VCVTTSD2SI with VEX.L=1 may encounter unpredictable
behavior across different processor generations.
2. For this specific instruction, VEX.W/EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts a double precision floating-point value in the source operand (the second operand) to a signed double-
word integer (or signed quadword integer if operand size is 64 bits) in the destination operand (the first operand).
The source operand can be an XMM register or a 64-bit memory location. The destination operand is a general
purpose register. When the source operand is an XMM register, the double precision floating-point value is
contained in the low quadword of the register.
When a conversion is inexact, a truncated (round toward zero) result is returned.
If a converted result exceeds the range limits of signed doubleword integer (in non-64-bit modes or 64-bit mode
with REX.W/VEX.W/EVEX.W=0), the floating-point invalid exception is raised, and if this exception is masked, the
indefinite integer value (80000000H) is returned.
If a converted result exceeds the range limits of signed quadword integer (in 64-bit mode and
REX.W/VEX.W/EVEX.W = 1), the floating-point invalid exception is raised, and if this exception is masked, the
indefinite integer value (80000000_00000000H) is returned.
Legacy SSE instructions: In 64-bit mode, use of the REX.W prefix promotes the instruction to 64-bit operation. See
the summary chart at the beginning of this section for encoding data and limits.
VEX.W1 and EVEX.W1 versions: promotes the instruction to produce 64-bit data in 64-bit mode.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VCVTTSD2SI is encoded with VEX.L=0. Encoding VCVTTSD2SI with VEX.L=1 may
encounter unpredictable behavior across different processor generations.
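For reference, the truncating scalar conversion is exposed in C as _mm_cvttsd_si32 (and _mm_cvttsd_si64 in 64-bit mode); it is also the conversion compilers typically emit for a C (int) cast of a double, since C requires truncation toward zero. Illustrative sketch:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    printf("%d\n", _mm_cvttsd_si32(_mm_set_sd(-3.99)));   /* truncation toward zero: prints -3 */
    return 0;
}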
Operation
(V)CVTTSD2SI (All Versions)
IF 64-Bit Mode and OperandSize = 64
THEN
DEST[63:0] := Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0]);
ELSE
DEST[31:0] := Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0]);
FI;
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
CVTTSS2SI—Convert With Truncation Scalar Single Precision Floating-Point Value to Signed
Integer
Opcode/ Op / 64/32 bit CPUID Description
Instruction En Mode Feature Flag
Support
F3 0F 2C /r A V/V SSE Convert one single precision floating-point value from
CVTTSS2SI r32, xmm1/m32 xmm1/m32 to one signed doubleword integer in r32
using truncation.
F3 REX.W 0F 2C /r A V/N.E. SSE Convert one single precision floating-point value from
CVTTSS2SI r64, xmm1/m32 xmm1/m32 to one signed quadword integer in r64
using truncation.
VEX.LIG.F3.0F.W0 2C /r 1 A V/V AVX Convert one single precision floating-point value from
VCVTTSS2SI r32, xmm1/m32 xmm1/m32 to one signed doubleword integer in r32
using truncation.
VEX.LIG.F3.0F.W1 2C /r 1 A V/N.E.2 AVX Convert one single precision floating-point value from
VCVTTSS2SI r64, xmm1/m32 xmm1/m32 to one signed quadword integer in r64
using truncation.
EVEX.LLIG.F3.0F.W0 2C /r B V/V AVX512F Convert one single precision floating-point value from
VCVTTSS2SI r32, xmm1/m32{sae} OR AVX10.13 xmm1/m32 to one signed doubleword integer in r32
using truncation.
EVEX.LLIG.F3.0F.W1 2C /r B V/N.E.2 AVX512F Convert one single precision floating-point value from
VCVTTSS2SI r64, xmm1/m32{sae} OR AVX10.13 xmm1/m32 to one signed quadword integer in r64
using truncation.
NOTES:
1. Software should ensure VCVTTSS2SI is encoded with VEX.L=0. Encoding VCVTTSS2SI with VEX.L=1 may encounter unpredictable
behavior across different processor generations.
2. For this specific instruction, VEX.W/EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts a single precision floating-point value in the source operand (the second operand) to a signed doubleword
integer (or signed quadword integer if operand size is 64 bits) in the destination operand (the first operand). The
source operand can be an XMM register or a 32-bit memory location. The destination operand is a general purpose
register. When the source operand is an XMM register, the single precision floating-point value is contained in the
low doubleword of the register.
When a conversion is inexact, a truncated (round toward zero) result is returned. If a converted result is larger
than the maximum signed doubleword integer, the floating-point invalid exception is raised. If this exception is
masked, the indefinite integer value (80000000H or 80000000_00000000H if operand size is 64 bits) is returned.
Legacy SSE instructions: In 64-bit mode, use of the REX.W prefix promotes the instruction to 64-bit operation. See
the summary chart at the beginning of this section for encoding data and limits.
VEX.W1 and EVEX.W1 versions: promotes the instruction to produce 64-bit data in 64-bit mode.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VCVTTSS2SI is encoded with VEX.L=0. Encoding VCVTTSS2SI with VEX.L=1 may
encounter unpredictable behavior across different processor generations.
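A corresponding sketch for the single precision form via _mm_cvttss_si32 (illustrative only); an out-of-range input would instead produce the masked-exception indefinite value described above.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    printf("%d\n", _mm_cvttss_si32(_mm_set_ss(7.9f)));   /* truncation toward zero: prints 7 */
    return 0;
}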
Operation
(V)CVTTSS2SI (All Versions)
IF 64-Bit Mode and OperandSize = 64
THEN
DEST[63:0] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0]);
ELSE
DEST[31:0] := Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0]);
FI;
Other Exceptions
See Table 2-20, “Type 3 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
DIVPD—Divide Packed Double Precision Floating-Point Values
Opcode/ Op / 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
66 0F 5E /r A V/V SSE2 Divide packed double precision floating-point
DIVPD xmm1, xmm2/m128 values in xmm1 by packed double precision
floating-point values in xmm2/mem.
VEX.128.66.0F.WIG 5E /r B V/V AVX Divide packed double precision floating-point
VDIVPD xmm1, xmm2, xmm3/m128 values in xmm2 by packed double precision
floating-point values in xmm3/mem.
VEX.256.66.0F.WIG 5E /r B V/V AVX Divide packed double precision floating-point
VDIVPD ymm1, ymm2, ymm3/m256 values in ymm2 by packed double precision
floating-point values in ymm3/mem.
EVEX.128.66.0F.W1 5E /r C V/V (AVX512VL AND Divide packed double precision floating-point
VDIVPD xmm1 {k1}{z}, xmm2, AVX512F) OR values in xmm2 by packed double precision
xmm3/m128/m64bcst AVX10.11 floating-point values in xmm3/m128/m64bcst and
write results to xmm1 subject to writemask k1.
EVEX.256.66.0F.W1 5E /r C V/V (AVX512VL AND Divide packed double precision floating-point
VDIVPD ymm1 {k1}{z}, ymm2, AVX512F) OR values in ymm2 by packed double precision
ymm3/m256/m64bcst AVX10.11 floating-point values in ymm3/m256/m64bcst and
write results to ymm1 subject to writemask k1.
EVEX.512.66.0F.W1 5E /r C V/V AVX512F Divide packed double precision floating-point
VDIVPD zmm1 {k1}{z}, zmm2, OR AVX10.11 values in zmm2 by packed double precision
zmm3/m512/m64bcst{er} floating-point values in zmm3/m512/m64bcst and
write results to zmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD divide of the double precision floating-point values in the first source operand by the floating-
point values in the second source operand (the third operand). Results are written to the destination operand (the
first operand).
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand (the second operand) is a YMM register. The second source
operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. The upper
bits (MAXVL-1:256) of the corresponding destination are zeroed.
VEX.128 encoded version: The first source operand (the second operand) is an XMM register. The second source
operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper
bits (MAXVL-1:128) of the corresponding destination are zeroed.
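The packed divide is available in C through _mm_div_pd, _mm256_div_pd, and _mm512_div_pd (and their masked variants); the sketch below shows the VEX.256 form and is illustrative only, assuming an AVX-capable target.

#include <immintrin.h>

/* VDIVPD ymm, ymm, ymm/m256: element-wise a[i] / b[i] for four doubles. */
__m256d divide4(__m256d a, __m256d b) {
    return _mm256_div_pd(a, b);
}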
Operation
VDIVPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC); ; refer to Table 15-4 in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 1
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := SRC1[i+63:i] / SRC2[63:0]
ELSE
DEST[i+63:i] := SRC1[i+63:i] / SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD divide of the four, eight or sixteen packed single precision floating-point values in the first source
operand (the second operand) by the four, eight or sixteen packed single precision floating-point values in the
second source operand (the third operand). Results are written to the destination operand (the first operand).
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
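A sketch of the EVEX masked form through the AVX-512F intrinsic _mm512_mask_div_ps (illustrative only; assumes an AVX-512F target): lanes whose writemask bit is clear are taken from src, which corresponds to merging-masking.

#include <immintrin.h>

/* VDIVPS zmm1 {k1}, zmm2, zmm3: masked element-wise divide of sixteen singles. */
__m512 masked_div16(__m512 src, __mmask16 k, __m512 a, __m512 b) {
    return _mm512_mask_div_ps(src, k, a, b);
}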
Operation
VDIVPS (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1) AND SRC2 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := SRC1[i+31:i] / SRC2[31:0]
ELSE
DEST[i+31:i] := SRC1[i+31:i] / SRC2[i+31:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Divides the low double precision floating-point value in the first source operand by the low double precision
floating-point value in the second source operand, and stores the double precision floating-point result in the desti-
nation operand. The second source operand can be an XMM register or a 64-bit memory location. The first source
and destination are XMM registers.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-
1:64) of the corresponding ZMM destination register remain unchanged.
VEX.128 encoded version: The first source operand is an xmm register encoded by VEX.vvvv. The quadword at bits
127:64 of the destination operand is copied from the corresponding quadword of the first source operand. Bits
(MAXVL-1:128) of the destination register are zeroed.
EVEX.128 encoded version: The first source operand is an xmm register encoded by EVEX.vvvv. The quadword
element of the destination operand at bits 127:64 are copied from the first source operand. Bits (MAXVL-1:128) of
the destination register are zeroed.
EVEX version: The low quadword element of the destination is updated according to the writemask.
Software should ensure VDIVSD is encoded with VEX.L=0. Encoding VDIVSD with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.
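For reference, a minimal sketch of the scalar form via the SSE2 intrinsic _mm_div_sd (illustrative only):

#include <immintrin.h>

/* DIVSD: low quadword := a[63:0] / b[63:0]; bits 127:64 of the result come from a,
   matching the merge behavior described above. */
__m128d div_low_double(__m128d a, __m128d b) {
    return _mm_div_sd(a, b);
}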
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Divides the low single precision floating-point value in the first source operand by the low single precision floating-
point value in the second source operand, and stores the single precision floating-point result in the destination
operand. The second source operand can be an XMM register or a 32-bit memory location.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-
1:32) of the corresponding YMM destination register remain unchanged.
VEX.128 encoded version: The first source operand is an xmm register encoded by VEX.vvvv. The three high-order
doublewords of the destination operand are copied from the first source operand. Bits (MAXVL-1:128) of the desti-
nation register are zeroed.
EVEX.128 encoded version: The first source operand is an xmm register encoded by EVEX.vvvv. The doubleword
elements of the destination operand at bits 127:32 are copied from the first source operand. Bits (MAXVL-1:128)
of the destination register are zeroed.
EVEX version: The low doubleword element of the destination is updated according to the writemask.
Software should ensure VDIVSS is encoded with VEX.L=0. Encoding VDIVSS with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.
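And the analogous single precision sketch via _mm_div_ss (illustrative only):

#include <immintrin.h>

/* DIVSS: low dword := a[31:0] / b[31:0]; bits 127:32 of the result come from a. */
__m128 div_low_float(__m128 a, __m128 b) {
    return _mm_div_ss(a, b);
}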
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Extracts a single precision floating-point value from the source operand (second operand) at the 32-bit offset spec-
ified from imm8. Immediate bits higher than the most significant offset for the vector length are ignored.
The extracted single precision floating-point value is stored in the low 32 bits of the destination operand.
In 64-bit mode, the destination register operand has a default operand size of 64 bits. The upper 32 bits of the register
are filled with zero. REX.W is ignored.
VEX.128 and EVEX encoded version: When VEX.W1 or EVEX.W1 form is used in 64-bit mode with a general
purpose register (GPR) as a destination operand, the packed single quantity is zero extended to 64 bits.
VEX.vvvv/EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
128-bit Legacy SSE version: When a REX.W prefix is used in 64-bit mode with a general purpose register (GPR) as
a destination operand, the packed single quantity is zero extended to 64 bits.
The source register is an XMM register. Imm8[1:0] determine the starting DWORD offset from which to extract the
32-bit floating-point value.
An attempt to execute VEXTRACTPS encoded with VEX.L=1 will cause an #UD exception.
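A short sketch using the SSE4.1 intrinsic _mm_extract_ps (illustrative only): the intrinsic returns the 32-bit IEEE encoding of the selected element, which can be reinterpreted as a float.

#include <immintrin.h>
#include <string.h>

float extract_element2(__m128 v) {
    int bits = _mm_extract_ps(v, 2);   /* imm8[1:0] = 2 selects the third doubleword */
    float f;
    memcpy(&f, &bits, sizeof f);       /* reinterpret the bit pattern as a float */
    return f;
}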
Other Exceptions
VEX-encoded instructions, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-59, “Type E9NF Class Exception Conditions.”
Additionally:
#UD If VEX.L = 1.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
The AFFINEINVB instruction computes an affine transformation in the Galois Field 2^8. For this instruction, an affine
transformation is defined by A * inv(x) + b where “A” is an 8 by 8 bit matrix, and “x” and “b” are 8-bit vectors. The
inverse of the bytes in x is defined with respect to the reduction polynomial x^8 + x^4 + x^3 + x + 1.
One SIMD register (operand 1) holds “x” as either 16, 32 or 64 8-bit vectors. A second SIMD (operand 2) register
or memory operand contains 2, 4, or 8 “A” values, which are operated upon by the correspondingly aligned 8 “x”
values in the first register. The “b” vector is constant for all calculations and contained in the immediate byte.
The EVEX encoded form of this instruction does not support memory fault suppression. The SSE encoded forms of
the instruction require 16B alignment on their memory operations.
The inverse of each byte is given by the following table. The upper nibble is on the vertical axis and the lower nibble
is on the horizontal axis. For example, the inverse of 0x95 is 0x8A.
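As an illustration (assuming a GFNI-enabled compiler and target), passing the bit-identity matrix 0x0102040810204080 with b = 0 to the corresponding intrinsic reduces the affine-inverse transform to the plain GF(2^8) byte inverse described above:

#include <immintrin.h>

__m128i gf_inverse_bytes(__m128i x) {
    /* Each byte of the matrix operand holds one row of A; this constant encodes the identity matrix. */
    const __m128i identity = _mm_set1_epi64x(0x0102040810204080LL);
    return _mm_gf2p8affineinv_epi64_epi8(x, identity, 0);   /* byte[i] := inv(x.byte[i]) */
}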
Operation
define affine_inverse_byte(tsrc2qw, src1byte, imm):
FOR i := 0 to 7:
* parity(x) = 1 if x has an odd number of 1s in it, and 0 otherwise.*
* inverse(x) is defined in the table above *
retbyte.bit[i] := parity(tsrc2qw.byte[7-i] AND inverse(src1byte)) XOR imm8.bit[i]
return retbyte
FOR b := 0 to 7:
IF k1[j*8+b] OR *no writemask*:
FOR i := 0 to 7:
DEST.qword[j].byte[b] := affine_inverse_byte(tsrc2, SRC1.qword[j].byte[b], imm8)
ELSE IF *zeroing*:
DEST.qword[j].byte[b] := 0
*ELSE DEST.qword[j].byte[b] remains unchanged*
DEST[MAX_VL-1:VL] := 0
Other Exceptions
Legacy-encoded and VEX-encoded: See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
The AFFINEB instruction computes an affine transformation in the Galois Field 2^8. For this instruction, an affine
transformation is defined by A * x + b where “A” is an 8 by 8 bit matrix, and “x” and “b” are 8-bit vectors. One SIMD
register (operand 1) holds “x” as either 16, 32 or 64 8-bit vectors. A second SIMD (operand 2) register or memory
operand contains 2, 4, or 8 “A” values, which are operated upon by the correspondingly aligned 8 “x” values in the
first register. The “b” vector is constant for all calculations and contained in the immediate byte.
The EVEX encoded form of this instruction does not support memory fault suppression. The SSE encoded forms of
the instruction require 16B alignment on their memory operations.
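A minimal sketch of the forward affine transform through the GFNI intrinsic _mm_gf2p8affine_epi64_epi8 (illustrative only; the constant vector b = 0x1F used here is an arbitrary example value):

#include <immintrin.h>

__m128i affine_bytes(__m128i x, __m128i A) {
    /* byte[i] := (A applied to x.byte[i]) XOR 0x1F, with A taken from the matching qword of the second operand. */
    return _mm_gf2p8affine_epi64_epi8(x, A, 0x1F);
}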
Operation
define parity(x):
t := 0 // single bit
FOR i := 0 to 7:
t = t xor x.bit[i]
return t
FOR b := 0 to 7:
IF k1[j*8+b] OR *no writemask*:
DEST.qword[j].byte[b] := affine_byte(tsrc2, SRC1.qword[j].byte[b], imm8)
ELSE IF *zeroing*:
DEST.qword[j].byte[b] := 0
*ELSE DEST.qword[j].byte[b] remains unchanged*
DEST[MAX_VL-1:VL] := 0
VGF2P8AFFINEQB dest, src1, src2, imm8 (128b and 256b VEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256)
FOR j := 0 TO KL-1:
FOR b := 0 to 7:
DEST.qword[j].byte[b] := affine_byte(SRC2.qword[j], SRC1.qword[j].byte[b], imm8)
DEST[MAX_VL-1:VL] := 0
Other Exceptions
Legacy-encoded and VEX-encoded: See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
The instruction multiplies elements in the finite field GF(2^8), operating on a byte (field element) in the first source
operand and the corresponding byte in a second source operand. The field GF(2^8) is represented in polynomial
representation with the reduction polynomial x^8 + x^4 + x^3 + x + 1.
This instruction does not support broadcasting.
The EVEX encoded form of this instruction supports memory fault suppression. The SSE encoded forms of the
instruction require 16B alignment on their memory operations.
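A corresponding sketch for the field multiply via _mm_gf2p8mul_epi8 (illustrative only; assumes GFNI support). The reduction polynomial x^8 + x^4 + x^3 + x + 1 is the same one used by AES.

#include <immintrin.h>

__m128i gf_mul_bytes(__m128i a, __m128i b) {
    return _mm_gf2p8mul_epi8(a, b);   /* byte-wise multiplication in GF(2^8) */
}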
VGF2P8MULB dest, src1, src2 (128b and 256b VEX Encoded Versions)
(KL, VL) = (16, 128), (32, 256)
FOR j := 0 TO KL-1:
DEST.byte[j] := gf2p8mul_byte(SRC1.byte[j], SRC2.byte[j])
DEST[MAX_VL-1:VL] := 0
Other Exceptions
Legacy-encoded and VEX-encoded: See Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded: See Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
(register source form)
Copy a single precision scalar floating-point element into a 128-bit vector register. The immediate operand has
three fields, where the ZMask bits specify which elements of the destination will be set to zero, the Count_D bits
specify which element of the destination will be overwritten with the scalar value, and for vector register sources
the Count_S bits specify which element of the source will be copied. When the scalar source is a memory operand
the Count_S bits are ignored.
(memory source form)
Load a floating-point element from a 32-bit memory location and insert it into the first source operand at the
location indicated by the Count_D bits of the immediate operand. Store in the destination and zero out destination
elements based on the ZMask bits of the immediate operand.
128-bit Legacy SSE version: The first source register is an XMM register. The second source operand is either an
XMM register or a 32-bit memory location. The destination is not distinct from the first source XMM register and the
upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
VEX.128 and EVEX encoded version: The destination and first source register is an XMM register. The second
source operand is either an XMM register or a 32-bit memory location. The upper bits (MAXVL-1:128) of the corre-
sponding register destination are zeroed.
An attempt to execute VINSERTPS encoded with VEX.L=1 will cause an #UD exception.
CASE (COUNT_D) OF
0: TMP2[31:0] := TMP
TMP2[127:32] := DEST[127:32]
1: TMP2[63:32] := TMP
TMP2[31:0] := DEST[31:0]
TMP2[127:64] := DEST[127:64]
2: TMP2[95:64] := TMP
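A usage sketch via the SSE4.1 intrinsic _mm_insert_ps (illustrative only), with the immediate assembled from the fields described above: imm8[7:6] = Count_S, imm8[5:4] = Count_D, imm8[3:0] = ZMask.

#include <immintrin.h>

__m128 insert_b1_into_a2(__m128 a, __m128 b) {
    /* Copy element 1 of b into element 2 of a; ZMask = 0, so no destination elements are zeroed. */
    return _mm_insert_ps(a, b, (1 << 6) | (2 << 4));
}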
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”
Description
Adds the vector mask k2 and the vector mask k3, and writes the result into vector mask k1.
Operation
KADDW
DEST[15:0] := SRC1[15:0] + SRC2[15:0]
DEST[MAX_KL-1:16] := 0
KADDB
DEST[7:0] := SRC1[7:0] + SRC2[7:0]
DEST[MAX_KL-1:8] := 0
KADDQ
DEST[63:0] := SRC1[63:0] + SRC2[63:0]
DEST[MAX_KL-1:64] := 0
KADDD
DEST[31:0] := SRC1[31:0] + SRC2[31:0]
DEST[MAX_KL-1:32] := 0
Flags Affected
None.
Description
Performs a bitwise AND NOT between the vector mask k2 and the vector mask k3, and writes the result into vector
mask k1.
Operation
KANDNW
DEST[15:0] := (BITWISE NOT SRC1[15:0]) BITWISE AND SRC2[15:0]
DEST[MAX_KL-1:16] := 0
KANDNB
DEST[7:0] := (BITWISE NOT SRC1[7:0]) BITWISE AND SRC2[7:0]
DEST[MAX_KL-1:8] := 0
KANDNQ
DEST[63:0] := (BITWISE NOT SRC1[63:0]) BITWISE AND SRC2[63:0]
DEST[MAX_KL-1:64] := 0
KANDND
DEST[31:0] := (BITWISE NOT SRC1[31:0]) BITWISE AND SRC2[31:0]
DEST[MAX_KL-1:32] := 0
Flags Affected
None.
Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”
Description
Performs a bitwise AND between the vector mask k2 and the vector mask k3, and writes the result into vector mask
k1.
Operation
KANDW
DEST[15:0] := SRC1[15:0] BITWISE AND SRC2[15:0]
DEST[MAX_KL-1:16] := 0
KANDB
DEST[7:0] := SRC1[7:0] BITWISE AND SRC2[7:0]
DEST[MAX_KL-1:8] := 0
KANDQ
DEST[63:0] := SRC1[63:0] BITWISE AND SRC2[63:0]
DEST[MAX_KL-1:64] := 0
KANDD
DEST[31:0] := SRC1[31:0] BITWISE AND SRC2[31:0]
DEST[MAX_KL-1:32] := 0
Flags Affected
None.
Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”
Description
Copies values from the source operand (second operand) to the destination operand (first operand). The source
and destination operands can be mask registers, memory locations, or general-purpose registers. The instruction
cannot be used to transfer data directly between general-purpose registers and memory locations; one operand
must always be a mask register.
Operation
KMOVW
IF *destination is a memory location*
DEST[15:0] := SRC[15:0]
IF *destination is a mask register or a GPR *
DEST := ZeroExtension(SRC[15:0])
KMOVB
IF *destination is a memory location*
DEST[7:0] := SRC[7:0]
IF *destination is a mask register or a GPR *
DEST := ZeroExtension(SRC[7:0])
KMOVQ
IF *destination is a memory location or a GPR*
DEST[63:0] := SRC[63:0]
IF *destination is a mask register*
DEST := ZeroExtension(SRC[63:0])
KMOVD
IF *destination is a memory location*
DEST[31:0] := SRC[31:0]
IF *destination is a mask register or a GPR *
DEST := ZeroExtension(SRC[31:0])
Flags Affected
None.
Other Exceptions
Instructions with RR operand encoding, see Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask
Instructions w/o Memory Arg).”
Instructions with RM or MR operand encoding, see Table 2-66, “TYPE K21 Exception Definition (VEX-Encoded
OpMask Instructions Addressing Memory).”
Description
Performs a bitwise NOT of vector mask k2 and writes the result into vector mask k1.
Operation
KNOTW
DEST[15:0] := BITWISE NOT SRC[15:0]
DEST[MAX_KL-1:16] := 0
KNOTB
DEST[7:0] := BITWISE NOT SRC[7:0]
DEST[MAX_KL-1:8] := 0
KNOTQ
DEST[63:0] := BITWISE NOT SRC[63:0]
DEST[MAX_KL-1:64] := 0
KNOTD
DEST[31:0] := BITWISE NOT SRC[31:0]
DEST[MAX_KL-1:32] := 0
Flags Affected
None.
Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”
Description
Performs a bitwise OR between the vector mask register k2, and the vector mask register k1, and sets CF and ZF
based on the operation result.
ZF flag is set if both sources are 0x0. CF is set if, after the OR operation is done, the operation result is all 1’s.
Operation
KORTESTW
TMP[15:0] := DEST[15:0] BITWISE OR SRC[15:0]
IF(TMP[15:0]=0)
THEN ZF := 1
ELSE ZF := 0
FI;
IF(TMP[15:0]=FFFFh)
THEN CF := 1
ELSE CF := 0
FI;
KORTESTB
TMP[7:0] := DEST[7:0] BITWISE OR SRC[7:0]
IF(TMP[7:0]=0)
THEN ZF := 1
ELSE ZF := 0
FI;
IF(TMP[7:0]=FFh)
THEN CF := 1
ELSE CF := 0
FI;
KORTESTD
TMP[31:0] := DEST[31:0] BITWISE OR SRC[31:0]
IF(TMP[31:0]=0)
THEN ZF := 1
ELSE ZF := 0
FI;
IF(TMP[31:0]=FFFFFFFFh)
THEN CF := 1
ELSE CF := 0
FI;
Flags Affected
The ZF flag is set if the result of OR-ing both sources is all 0s.
The CF flag is set if the result of OR-ing both sources is all 1s.
The OF, SF, AF, and PF flags are set to 0.
Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”
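The flag behavior above can be summarized with a small C model of KORTESTW (illustrative only, not SDM text):

#include <stdint.h>
#include <stdio.h>

static void kortestw_model(uint16_t k1, uint16_t k2, int *zf, int *cf)
{
    uint16_t tmp = (uint16_t)(k1 | k2);
    *zf = (tmp == 0x0000);   /* ZF: OR result is all 0s */
    *cf = (tmp == 0xFFFF);   /* CF: OR result is all 1s */
}

int main(void)
{
    int zf, cf;
    kortestw_model(0x00FF, 0xFF00, &zf, &cf);
    printf("ZF=%d CF=%d\n", zf, cf);   /* prints ZF=0 CF=1 */
    return 0;
}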
Description
Performs a bitwise OR between the vector mask k2 and the vector mask k3, and writes the result into vector mask
k1 (three-operand form).
Operation
KORW
DEST[15:0] := SRC1[15:0] BITWISE OR SRC2[15:0]
DEST[MAX_KL-1:16] := 0
KORB
DEST[7:0] := SRC1[7:0] BITWISE OR SRC2[7:0]
DEST[MAX_KL-1:8] := 0
KORQ
DEST[63:0] := SRC1[63:0] BITWISE OR SRC2[63:0]
DEST[MAX_KL-1:64] := 0
KORD
DEST[31:0] := SRC1[31:0] BITWISE OR SRC2[31:0]
DEST[MAX_KL-1:32] := 0
Flags Affected
None.
Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”
Description
Shifts 8/16/32/64 bits in the second operand (source operand) left by the count specified in the immediate byte and
places the least significant 8/16/32/64 bits of the result in the destination operand. The higher bits of the destina-
tion are zero-extended. The destination is set to zero if the count value is greater than 7 (for byte shift), 15 (for
word shift), 31 (for doubleword shift) or 63 (for quadword shift).
Operation
KSHIFTLW
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=15
THEN DEST[15:0] := SRC1[15:0] << COUNT;
FI;
KSHIFTLB
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=7
THEN DEST[7:0] := SRC1[7:0] << COUNT;
FI;
KSHIFTLQ
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=63
THEN DEST[63:0] := SRC1[63:0] << COUNT;
FI;
Flags Affected
None.
Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”
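The handling of out-of-range counts can be seen in this small C model of KSHIFTLW (illustrative sketch, not SDM text):

#include <stdint.h>
#include <stdio.h>

static uint64_t kshiftlw_model(uint16_t k, uint8_t imm8)
{
    /* counts above 15 produce an all-zero destination rather than a modulo shift */
    return (imm8 <= 15) ? (uint64_t)(uint16_t)(k << imm8) : 0;
}

int main(void)
{
    printf("%#x %#x\n",
           (unsigned)kshiftlw_model(0x0001, 4),     /* 0x10 */
           (unsigned)kshiftlw_model(0x0001, 16));   /* 0 (count exceeds 15) */
    return 0;
}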
Description
Shifts 8/16/32/64 bits in the second operand (source operand) right by the count specified in the immediate byte and places
the least significant 8/16/32/64 bits of the result in the destination operand. The higher bits of the destination are
zero-extended. The destination is set to zero if the count value is greater than 7 (for byte shift), 15 (for word shift),
31 (for doubleword shift) or 63 (for quadword shift).
Operation
KSHIFTRW
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=15
THEN DEST[15:0] := SRC1[15:0] >> COUNT;
FI;
KSHIFTRB
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=7
THEN DEST[7:0] := SRC1[7:0] >> COUNT;
FI;
KSHIFTRQ
COUNT := imm8[7:0]
DEST[MAX_KL-1:0] := 0
IF COUNT <=63
THEN DEST[63:0] := SRC1[63:0] >> COUNT;
FI;
Flags Affected
None.
Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”
Description
Performs a bitwise comparison of the bits of the first source operand and corresponding bits in the second source
operand. If the AND operation produces all zeros, the ZF is set else the ZF is clear. If the bitwise AND operation of
the inverted first source operand with the second source operand produces all zeros the CF is set else the CF is
clear. Only the EFLAGS register is updated.
Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
KTESTW
TEMP[15:0] := SRC2[15:0] AND SRC1[15:0]
IF (TEMP[15:0] = = 0)
THEN ZF :=1;
ELSE ZF := 0;
FI;
TEMP[15:0] := SRC2[15:0] AND NOT SRC1[15:0]
IF (TEMP[15:0] = = 0)
THEN CF :=1;
ELSE CF := 0;
FI;
AF := OF := PF := SF := 0;
KTESTB
TEMP[7:0] := SRC2[7:0] AND SRC1[7:0]
IF (TEMP[7:0] = = 0)
THEN ZF :=1;
ELSE ZF := 0;
FI;
TEMP[7:0] := SRC2[7:0] AND NOT SRC1[7:0]
IF (TEMP[7:0] = = 0)
THEN CF :=1;
ELSE CF := 0;
FI;
AF := OF := PF := SF := 0;
KTESTD
TEMP[31:0] := SRC2[31:0] AND SRC1[31:0]
IF (TEMP[31:0] = = 0)
THEN ZF :=1;
ELSE ZF := 0;
FI;
TEMP[31:0] := SRC2[31:0] AND NOT SRC1[31:0]
IF (TEMP[31:0] = = 0)
THEN CF :=1;
ELSE CF := 0;
FI;
AF := OF := PF := SF := 0;
Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”
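A compact C model of the KTESTW flag results (illustrative only, not SDM text):

#include <stdint.h>
#include <stdio.h>

static void ktestw_model(uint16_t src1, uint16_t src2, int *zf, int *cf)
{
    *zf = ((uint16_t)(src1 & src2) == 0);    /* ZF: SRC1 AND SRC2 is all zeros */
    *cf = ((uint16_t)(~src1 & src2) == 0);   /* CF: (NOT SRC1) AND SRC2 is all zeros */
}

int main(void)
{
    int zf, cf;
    ktestw_model(0xFFFF, 0x00FF, &zf, &cf);
    printf("ZF=%d CF=%d\n", zf, cf);   /* prints ZF=0 CF=1 */
    return 0;
}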
Description
Unpacks the lower 8/16/32 bits of the second and third operands (source operands) into the low part of the first
operand (destination operand), starting from the low bytes. The result is zero-extended in the destination.
Operation
KUNPCKBW
DEST[7:0] := SRC2[7:0]
DEST[15:8] := SRC1[7:0]
DEST[MAX_KL-1:16] := 0
KUNPCKWD
DEST[15:0] := SRC2[15:0]
DEST[31:16] := SRC1[15:0]
DEST[MAX_KL-1:32] := 0
KUNPCKDQ
DEST[31:0] := SRC2[31:0]
DEST[63:32] := SRC1[31:0]
DEST[MAX_KL-1:64] := 0
Flags Affected
None.
Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”
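The byte placement performed by KUNPCKBW is easy to see in a small C model (illustrative sketch, not SDM text):

#include <stdint.h>
#include <stdio.h>

static uint64_t kunpckbw_model(uint8_t src1, uint8_t src2)
{
    /* SRC2 low byte to bits 7:0, SRC1 low byte to bits 15:8, remaining bits zeroed */
    return (uint64_t)(((uint16_t)src1 << 8) | src2);
}

int main(void)
{
    printf("%#x\n", (unsigned)kunpckbw_model(0xAB, 0xCD));   /* prints 0xabcd */
    return 0;
}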
Description
Performs a bitwise XNOR between the vector mask k2 and the vector mask k3, and writes the result into vector
mask k1 (three-operand form).
Operation
KXNORW
DEST[15:0] := NOT (SRC1[15:0] BITWISE XOR SRC2[15:0])
DEST[MAX_KL-1:16] := 0
KXNORB
DEST[7:0] := NOT (SRC1[7:0] BITWISE XOR SRC2[7:0])
DEST[MAX_KL-1:8] := 0
KXNORQ
DEST[63:0] := NOT (SRC1[63:0] BITWISE XOR SRC2[63:0])
DEST[MAX_KL-1:64] := 0
KXNORD
DEST[31:0] := NOT (SRC1[31:0] BITWISE XOR SRC2[31:0])
DEST[MAX_KL-1:32] := 0
Flags Affected
None.
Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”
Description
Performs a bitwise XOR between the vector mask k2 and the vector mask k3, and writes the result into vector mask
k1 (three-operand form).
Operation
KXORW
DEST[15:0] := SRC1[15:0] BITWISE XOR SRC2[15:0]
DEST[MAX_KL-1:16] := 0
KXORB
DEST[7:0] := SRC1[7:0] BITWISE XOR SRC2[7:0]
DEST[MAX_KL-1:8] := 0
KXORQ
DEST[63:0] := SRC1[63:0] BITWISE XOR SRC2[63:0]
DEST[MAX_KL-1:64] := 0
KXORD
DEST[31:0] := SRC1[31:0] BITWISE XOR SRC2[31:0]
DEST[MAX_KL-1:32] := 0
Flags Affected
None.
Other Exceptions
See Table 2-65, “TYPE K20 Exception Definition (VEX-Encoded OpMask Instructions w/o Memory Arg).”
------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated the PREFETCHh instruction to add details for using the temporal code hints.
• Added the RDMSRLIST instruction.
• Updated the RDPMC instruction to add details on RDPMC Metrics Clear.
• Added the TDPFP16PS instruction.
• Updated the UIRET instruction to add the pseudocode describing software control of the value of the user-
interrupt flag (UIF) established by UIRET.
• Added Intel® AVX10.1 information to the following instructions:
— MAXPD
— MAXPS
— MAXSD
— MAXSS
— MINPD
— MINPS
— MINSD
— MINSS
— MOVAPD
— MOVAPS
— MOVD/MOVQ
— MOVDDUP
— MOVDQA,VMOVDQA32/64
— MOVDQU,VMOVDQU8/16/32/64
— MOVHLPS
— MOVHPD
— MOVHPS
— MOVLHPS
— MOVLPD
— MOVLPS
— MOVNTDQ
— MOVNTDQA
— MOVNTPD
— MOVNTPS
— MOVQ
— MOVSD
— MOVSHDUP
— MOVSLDUP
— MOVSS
— MOVUPD
— MOVUPS
— MULPD
— MULPS
— MULSD
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed double precision floating-point values in the first source operand and the
second source operand and returns the maximum value for each pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source
operand (from either the first or second operand) be returned, the action of MAXPD can be emulated using a
sequence of instructions, such as a comparison followed by AND, ANDN, and OR.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
Operation
MAX(SRC1, SRC2)
{
IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
ELSE IF (SRC1 = NaN) THEN DEST := SRC2; FI;
ELSE IF (SRC2 = NaN) THEN DEST := SRC2; FI;
ELSE IF (SRC1 > SRC2) THEN DEST := SRC1;
ELSE DEST := SRC2;
FI;
}
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”
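The second-source-wins behavior for NaNs and signed zeros described above can be observed through the SSE2 intrinsic _mm_max_pd; the sketch below is illustrative only, not SDM text.

#include <emmintrin.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_pd(NAN, -0.0);   /* first source:  element 1 = NaN, element 0 = -0.0 */
    __m128d b = _mm_set_pd(2.0,  0.0);   /* second source: element 1 = 2.0, element 0 = +0.0 */
    double r[2];
    _mm_storeu_pd(r, _mm_max_pd(a, b));
    printf("%g %g\n", r[0], r[1]);       /* prints: 0 2 (the second source is returned in both cases) */
    return 0;
}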
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed single precision floating-point values in the first source operand and the
second source operand and returns the maximum value for each pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source
operand (from either the first or second operand) be returned, the action of MAXPS can be emulated using a
sequence of instructions, such as, a comparison followed by AND, ANDN, and OR.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.
Operation
MAX(SRC1, SRC2)
{
IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
ELSE IF (SRC1 = NaN) THEN DEST := SRC2; FI;
ELSE IF (SRC2 = NaN) THEN DEST := SRC2; FI;
ELSE IF (SRC1 > SRC2) THEN DEST := SRC1;
ELSE DEST := SRC2;
FI;
}
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compares the low double precision floating-point values in the first source operand and the second source
operand, and returns the maximum value to the low quadword of the destination operand. The second source
operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM
registers. When the second source operand is a memory operand, only 64 bits are accessed.
If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If
a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a
QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid
floating-point value, is written to the result. If instead of this behavior, it is required that the NaN of either source
operand be returned, the action of MAXSD can be emulated using a sequence of instructions, such as, a compar-
ison followed by AND, ANDN, and OR.
128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:64) of the
corresponding destination register remain unchanged.
VEX.128 and EVEX encoded version: Bits (127:64) of the XMM register destination are copied from corresponding
bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination operand is updated according to the write-
mask.
Software should ensure VMAXSD is encoded with VEX.L=0. Encoding VMAXSD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”
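The emulation hint above (a comparison followed by AND, ANDN, and OR) can be sketched with SSE2 intrinsics. The helper below is an illustrative assumption, not SDM text; it returns a NaN whenever either source is a NaN.

#include <emmintrin.h>
#include <math.h>
#include <stdio.h>

static double maxsd_nan_from_either(__m128d a, __m128d b)
{
    __m128d a_nan = _mm_cmpunord_sd(a, a);               /* low qword all 1s if a is NaN */
    __m128d m     = _mm_max_sd(a, b);                    /* already returns b when b is NaN */
    __m128d r     = _mm_or_pd(_mm_and_pd(a_nan, a),
                              _mm_andnot_pd(a_nan, m));  /* select a when a is NaN */
    return _mm_cvtsd_f64(r);                             /* only the low element is meaningful */
}

int main(void)
{
    printf("%g\n", maxsd_nan_from_either(_mm_set_sd(NAN), _mm_set_sd(3.0)));   /* prints nan */
    return 0;
}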
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compares the low single precision floating-point values in the first source operand and the second source operand,
and returns the maximum value to the low doubleword of the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If
a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a
QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid
floating-point value, is written to the result. If instead of this behavior, it is required that the NaN from either source
operand be returned, the action of MAXSS can be emulated using a sequence of instructions, such as, a comparison
followed by AND, ANDN, and OR.
The second source operand can be an XMM register or a 32-bit memory location. The first source and destination
operands are XMM registers.
128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:32) of the corre-
sponding destination register remain unchanged.
VEX.128 and EVEX encoded version: The first source operand is an XMM register encoded by VEX.vvvv. Bits
(127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits
(MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination operand is updated according to the write-
mask.
Software should ensure VMAXSS is encoded with VEX.L=0. Encoding VMAXSS with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed double precision floating-point values in the first source operand and the
second source operand and returns the minimum value for each pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source
operand (from either the first or second operand) be returned, the action of MINPD can be emulated using a
sequence of instructions, such as, a comparison followed by AND, ANDN, and OR.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.
Operation
MIN(SRC1, SRC2)
{
IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
ELSE IF (SRC1 = NaN) THEN DEST := SRC2; FI;
ELSE IF (SRC2 = NaN) THEN DEST := SRC2; FI;
ELSE IF (SRC1 < SRC2) THEN DEST := SRC1;
ELSE DEST := SRC2;
FI;
}
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed single precision floating-point values in the first source operand and the
second source operand and returns the minimum value for each pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source
operand (from either the first or second operand) be returned, the action of MINPS can be emulated using a
sequence of instructions, such as, a comparison followed by AND, ANDN, and OR.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of
the corresponding ZMM register destination are zeroed.
Operation
MIN(SRC1, SRC2)
{
IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
ELSE IF (SRC1 = NaN) THEN DEST := SRC2; FI;
ELSE IF (SRC2 = NaN) THEN DEST := SRC2; FI;
ELSE IF (SRC1 < SRC2) THEN DEST := SRC1;
ELSE DEST := SRC2;
FI;
}
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compares the low double precision floating-point values in the first source operand and the second source
operand, and returns the minimum value to the low quadword of the destination operand. When the second source
operand is a memory operand, only 64 bits are accessed.
If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If
a value in the second source operand is an SNaN, then SNaN is returned unchanged to the destination (that is, a
QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid
floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand
(from either the first or second source) be returned, the action of MINSD can be emulated using a sequence of
instructions, such as, a comparison followed by AND, ANDN, and OR.
The second source operand can be an XMM register or a 64-bit memory location. The first source and destination
operands are XMM registers.
128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:64) of the
corresponding destination register remain unchanged.
VEX.128 and EVEX encoded version: Bits (127:64) of the XMM register destination are copied from corresponding
bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination operand is updated according to the write-
mask.
Software should ensure VMINSD is encoded with VEX.L=0. Encoding VMINSD with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compares the low single precision floating-point values in the first source operand and the second source operand
and returns the minimum value to the low doubleword of the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If
a value in the second operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a QNaN
version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid
floating-point value, is written to the result. If instead of this behavior, it is required that the NaN in either source
operand be returned, the action of MINSS can be emulated using a sequence of instructions, such as a comparison
followed by AND, ANDN, and OR.
The second source operand can be an XMM register or a 32-bit memory location. The first source and destination
operands are XMM registers.
128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:32) of the corre-
sponding destination register remain unchanged.
VEX.128 and EVEX encoded version: The first source operand is an XMM register encoded by (E)VEX.vvvv. Bits
(127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits
(MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination operand is updated according to the write-
mask.
Software should ensure VMINSS is encoded with VEX.L=0. Encoding VMINSS with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Moves 2, 4 or 8 double precision floating-point values from the source operand (second operand) to the destination
operand (first operand). This instruction can be used to load an XMM, YMM or ZMM register from a 128-bit, 256-bit
or 512-bit memory location, to store the contents of an XMM, YMM or ZMM register into a 128-bit, 256-bit or 512-bit
memory location, or to move data between two XMM, YMM or ZMM registers.
Operation
VMOVAPD (EVEX Encoded Versions, Register-Copy Form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2 in Table 2-18, “Type 1 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-46, “Type E1 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Moves 4, 8 or 16 single precision floating-point values from the source operand (second operand) to the destination
operand (first operand). This instruction can be used to load an XMM, YMM or ZMM register from a 128-bit, 256-bit
or 512-bit memory location, to store the contents of an XMM, YMM or ZMM register into a 128-bit, 256-bit or 512-bit
memory location, or to move data between two XMM, YMM or ZMM registers.
Operation
VMOVAPS (EVEX Encoded Versions, Register-Copy Form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type 1.SSE in Table 2-18, “Type 1 Class Exception Conditions,”
additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Table 2-46, “Type E1 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
For 256-bit or higher versions: Duplicates even-indexed double precision floating-point values from the source
operand (the second operand) into adjacent pairs and stores them to the destination operand (the first operand).
For 128-bit versions: Duplicates the low double precision floating-point value from the source operand (the second
operand) and stores it to the destination operand (the first operand).
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register are unchanged. The
source operand is XMM register or a 64-bit memory location.
VEX.128 and EVEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. The source
operand is XMM register or a 64-bit memory location. The destination is updated conditionally under the writemask
for EVEX version.
VEX.256 and EVEX.256 encoded version: Bits (MAXVL-1:256) of the destination register are zeroed. The source
operand is YMM register or a 256-bit memory location. The destination is updated conditionally under the write-
mask for EVEX version.
EVEX.512 encoded version: The destination is updated according to the writemask. The source operand is ZMM
register or a 512-bit memory location.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
(Figure: VMOVDDUP operation. For source elements X3 X2 X1 X0, the destination receives X2 X2 X0 X0.)
Operation
VMOVDDUP (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
TMP_SRC[63:0] := SRC[63:0]
TMP_SRC[127:64] := SRC[63:0]
IF VL >= 256
TMP_SRC[191:128] := SRC[191:128]
TMP_SRC[255:192] := SRC[191:128]
FI;
IF VL >= 512
TMP_SRC[319:256] := SRC[319:256]
TMP_SRC[383:320] := SRC[319:256]
TMP_SRC[447:384] := SRC[447:384]
TMP_SRC[511:448] := SRC[447:384]
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_SRC[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-54, “Type E5NF Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.
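The duplication pattern above is exposed in C through the SSE3 intrinsic _mm_movedup_pd; the example below is illustrative only, not SDM text.

#include <pmmintrin.h>
#include <stdio.h>

int main(void)
{
    __m128d x = _mm_set_pd(2.0, 1.0);        /* element 1 = 2.0, element 0 = 1.0 */
    double r[2];
    _mm_storeu_pd(r, _mm_movedup_pd(x));     /* duplicates the low (even-indexed) element */
    printf("%g %g\n", r[0], r[1]);           /* prints: 1 1 */
    return 0;
}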
NOTES:
1. For this specific instruction, VEX.W/EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Copies a doubleword from the source operand (second operand) to the destination operand (first operand). The
source and destination operands can be general-purpose registers, MMX technology registers, XMM registers, or
32-bit memory locations. This instruction can be used to move a doubleword to and from the low doubleword of an
MMX technology register and a general-purpose register or a 32-bit memory location, or to and from the low
doubleword of an XMM register and a general-purpose register or a 32-bit memory location. The instruction cannot
be used to transfer data between MMX technology registers, between XMM registers, between general-purpose
registers, or between memory locations.
When the destination operand is an MMX technology register, the source operand is written to the low doubleword
of the register, and the register is zero-extended to 64 bits. When the destination operand is an XMM register, the
source operand is written to the low doubleword of the register, and the register is zero-extended to 128 bits.
In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.B prefix permits access to addi-
tional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See the summary chart at the
beginning of this section for encoding data and limits.
MOVD/Q with XMM destination:
Moves a dword/qword integer from the source operand and stores it in the low 32/64-bits of the destination XMM
register. The upper bits of the destination are zeroed. The source operand can be a 32/64-bit register or 32/64-bit
memory location.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding YMM destination register remain
unchanged. Qword operation requires the use of REX.W=1.
VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. Qword operation requires the
use of VEX.W=1.
EVEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed. Qword operation requires
the use of EVEX.W=1.
MOVD/Q with 32/64 reg/mem destination:
Stores the low dword/qword of the source XMM register to 32/64-bit memory location or general-purpose register.
Qword operation requires the use of REX.W=1, VEX.W=1, or EVEX.W=1.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
An attempt to execute VMOVD or VMOVQ encoded with VEX.L = 1 will cause an #UD exception.
Operation
MOVD (When Destination Operand is an MMX Technology Register)
DEST[31:0] := SRC;
DEST[63:32] := 00000000H;
Flags Affected
None.
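The XMM forms of MOVD correspond to the long-standing SSE2 intrinsics _mm_cvtsi32_si128 and _mm_cvtsi128_si32; the sketch below is illustrative only, not SDM text.

#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    __m128i v = _mm_cvtsi32_si128(0x12345678);   /* MOVD r32 to low dword; upper bits zeroed */
    printf("%#x\n", _mm_cvtsi128_si32(v));       /* MOVD low dword to r32; prints 0x12345678 */
    return 0;
}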
Description
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
EVEX encoded versions:
Moves 128, 256 or 512 bits of packed doubleword/quadword integer values from the source operand (the second
operand) to the destination operand (the first operand). This instruction can be used to load a vector register from
an int32/int64 memory location, to store the contents of a vector register into an int32/int64 memory location, or
to move data between two ZMM registers. When the source or destination operand is a memory operand, the
operand must be aligned on a 16 (EVEX.128)/32(EVEX.256)/64(EVEX.512)-byte boundary or a general-protection
exception (#GP) will be generated. To move integer data to and from unaligned memory locations, use the
VMOVDQU instruction.
The destination operand is updated at 32-bit (VMOVDQA32) or 64-bit (VMOVDQA64) granularity according to the
writemask.
VEX.256 encoded version:
Moves 256 bits of packed integer values from the source operand (second operand) to the destination operand
(first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the
contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.
When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte boundary
or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory
locations, use the VMOVDQU instruction. Bits (MAXVL-1:256) of the destination register are zeroed.
128-bit versions:
Moves 128 bits of packed integer values from the source operand (second operand) to the destination operand
(first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the
contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.
When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary
or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory
locations, use the VMOVDQU instruction.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding ZMM destination register remain
unchanged.
VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.
Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type 1.SSE2 in Table 2-18, “Type 1 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-46, “Type E1 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.
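The alignment contract described above is the practical difference between the aligned and unaligned load intrinsics; the sketch below is illustrative only (not SDM text) and assumes C11 alignment support.

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    _Alignas(16) int32_t buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    __m128i a = _mm_load_si128((const __m128i *)buf);          /* MOVDQA: address must be 16-byte aligned */
    __m128i u = _mm_loadu_si128((const __m128i *)(buf + 1));   /* MOVDQU: any alignment is accepted */
    printf("%d %d\n", _mm_cvtsi128_si32(a), _mm_cvtsi128_si32(u));   /* prints: 1 2 */
    return 0;
}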
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
EVEX encoded versions:
Moves 128, 256 or 512 bits of packed byte/word/doubleword/quadword integer values from the source operand
(the second operand) to the destination operand (first operand). This instruction can be used to load a vector
register from a memory location, to store the contents of a vector register into a memory location, or to move data
between two vector registers.
The destination operand is updated at 8-bit (VMOVDQU8), 16-bit (VMOVDQU16), 32-bit (VMOVDQU32), or 64-bit
(VMOVDQU64) granularity according to the writemask.
VEX.256 encoded version:
Moves 256 bits of packed integer values from the source operand (second operand) to the destination operand
(first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the
contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.
Bits (MAXVL-1:256) of the destination register are zeroed.
128-bit versions:
Moves 128 bits of packed integer values from the source operand (second operand) to the destination operand
(first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the
contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
When the source or destination operand is a memory operand, the operand may be unaligned to any alignment
without causing a general-protection exception (#GP) to be generated
VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.
Operation
VMOVDQU8 (EVEX Encoded Versions, Register-Copy Form)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SRC[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE DEST[i+7:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction cannot be used for memory to register moves.
128-bit two-argument form:
Moves two packed single precision floating-point values from the high quadword of the second XMM argument
(second operand) to the low quadword of the first XMM register (first argument). The quadword at bits 127:64 of
the destination operand is left unchanged. Bits (MAXVL-1:128) of the corresponding destination register remain
unchanged.
128-bit and EVEX three-argument form:
Moves two packed single precision floating-point values from the high quadword of the third XMM argument (third
operand) to the low quadword of the destination (first operand). Copies the high quadword from the second XMM
argument (second operand) to the high quadword of the destination (first operand). Bits (MAXVL-1:128) of the
corresponding destination register are zeroed.
An attempt to execute VMOVHLPS encoded with VEX.L = 1 or EVEX.L’L = 1 will cause an #UD exception.
Operation
MOVHLPS (128-bit Two-Argument Form)
DEST[63:0] := SRC[127:64]
DEST[MAXVL-1:64] (Unmodified)
Intel C/C++ Compiler Intrinsic Equivalent
MOVHLPS __m128 _mm_movehl_ps(__m128 a, __m128 b)
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-24, “Type 7 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Exceptions Type E7NM.128 in Table 2-57, “Type E7NM Class Exception Conditions.”
MOVHPD—Move High Packed Double Precision Floating-Point Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 16 /r MOVHPD xmm1, m64 | A | V/V | SSE2 | Move double precision floating-point value from m64 to high quadword of xmm1.
VEX.128.66.0F.WIG 16 /r VMOVHPD xmm2, xmm1, m64 | B | V/V | AVX | Merge double precision floating-point value from m64 and the low quadword of xmm1.
EVEX.128.66.0F.W1 16 /r VMOVHPD xmm2, xmm1, m64 | D | V/V | AVX512F OR AVX10.1 (see note 1) | Merge double precision floating-point value from m64 and the low quadword of xmm1.
66 0F 17 /r MOVHPD m64, xmm1 | C | V/V | SSE2 | Move double precision floating-point value from high quadword of xmm1 to m64.
VEX.128.66.0F.WIG 17 /r VMOVHPD m64, xmm1 | C | V/V | AVX | Move double precision floating-point value from high quadword of xmm1 to m64.
EVEX.128.66.0F.W1 17 /r VMOVHPD m64, xmm1 | E | V/V | AVX512F OR AVX10.1 (see note 1) | Move double precision floating-point value from high quadword of xmm1 to m64.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction cannot be used for register to register or memory to memory moves.
128-bit Legacy SSE load:
Moves a double precision floating-point value from the source 64-bit memory operand and stores it in the high 64-
bits of the destination XMM register. The lower 64 bits of the XMM register are preserved. Bits (MAXVL-1:128) of the
corresponding destination register are preserved.
VEX.128 & EVEX encoded load:
Loads a double precision floating-point value from the source 64-bit memory operand (the third operand) and
stores it in the upper 64-bits of the destination XMM register (first operand). The low 64-bits from the first source
operand (second operand) are copied to the low 64-bits of the destination. Bits (MAXVL-1:128) of the corre-
sponding destination register are zeroed.
128-bit store:
Stores a double precision floating-point value from the high 64-bits of the XMM register source (second operand)
to the 64-bit memory location (first operand).
Note: VMOVHPD (store) (VEX.128.66.0F 17 /r) is legal and has the same behavior as the existing 66 0F 17 store.
For VMOVHPD (store) VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instruction will #UD.
An attempt to execute VMOVHPD encoded with VEX.L = 1 or EVEX.L’L = 1 will cause an #UD exception.
VMOVHPD (Store)
DEST[63:0] := SRC[127:64]
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”
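The load and store forms of MOVHPD map to the SSE2 intrinsics _mm_loadh_pd and _mm_storeh_pd; the example below is illustrative only, not SDM text.

#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    double hi = 7.0, out;
    __m128d v = _mm_set_pd(2.0, 1.0);   /* element 1 = 2.0, element 0 = 1.0 */
    v = _mm_loadh_pd(v, &hi);           /* MOVHPD load: replace the high quadword with 7.0 */
    _mm_storeh_pd(&out, v);             /* MOVHPD store: write the high quadword to memory */
    printf("%g\n", out);                /* prints: 7 */
    return 0;
}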
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction cannot be used for register to register or memory to memory moves.
128-bit Legacy SSE load:
Moves two packed single precision floating-point values from the source 64-bit memory operand and stores them
in the high 64-bits of the destination XMM register. The lower 64 bits of the XMM register are preserved. Bits
(MAXVL-1:128) of the corresponding destination register are preserved.
VEX.128 & EVEX encoded load:
Loads two single precision floating-point values from the source 64-bit memory operand (the third operand) and
stores it in the upper 64-bits of the destination XMM register (first operand). The low 64-bits from the first source
operand (the second operand) are copied to the lower 64-bits of the destination. Bits (MAXVL-1:128) of the corre-
sponding destination register are zeroed.
128-bit store:
Stores two packed single precision floating-point values from the high 64-bits of the XMM register source (second
operand) to the 64-bit memory location (first operand).
Note: VMOVHPS (store) (VEX.128.0F 17 /r) is legal and has the same behavior as the existing 0F 17 store. For
VMOVHPS (store) VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instruction will #UD.
An attempt to execute VMOVHPS encoded with VEX.L = 1 or EVEX.L’L = 1 will cause an #UD exception.
VMOVHPS (Store)
DEST[63:0] := SRC[127:64]
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction cannot be used for memory to register moves.
128-bit two-argument form:
Moves two packed single precision floating-point values from the low quadword of the second XMM argument
(second operand) to the high quadword of the first XMM register (first argument). The low quadword of the desti-
nation operand is left unchanged. Bits (MAXVL-1:128) of the corresponding destination register are unmodified.
128-bit three-argument forms:
Moves two packed single precision floating-point values from the low quadword of the third XMM argument (third
operand) to the high quadword of the destination (first operand). Copies the low quadword from the second XMM
argument (second operand) to the low quadword of the destination (first operand). Bits (MAXVL-1:128) of the
corresponding destination register are zeroed.
An attempt to execute VMOVLHPS encoded with VEX.L = 1 or EVEX.L’L = 1 will cause an #UD exception.
Operation
MOVLHPS (128-bit Two-Argument Form)
DEST[63:0] (Unmodified)
DEST[127:64] := SRC[63:0]
DEST[MAXVL-1:128] (Unmodified)
Intel C/C++ Compiler Intrinsic Equivalent
MOVLHPS __m128 _mm_movelh_ps(__m128 a, __m128 b)
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-24, “Type 7 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Exceptions Type E7NM.128 in Table 2-57, “Type E7NM Class Exception Conditions.”
MOVLPD—Move Low Packed Double Precision Floating-Point Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F 12 /r MOVLPD xmm1, m64 | A | V/V | SSE2 | Move double precision floating-point value from m64 to low quadword of xmm1.
VEX.128.66.0F.WIG 12 /r VMOVLPD xmm2, xmm1, m64 | B | V/V | AVX | Merge double precision floating-point value from m64 and the high quadword of xmm1.
EVEX.128.66.0F.W1 12 /r VMOVLPD xmm2, xmm1, m64 | D | V/V | AVX512F OR AVX10.1 (see note 1) | Merge double precision floating-point value from m64 and the high quadword of xmm1.
66 0F 13 /r MOVLPD m64, xmm1 | C | V/V | SSE2 | Move double precision floating-point value from low quadword of xmm1 to m64.
VEX.128.66.0F.WIG 13 /r VMOVLPD m64, xmm1 | C | V/V | AVX | Move double precision floating-point value from low quadword of xmm1 to m64.
EVEX.128.66.0F.W1 13 /r VMOVLPD m64, xmm1 | E | V/V | AVX512F OR AVX10.1 (see note 1) | Move double precision floating-point value from low quadword of xmm1 to m64.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction cannot be used for register to register or memory to memory moves.
128-bit Legacy SSE load:
Moves a double precision floating-point value from the source 64-bit memory operand and stores it in the low 64-
bits of the destination XMM register. The upper 64 bits of the XMM register are preserved. Bits (MAXVL-1:128) of the
corresponding destination register are preserved.
VEX.128 & EVEX encoded load:
Loads a double precision floating-point value from the source 64-bit memory operand (third operand), merges it
with the upper 64-bits of the first source XMM register (second operand), and stores it in the low 128-bits of the
destination XMM register (first operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.
128-bit store:
Stores a double precision floating-point value from the low 64-bits of the XMM register source (second operand) to
the 64-bit memory location (first operand).
Note: VMOVLPD (store) (VEX.128.66.0F 13 /r) is legal and has the same behavior as the existing 66 0F 13 store.
For VMOVLPD (store) VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instruction will #UD.
An attempt to execute VMOVLPD encoded with VEX.L = 1 or EVEX.L’L = 1 will cause an #UD exception.
VMOVLPD (Store)
DEST[63:0] := SRC[63:0]
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction cannot be used for register to register or memory to memory moves.
128-bit Legacy SSE load:
Moves two packed single precision floating-point values from the source 64-bit memory operand and stores them
in the low 64-bits of the destination XMM register. The upper 64 bits of the XMM register are preserved. Bits
(MAXVL-1:128) of the corresponding destination register are preserved.
VEX.128 & EVEX encoded load:
Loads two packed single precision floating-point values from the source 64-bit memory operand (the third
operand), merges them with the upper 64-bits of the first source operand (the second operand), and stores them
in the low 128-bits of the destination register (the first operand). Bits (MAXVL-1:128) of the corresponding desti-
nation register are zeroed.
128-bit store:
Stores two packed single precision floating-point values from the low 64-bits of the XMM register source (second
operand) to the 64-bit memory location (first operand).
Note: VMOVLPS (store) (VEX.128.0F 13 /r) is legal and has the same behavior as the existing 0F 13 store. For
VMOVLPS (store) VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instruction will #UD.
An attempt to execute VMOVLPS encoded with VEX.L = 1 or EVEX.L’L = 1 will cause an #UD exception.
VMOVLPS (Store)
DEST[63:0] := SRC[63:0]
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Moves the packed integers in the source operand (second operand) to the destination operand (first operand) using
a non-temporal hint to prevent caching of the data during the write to memory. The source operand is an XMM
register, YMM register or ZMM register, which is assumed to contain integer data (packed bytes, words, double-
words, or quadwords). The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory
operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (512-bit
version) boundary; otherwise, a general-protection exception (#GP) will be generated.
The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the
data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being
written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an
uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see
“Caching of Temporal vs. Non-Temporal Data” in Chapter 10 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1.
Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with
the SFENCE or MFENCE instruction should be used in conjunction with VMOVNTDQ instructions if multiple proces-
sors might use different memory types to read/write the destination memory locations.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, VEX.L must be 0; otherwise instructions will
#UD.
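As a usage sketch, the following C code issues non-temporal 128-bit integer stores followed by SFENCE, along the lines recommended above. The function name is hypothetical; 'dst' is assumed 16-byte aligned and 'n' a multiple of 16.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: _mm_stream_si128 typically maps to (V)MOVNTDQ. */
void stream_fill(uint8_t *dst, __m128i pattern, size_t n)
{
    for (size_t i = 0; i < n; i += 16)
        _mm_stream_si128((__m128i *)(dst + i), pattern);  /* non-temporal store */

    _mm_sfence();  /* make the weakly-ordered WC stores globally visible */
}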
Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type1.SSE2 in Table 2-18, “Type 1 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-47, “Type E1NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
MOVNTDQA loads a double quadword from the source operand (second operand) to the destination operand (first
operand) using a non-temporal hint if the memory source is WC (write combining) memory type. For WC memory
type, the non-temporal hint may be implemented by loading a temporary internal buffer with the equivalent of an
aligned cache line without filling this data to the cache. Any memory-type aliased lines in the cache will be snooped
and flushed. Subsequent MOVNTDQA reads to unread portions of the WC cache line will receive data from the
temporary internal buffer if data is available. The temporary internal buffer may be flushed by the processor at any
time for any reason, for example:
• A load operation other than a MOVNTDQA which references memory already resident in a temporary internal
buffer.
• A non-WC reference to memory already resident in a temporary internal buffer.
• Interleaving of reads and writes to a single temporary internal buffer.
• Repeated (V)MOVNTDQA loads of a particular 16-byte item in a streaming line.
• Certain micro-architectural conditions including resource shortages, detection of a mis-speculation condition,
and various fault conditions.
The non-temporal hint is implemented by using a write combining (WC) memory type protocol when reading the
data from memory. Using this protocol, the processor does not read the data into the cache hierarchy, nor does it
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being
read can override the non-temporal hint, if the memory address specified for the non-temporal read is not a WC
memory region.
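The sketch below uses the SSE4.1 intrinsic that typically maps to (V)MOVNTDQA to read from a WC-mapped buffer (for example, device memory). The function name is hypothetical; 'src' is assumed 16-byte aligned and mapped as WC for the hint to take effect.

#include <immintrin.h>
#include <stddef.h>

/* Sketch: streaming loads with _mm_stream_load_si128 (MOVNTDQA). */
__m128i sum_wc_buffer(const __m128i *src, size_t count)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < count; i++) {
        __m128i v = _mm_stream_load_si128((__m128i *)&src[i]);
        acc = _mm_add_epi32(acc, v);   /* accumulate packed dwords */
    }
    return acc;
}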
Operation
MOVNTDQA (128-bit Legacy SSE Form)
DEST := SRC
DEST[MAXVL-1:128] (Unmodified)
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-18, “Type 1 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-47, “Type E1NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Moves the packed double precision floating-point values in the source operand (second operand) to the destination
operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The
source operand is an XMM register, YMM register or ZMM register, which is assumed to contain packed double
precision floating-point data. The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory
operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte
(EVEX.512 encoded version) boundary; otherwise, a general-protection exception (#GP) will be generated.
The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the
data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being
written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an
uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see
“Caching of Temporal vs. Non-Temporal Data” in Chapter 10 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1.
Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with
the SFENCE or MFENCE instruction should be used in conjunction with MOVNTPD instructions if multiple processors
might use different memory types to read/write the destination memory locations.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, VEX.L must be 0; otherwise instructions will
#UD.
Operation
VMOVNTPD (EVEX Encoded Versions)
VL = 128, 256, 512
DEST[VL-1:0] := SRC[VL-1:0]
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type1.SSE2 in Table 2-18, “Type 1 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-47, “Type E1NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
MOVNTPS—Store Packed Single Precision Floating-Point Values Using Non-Temporal Hint
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F 2B /r MOVNTPS m128, xmm1 | A | V/V | SSE | Move packed single precision values xmm1 to mem using non-temporal hint.
VEX.128.0F.WIG 2B /r VMOVNTPS m128, xmm1 | A | V/V | AVX | Move packed single precision values xmm1 to mem using non-temporal hint.
VEX.256.0F.WIG 2B /r VMOVNTPS m256, ymm1 | A | V/V | AVX | Move packed single precision values ymm1 to mem using non-temporal hint.
EVEX.128.0F.W0 2B /r VMOVNTPS m128, xmm1 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Move packed single precision values in xmm1 to m128 using non-temporal hint.
EVEX.256.0F.W0 2B /r VMOVNTPS m256, ymm1 | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Move packed single precision values in ymm1 to m256 using non-temporal hint.
EVEX.512.0F.W0 2B /r VMOVNTPS m512, zmm1 | B | V/V | AVX512F OR AVX10.1¹ | Move packed single precision values in zmm1 to m512 using non-temporal hint.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Moves the packed single precision floating-point values in the source operand (second operand) to the destination
operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The
source operand is an XMM register, YMM register or ZMM register, which is assumed to contain packed single
precision floating-point data. The destination operand is a 128-bit, 256-bit or 512-bit memory location. The memory
operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte
(EVEX.512 encoded version) boundary; otherwise, a general-protection exception (#GP) will be generated.
The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the
data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it
fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being
written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an
uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see
“Caching of Temporal vs. Non-Temporal Data” in Chapter 10 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1.
Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with
the SFENCE or MFENCE instruction should be used in conjunction with MOVNTPS instructions if multiple processors
might use different memory types to read/write the destination memory locations.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
Operation
VMOVNTPS (EVEX Encoded Versions)
VL = 128, 256, 512
DEST[VL-1:0] := SRC[VL-1:0]
DEST[MAXVL-1:VL] := 0
MOVNTPS
DEST := SRC
Other Exceptions
Non-EVEX-encoded instruction, see Exceptions Type1.SSE in Table 2-18, “Type 1 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-47, “Type E1NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
MOVQ—Move Quadword
Opcode/Instruction | Op/En | 64/32-bit Mode | CPUID Feature Flag | Description
NP 0F 6F /r MOVQ mm, mm/m64 | A | V/V | MMX | Move quadword from mm/m64 to mm.
NP 0F 7F /r MOVQ mm/m64, mm | B | V/V | MMX | Move quadword from mm to mm/m64.
F3 0F 7E /r MOVQ xmm1, xmm2/m64 | A | V/V | SSE2 | Move quadword from xmm2/mem64 to xmm1.
VEX.128.F3.0F.WIG 7E /r VMOVQ xmm1, xmm2/m64 | A | V/V | AVX | Move quadword from xmm2 to xmm1.
EVEX.128.F3.0F.W1 7E /r VMOVQ xmm1, xmm2/m64 | C | V/V | AVX512F OR AVX10.1¹ | Move quadword from xmm2/m64 to xmm1.
66 0F D6 /r MOVQ xmm2/m64, xmm1 | B | V/V | SSE2 | Move quadword from xmm1 to xmm2/mem64.
VEX.128.66.0F.WIG D6 /r VMOVQ xmm1/m64, xmm2 | B | V/V | AVX | Move quadword from xmm2 register to xmm1/m64.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Copies a quadword from the source operand (second operand) to the destination operand (first operand). The
source and destination operands can be MMX technology registers, XMM registers, or 64-bit memory locations.
This instruction can be used to move a quadword between two MMX technology registers or between an MMX tech-
nology register and a 64-bit memory location, or to move data between two XMM registers or between an XMM
register and a 64-bit memory location. The instruction cannot be used to transfer data between memory locations.
When the source operand is an XMM register, the low quadword is moved; when the destination operand is an XMM
register, the quadword is stored to the low quadword of the register, and the high quadword is cleared to all 0s.
In 64-bit mode and if not encoded using VEX/EVEX, use of the REX prefix in the form of REX.R permits this instruc-
tion to access additional registers (XMM8-XMM15).
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
If VMOVQ is encoded with VEX.L = 1, an attempt to execute the instruction will cause an #UD exception.
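A minimal C sketch of this behavior, using the SSE2 intrinsics that typically map to MOVQ; the function and pointer names are hypothetical.

#include <immintrin.h>
#include <stdint.h>

/* Sketch: loading with MOVQ zeroes the upper quadword of the XMM
   destination, as described above; storing writes the low 64 bits. */
void movq_demo(const uint64_t *in, uint64_t *out)
{
    __m128i v = _mm_loadl_epi64((const __m128i *)in);  /* high 64 bits := 0 */
    _mm_storel_epi64((__m128i *)out, v);               /* store low 64 bits */
}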
MOVQ Instruction When Source and Destination Operands are XMM Registers
DEST[63:0] := SRC[63:0];
DEST[127:64] := 0000000000000000H;
VMOVQ (7E - EVEX Encoded Version) With XMM Register Source and Destination
DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] := 0
VMOVQ (D6 - EVEX Encoded Version) With XMM Register Source and Destination
DEST[63:0] := SRC[63:0]
DEST[MAXVL-1:64] := 0
Flags Affected
None.
Other Exceptions
See Table 24-8, “Exception Conditions for Legacy SIMD/MMX Instructions without FP Exception,” in the Intel® 64
and IA-32 Architectures Software Developer’s Manual, Volume 3B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Moves a scalar double precision floating-point value from the source operand (second operand) to the destination
operand (first operand). The source and destination operands can be XMM registers or 64-bit memory locations.
This instruction can be used to move a double precision floating-point value to and from the low quadword of an
XMM register and a 64-bit memory location, or to move a double precision floating-point value between the low
quadwords of two XMM registers.
Operation
VMOVSD (EVEX.LLIG.F2.0F 10 /r: VMOVSD xmm1, m64 With Support for 32 Registers)
IF k1[0] or *no writemask*
THEN DEST[63:0] := SRC[63:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[63:0] := 0
FI;
FI;
DEST[MAXVL-1:64] := 0
VMOVSD (EVEX.LLIG.F2.0F 11 /r: VMOVSD m64, xmm1 With Support for 32 Registers)
IF k1[0] or *no writemask*
THEN DEST[63:0] := SRC[63:0]
ELSE *DEST[63:0] remains unchanged* ; merging-masking
FI;
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Table 2-60, “Type E10 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Duplicates odd-indexed single precision floating-point values from the source operand (the second operand) to
adjacent element pair in the destination operand (the first operand). See Figure 4-3. The source operand is an
XMM, YMM or ZMM register or 128, 256 or 512-bit memory location and the destination operand is an XMM, YMM
or ZMM register.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.
VEX.256 encoded version: Bits (MAXVL-1:256) of the destination register are zeroed.
EVEX encoded version: The destination operand is updated at 32-bit granularity according to the writemask.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
[Figure 4-3 shows the MOVSHDUP operation: DEST = X7 X7 X5 X5 X3 X3 X1 X1.]
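For illustration, the SSE3 intrinsic that typically maps to (V)MOVSHDUP is shown below; the function name is hypothetical. The even-indexed counterpart (MOVSLDUP) is exposed as _mm_moveldup_ps.

#include <immintrin.h>

/* Sketch: for input {x0, x1, x2, x3} the result is {x1, x1, x3, x3},
   i.e., the odd-indexed elements duplicated into each adjacent pair. */
__m128 dup_odd(__m128 x)
{
    return _mm_movehdup_ps(x);
}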
Operation
VMOVSHDUP (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
TMP_SRC[31:0] := SRC[63:32]
TMP_SRC[63:32] := SRC[63:32]
TMP_SRC[95:64] := SRC[127:96]
TMP_SRC[127:96] := SRC[127:96]
IF VL >= 256
TMP_SRC[159:128] := SRC[191:160]
TMP_SRC[191:160] := SRC[191:160]
TMP_SRC[223:192] := SRC[255:224]
TMP_SRC[255:224] := SRC[255:224]
FI;
IF VL >= 512
TMP_SRC[287:256] := SRC[319:288]
TMP_SRC[319:288] := SRC[319:288]
TMP_SRC[351:320] := SRC[383:352]
TMP_SRC[383:352] := SRC[383:352]
TMP_SRC[415:384] := SRC[447:416]
TMP_SRC[447:416] := SRC[447:416]
TMP_SRC[479:448] := SRC[511:480]
TMP_SRC[511:480] := SRC[511:480]
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
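The per-element writemask handling in the loop above can be modeled in plain C as a sketch; this is a reference model only, not an intrinsic, and the function name is hypothetical.

#include <stdint.h>

/* Sketch: merging- vs. zeroing-masking for 32-bit elements, KL lanes. */
void apply_writemask32(uint32_t *dest, const uint32_t *tmp_src,
                       uint16_t k1, int kl, int zeroing)
{
    for (int j = 0; j < kl; j++) {
        if (k1 & (1u << j))
            dest[j] = tmp_src[j];   /* element selected by the mask */
        else if (zeroing)
            dest[j] = 0;            /* zeroing-masking */
        /* else: merging-masking, dest[j] remains unchanged */
    }
}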
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Duplicates even-indexed single precision floating-point values from the source operand (the second operand) to
adjacent element pairs in the destination operand (the first operand). See Figure 4-4. The source operand is an
XMM, YMM or ZMM register or 128, 256 or 512-bit memory location and the
destination operand is an XMM, YMM or ZMM register.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
VEX.128 encoded version: Bits (MAXVL-1:128) of the destination register are zeroed.
VEX.256 encoded version: Bits (MAXVL-1:256) of the destination register are zeroed.
EVEX encoded version: The destination operand is updated at 32-bit granularity according to the writemask.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
[Figure 4-4 shows the MOVSLDUP operation: DEST = X6 X6 X4 X4 X2 X2 X0 X0.]
Operation
VMOVSLDUP (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
TMP_SRC[31:0] := SRC[31:0]
TMP_SRC[63:32] := SRC[31:0]
TMP_SRC[95:64] := SRC[95:64]
TMP_SRC[127:96] := SRC[95:64]
IF VL >= 256
TMP_SRC[159:128] := SRC[159:128]
TMP_SRC[191:160] := SRC[159:128]
TMP_SRC[223:192] := SRC[223:192]
TMP_SRC[255:224] := SRC[223:192]
FI;
IF VL >= 512
TMP_SRC[287:256] := SRC[287:256]
TMP_SRC[319:288] := SRC[287:256]
TMP_SRC[351:320] := SRC[351:320]
TMP_SRC[383:352] := SRC[351:320]
TMP_SRC[415:384] := SRC[415:384]
TMP_SRC[447:416] := SRC[415:384]
TMP_SRC[479:448] := SRC[479:448]
TMP_SRC[511:480] := SRC[479:448]
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B or VEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Moves a scalar single precision floating-point value from the source operand (second operand) to the destination
operand (first operand). The source and destination operands can be XMM registers or 32-bit memory locations.
This instruction can be used to move a single precision floating-point value to and from the low doubleword of an
XMM register and a 32-bit memory location, or to move a single precision floating-point value between the low
doublewords of two XMM registers.
Operation
VMOVSS (EVEX.LLIG.F3.0F.W0 11 /r When the Source Operand is Memory and the Destination is an XMM Register)
IF k1[0] or *no writemask*
THEN DEST[31:0] := SRC[31:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[31:0] := 0
FI;
FI;
DEST[MAXVL-1:32] := 0
VMOVSS (EVEX.LLIG.F3.0F.W0 10 /r When the Source Operand is an XMM Register and the Destination is Memory)
IF k1[0] or *no writemask*
THEN DEST[31:0] := SRC[31:0]
ELSE *DEST[31:0] remains unchanged* ; merging-masking
FI;
VMOVSS (EVEX.LLIG.F3.0F.W0 10/11 /r Where the Source and Destination are XMM Registers)
IF k1[0] or *no writemask*
THEN DEST[31:0] := SRC2[31:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0
VMOVSS (VEX.128.F3.0F 10 /r Where the Source and Destination are XMM Registers)
DEST[31:0] := SRC2[31:0]
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0
VMOVSS (VEX.128.F3.0F 10 /r When the Source Operand is Memory and the Destination is an XMM Register)
DEST[31:0] := SRC[31:0]
DEST[MAXVL-1:32] := 0
MOVSS/VMOVSS (When the Source Operand is an XMM Register and the Destination is Memory)
DEST[31:0] := SRC[31:0]
MOVSS (Legacy SSE Version when the Source Operand is Memory and the Destination is an XMM Register)
DEST[31:0] := SRC[31:0]
DEST[127:32] := 0
DEST[MAXVL-1:128] (Unmodified)
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Table 2-60, “Type E10 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Operation
VMOVUPD (EVEX Encoded Versions, Register-Copy Form)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Operation
VMOVUPS (EVEX Encoded Versions, Register-Copy Form)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Multiply packed double precision floating-point values from the first source operand with corresponding values in
the second source operand, and stores the packed double precision floating-point results in the destination
operand.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. Bits (MAXVL-1:256) of the corre-
sponding destination ZMM register are zeroed.
VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the destination YMM register are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Multiply the packed single precision floating-point values from the first source operand with the corresponding
values in the second source operand, and stores the packed single precision floating-point results in the destina-
tion operand.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. Bits (MAXVL-1:256) of the corre-
sponding destination ZMM register are zeroed.
VEX.128 encoded version: The first source operand is a XMM register. The second source operand can be a XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the destination YMM register are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
ZMM register destination are unmodified.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Multiplies the low double precision floating-point value in the second source operand by the low double precision
floating-point value in the first source operand, and stores the double precision floating-point result in the destina-
tion operand. The second source operand can be an XMM register or a 64-bit memory location. The first source
operand and the destination operands are XMM registers.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-
1:64) of the corresponding destination register remain unchanged.
VEX.128 and EVEX encoded version: The quadword at bits 127:64 of the destination operand is copied from the
same bits of the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination operand is updated according to the write-
mask.
Software should ensure VMULSD is encoded with VEX.L=0. Encoding VMULSD with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.
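A minimal C sketch with the SSE2 intrinsic that typically maps to (V)MULSD; the function name is hypothetical.

#include <immintrin.h>

/* Sketch: only the low double is multiplied; the upper element of the
   result is taken from the first source, as described above. */
__m128d mulsd_demo(__m128d a, __m128d b)
{
    return _mm_mul_sd(a, b);   /* result = { a[0]*b[0], a[1] } */
}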
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Multiplies the low single precision floating-point value from the second source operand by the low single precision
floating-point value in the first source operand, and stores the single precision floating-point result in the destina-
tion operand. The second source operand can be an XMM register or a 32-bit memory location. The first source
operand and the destination operands are XMM registers.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-
1:32) of the corresponding YMM destination register remain unchanged.
VEX.128 and EVEX encoded version: The first source operand is an XMM register encoded by VEX.vvvv. The three
high-order doublewords of the destination operand are copied from the first source operand. Bits (MAXVL-1:128)
of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination operand is updated according to the write-
mask.
Software should ensure VMULSS is encoded with VEX.L=0. Encoding VMULSS with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a bitwise logical OR of the two, four or eight packed double precision floating-point values from the first
source operand and the second source operand, and stores the result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the
corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
register destination are unmodified.
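As a usage sketch, a common application of the bitwise OR on floating-point data is sign-bit manipulation; the function name below is hypothetical.

#include <immintrin.h>

/* Sketch: ORing each element with -0.0 (sign bit only) forces the sign
   bit, i.e., computes -|x| per element (ORPD with 8000000000000000H). */
__m128d force_negative(__m128d x)
{
    return _mm_or_pd(x, _mm_set1_pd(-0.0));
}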
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := SRC1[i+63:i] BITWISE OR SRC2[63:0]
ELSE
DEST[i+63:i] := SRC1[i+63:i] BITWISE OR SRC2[i+63:i]
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a bitwise logical OR of the four, eight or sixteen packed single precision floating-point values from the first
source operand and the second source operand, and stores the result in the destination operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a
32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with write-
mask k1.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register. The upper bits (MAXVL-1:256) of the
corresponding ZMM register destination are zeroed.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
register destination are unmodified.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
VEX.128.66.0F38.WIG 1D /r VPABSW xmm1, xmm2/m128 | A | V/V | AVX | Compute the absolute value of 16-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.
VEX.128.66.0F38.WIG 1E /r VPABSD xmm1, xmm2/m128 | A | V/V | AVX | Compute the absolute value of 32-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
PABSB/W/D computes the absolute value of each data element of the source operand (the second operand) and
stores the UNSIGNED results in the destination operand (the first operand). PABSB operates on signed bytes,
PABSW operates on signed 16-bit words, and PABSD operates on signed 32-bit integers.
EVEX encoded VPABSD/Q: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location,
or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The destination operand is a
ZMM/YMM/XMM register updated according to the writemask.
EVEX encoded VPABSB/W: The source operand is a ZMM/YMM/XMM register, or a 512/256/128-bit memory loca-
tion. The destination operand is a ZMM/YMM/XMM register updated according to the writemask.
VEX.256 encoded versions: The source operand is a YMM register or a 256-bit memory location. The destination
operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding register destination are zeroed.
VEX.128 encoded versions: The source operand is an XMM register or 128-bit memory location. The destination
operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding register destination are zeroed.
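For illustration, the SSSE3 intrinsics that typically map to PABSB and PABSD are shown below; the function names are hypothetical.

#include <immintrin.h>

/* Sketch: the results are interpreted as unsigned, so ABS(-128) on a
   byte element produces 80H. */
__m128i abs_bytes(__m128i x)  { return _mm_abs_epi8(x);  }   /* PABSB */
__m128i abs_dwords(__m128i x) { return _mm_abs_epi32(x); }   /* PABSD */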
Operation
PABSB With 64-bit Operands:
Unsigned DEST[7:0] := ABS(SRC[7:0])
Repeat operation for 2nd through 7th bytes
Unsigned DEST[63:56] := ABS(SRC[63:56])
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN
Unsigned DEST[i+7:i] := ABS(SRC[i+7:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN
Unsigned DEST[i+15:i] := ABS(SRC[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPABSD/Q, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPABSB/W, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts packed signed word integers into packed signed byte integers (PACKSSWB) or converts packed signed
doubleword integers into packed signed word integers (PACKSSDW), using saturation to handle overflow condi-
tions. See Figure 4-6 for an example of the packing operation.
PACKSSWB converts packed signed word integers in the first and second source operands into packed signed byte
integers using signed saturation to handle overflow conditions beyond the range of signed byte integers. If the
signed word value is beyond the range of a signed byte value (i.e., greater than 7FH or less than 80H), the satu-
rated signed byte integer value of 7FH or 80H, respectively, is stored in the destination. PACKSSDW converts
packed signed doubleword integers in the first and second source operands into packed signed word integers using
signed saturation to handle overflow conditions beyond 7FFFH and 8000H.
EVEX encoded PACKSSWB: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM
register, updated conditionally under the writemask k1.
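A minimal C sketch of the signed-saturating pack, using the SSE2 intrinsic that typically maps to PACKSSWB; the function and parameter names are hypothetical.

#include <immintrin.h>

/* Sketch: word values outside the signed-byte range saturate to 7FH/80H. */
__m128i pack_words_to_bytes(__m128i lo_words, __m128i hi_words)
{
    /* Result bytes 0-7 come from lo_words, bytes 8-15 from hi_words. */
    return _mm_packs_epi16(lo_words, hi_words);
}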
Operation
PACKSSWB Instruction (128-bit Legacy SSE Version)
DEST[7:0] := SaturateSignedWordToSignedByte (DEST[15:0]);
DEST[15:8] := SaturateSignedWordToSignedByte (DEST[31:16]);
DEST[23:16] := SaturateSignedWordToSignedByte (DEST[47:32]);
DEST[31:24] := SaturateSignedWordToSignedByte (DEST[63:48]);
DEST[39:32] := SaturateSignedWordToSignedByte (DEST[79:64]);
DEST[47:40] := SaturateSignedWordToSignedByte (DEST[95:80]);
DEST[55:48] := SaturateSignedWordToSignedByte (DEST[111:96]);
DEST[63:56] := SaturateSignedWordToSignedByte (DEST[127:112]);
DEST[71:64] := SaturateSignedWordToSignedByte (SRC[15:0]);
DEST[79:72] := SaturateSignedWordToSignedByte (SRC[31:16]);
DEST[87:80] := SaturateSignedWordToSignedByte (SRC[47:32]);
DEST[95:88] := SaturateSignedWordToSignedByte (SRC[63:48]);
DEST[103:96] := SaturateSignedWordToSignedByte (SRC[79:64]);
DEST[111:104] := SaturateSignedWordToSignedByte (SRC[95:80]);
DEST[119:112] := SaturateSignedWordToSignedByte (SRC[111:96]);
DEST[127:120] := SaturateSignedWordToSignedByte (SRC[127:112]);
DEST[MAXVL-1:128] (Unmodified)
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPACKSSDW, see Table 2-52, “Type E4NF Class Exception Conditions.”
EVEX-encoded VPACKSSWB, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts packed signed doubleword integers in the first and second source operands into packed unsigned word
integers using unsigned saturation to handle overflow conditions. If the signed doubleword value is beyond the
range of an unsigned word (that is, greater than FFFFH or less than 0000H), the saturated unsigned word integer
value of FFFFH or 0000H, respectively, is stored in the destination.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-
bit memory location. The destination operand is a ZMM/YMM/XMM register, updated conditionally under the writemask k1.
Operation
PACKUSDW (Legacy SSE Instruction)
TMP[15:0] := (DEST[31:0] < 0) ? 0 : DEST[15:0];
DEST[15:0] := (DEST[31:0] > FFFFH) ? FFFFH : TMP[15:0] ;
TMP[31:16] := (DEST[63:32] < 0) ? 0 : DEST[47:32];
DEST[31:16] := (DEST[63:32] > FFFFH) ? FFFFH : TMP[31:16] ;
TMP[47:32] := (DEST[95:64] < 0) ? 0 : DEST[79:64];
DEST[47:32] := (DEST[95:64] > FFFFH) ? FFFFH : TMP[47:32] ;
TMP[63:48] := (DEST[127:96] < 0) ? 0 : DEST[111:96];
DEST[63:48] := (DEST[127:96] > FFFFH) ? FFFFH : TMP[63:48] ;
TMP[79:64] := (SRC[31:0] < 0) ? 0 : SRC[15:0];
DEST[79:64] := (SRC[31:0] > FFFFH) ? FFFFH : TMP[79:64] ;
TMP[95:80] := (SRC[63:32] < 0) ? 0 : SRC[47:32];
DEST[95:80] := (SRC[63:32] > FFFFH) ? FFFFH : TMP[95:80] ;
TMP[111:96] := (SRC[95:64] < 0) ? 0 : SRC[79:64];
DEST[111:96] := (SRC[95:64] > FFFFH) ? FFFFH : TMP[111:96] ;
TMP[127:112] := (SRC[127:96] < 0) ? 0 : SRC[111:96];
DEST[127:112] := (SRC[127:96] > FFFFH) ? FFFFH : TMP[127:112] ;
DEST[MAXVL-1:128] (Unmodified)
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts 4, 8, 16, or 32 signed word integers from the destination operand (first operand) and 4, 8, 16, or 32
signed word integers from the source operand (second operand) into 8, 16, 32 or 64 unsigned byte integers and
stores the result in the destination operand. (See Figure 4-6 for an example of the packing operation.) If a signed
word value is beyond the range of an unsigned byte (that is, greater than FFH or less than 00H), the saturated
unsigned byte integer value of FFH or 00H, respectively, is stored in the destination.
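A typical use is converting 16-bit intermediate results back to 8-bit pixels with unsigned saturation; the sketch below uses the SSE2 intrinsic that commonly maps to PACKUSWB, with hypothetical names.

#include <immintrin.h>
#include <stdint.h>

/* Sketch: values below 0 clamp to 00H and values above FFH clamp to FFH. */
void words_to_pixels(__m128i lo_words, __m128i hi_words, uint8_t out[16])
{
    _mm_storeu_si128((__m128i *)out, _mm_packus_epi16(lo_words, hi_words));
}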
Operation
PACKUSWB (With 64-bit Operands)
DEST[7:0] := SaturateSignedWordToUnsignedByte DEST[15:0];
DEST[15:8] := SaturateSignedWordToUnsignedByte DEST[31:16];
DEST[23:16] := SaturateSignedWordToUnsignedByte DEST[47:32];
DEST[31:24] := SaturateSignedWordToUnsignedByte DEST[63:48];
DEST[39:32] := SaturateSignedWordToUnsignedByte SRC[15:0];
DEST[47:40] := SaturateSignedWordToUnsignedByte SRC[31:16];
DEST[55:48] := SaturateSignedWordToUnsignedByte SRC[47:32];
DEST[63:56] := SaturateSignedWordToUnsignedByte SRC[63:48];
Flags Affected
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD add of the packed integers from the source operand (second operand) and the destination
operand (first operand), and stores the packed integer results in the destination operand. See Figure 9-4 in the
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD operation.
Overflow is handled with wraparound, as described in the following paragraphs.
The PADDB and VPADDB instructions add packed byte integers from the first source operand and second source
operand and store the packed integer results in the destination operand. When an individual result is too large to
be represented in 8 bits (overflow), the result is wrapped around and the low 8 bits are written to the destination
operand (that is, the carry is ignored).
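A minimal C sketch of the wraparound behavior, using the SSE2 intrinsic that typically maps to PADDB; the function name is hypothetical.

#include <immintrin.h>

/* Sketch: overflow wraps around. For byte elements, FFH + 01H produces
   00H (the carry is ignored). */
__m128i add_bytes_wrap(__m128i a, __m128i b)
{
    return _mm_add_epi8(a, b);
}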
Operation
PADDB (With 64-bit Operands)
DEST[7:0] := DEST[7:0] + SRC[7:0];
(* Repeat add operation for 2nd through 7th bytes *)
DEST[63:56] := DEST[63:56] + SRC[63:56];
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SRC1[i+7:i] + SRC2[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SRC1[i+15:i] + SRC2[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPADDD/Q, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPADDB/W, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD add of the packed signed integers from the source operand (second operand) and the destination
operand (first operand), and stores the packed integer results in the destination operand. See Figure 9-4 in the
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD operation.
Overflow is handled with signed saturation, as described in the following paragraphs.
(V)PADDSB performs a SIMD add of the packed signed integers with saturation from the first source operand and
second source operand and stores the packed integer results in the destination operand. When an individual byte
result is beyond the range of a signed byte integer (that is, greater than 7FH or less than 80H), the saturated value
of 7FH or 80H, respectively, is written to the destination operand.
(V)PADDSW performs a SIMD add of the packed signed word integers with saturation from the first source operand
and second source operand and stores the packed integer results in the destination operand. When an individual
word result is beyond the range of a signed word integer (that is, greater than 7FFFH or less than 8000H), the satu-
rated value of 7FFFH or 8000H, respectively, is written to the destination operand.
EVEX encoded versions: The first source operand is an ZMM/YMM/XMM register. The second source operand is an
ZMM/YMM/XMM register or a memory location. The destination operand is an ZMM/YMM/XMM register.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding register destination are zeroed.
128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or an 128-bit memory location. The destination is not distinct from the first source XMM register and the
upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
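For contrast with the wraparound case, the sketch below uses the SSE2 intrinsic that typically maps to PADDSB; the function name is hypothetical.

#include <immintrin.h>

/* Sketch: signed saturation. 7FH + 01H produces 7FH, and
   80H + FFH (that is, -128 + -1) produces 80H. */
__m128i add_bytes_signed_sat(__m128i a, __m128i b)
{
    return _mm_adds_epi8(a, b);
}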
Operation
PADDSB (With 64-bit Operands)
DEST[7:0] := SaturateToSignedByte(DEST[7:0] + SRC[7:0]);
(* Repeat add operation for 2nd through 7th bytes *)
DEST[63:56] := SaturateToSignedByte(DEST[63:56] + SRC[63:56] );
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateToSignedByte (SRC1[i+7:i] + SRC2[i+7:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
PADDSW (with 64-bit operands)
DEST[15:0] := SaturateToSignedWord(DEST[15:0] + SRC[15:0] );
(* Repeat add operation for 2nd and 3rd words *)
DEST[63:48] := SaturateToSignedWord(DEST[63:48] + SRC[63:48] );
PADDSW (with 128-bit operands)
DEST[15:0] := SaturateToSignedWord (DEST[15:0] + SRC[15:0]);
(* Repeat add operation for 2nd through 7th words *)
DEST[127:112] := SaturateToSignedWord (DEST[127:112] + SRC[127:112]);
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateToSignedWord (SRC1[i+15:i] + SRC2[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
VEX.256.66.0F.WIG DC /r VPADDUSB ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed unsigned byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1.
VEX.256.66.0F.WIG DD /r VPADDUSW ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Add packed unsigned word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1.
EVEX.128.66.0F.WIG DC /r VPADDUSB xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed unsigned byte integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1.
EVEX.256.66.0F.WIG DC /r VPADDUSB ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed unsigned byte integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1.
EVEX.512.66.0F.WIG DC /r VPADDUSB zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1² | Add packed unsigned byte integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1.
EVEX.128.66.0F.WIG DD /r VPADDUSW xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed unsigned word integers from xmm2 and xmm3/m128 and store the saturated results in xmm1 under writemask k1.
EVEX.256.66.0F.WIG DD /r VPADDUSW ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1² | Add packed unsigned word integers from ymm2 and ymm3/m256 and store the saturated results in ymm1 under writemask k1.
EVEX.512.66.0F.WIG DD /r VPADDUSW zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1² | Add packed unsigned word integers from zmm2 and zmm3/m512 and store the saturated results in zmm1 under writemask k1.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD add of the packed unsigned integers from the source operand (second operand) and the destina-
tion operand (first operand), and stores the packed integer results in the destination operand. See Figure 9-4 in the
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD operation.
Overflow is handled with unsigned saturation, as described in the following paragraphs.
(V)PADDUSB performs a SIMD add of the packed unsigned integers with saturation from the first source operand
and second source operand and stores the packed integer results in the destination operand. When an individual
byte result is beyond the range of an unsigned byte integer (that is, greater than FFH), the saturated value of FFH
is written to the destination operand.
(V)PADDUSW performs a SIMD add of the packed unsigned word integers with saturation from the first source
operand and second source operand and stores the packed integer results in the destination operand. When an
individual word result is beyond the range of an unsigned word integer (that is, greater than FFFFH), the saturated
value of FFFFH is written to the destination operand.
EVEX encoded versions: The first source operand is an ZMM/YMM/XMM register. The second source operand is an
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination is an ZMM/YMM/XMM register.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM
register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding register destination are zeroed.
128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or an 128-bit memory location. The destination is not distinct from the first source XMM register and the
upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
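The unsigned-saturation case is sketched below with the SSE2 intrinsic that typically maps to PADDUSB; the function name is hypothetical.

#include <immintrin.h>

/* Sketch: unsigned saturation, convenient for pixel math.
   F0H + 20H clamps to FFH instead of wrapping to 10H. */
__m128i add_bytes_unsigned_sat(__m128i a, __m128i b)
{
    return _mm_adds_epu8(a, b);
}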
Operation
PADDUSB (With 64-bit Operands)
DEST[7:0] := SaturateToUnsignedByte(DEST[7:0] + SRC[7:0]);
(* Repeat add operation for 2nd through 7th bytes *)
DEST[63:56] := SaturateToUnsignedByte(DEST[63:56] + SRC[63:56]);
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SaturateToUnsignedByte (SRC1[i+7:i] + SRC2[i+7:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateToUnsignedWord (SRC1[i+15:i] + SRC2[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
[Figure: per-lane operation. In each 128-bit lane (bits 127:0 and bits 255:128), SRC1 and SRC2 are combined under a shift amount of Imm8[7:0]*8 bits to produce the corresponding lane of DEST.]
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TMP_DEST[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] = 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a bitwise logical AND operation on the first source operand and second source operand and stores the
result in the destination operand. Each bit of the result is set to 1 if the corresponding bits of the first and second
operands are 1, otherwise it is set to 0.
Operation
PAND (64-bit Operand)
DEST := DEST AND SRC
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a bitwise logical NOT operation on the first source operand, then performs a bitwise AND with the second
source operand and stores the result in the destination operand. Each bit of the result is set to 1 if the corresponding
bit in the first operand is 0 and the corresponding bit in the second operand is 1; otherwise, it is set to 0.
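For illustration, a common use of this operand ordering (the first operand is the one complemented) is a mask-based blend built from PANDN, PAND, and POR; a minimal C sketch assuming an SSE2-capable compiler (variable names are illustrative):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    __m128i mask   = _mm_set1_epi8((char)0x0F);   /* bits to take from b_vals */
    __m128i a_vals = _mm_set1_epi8((char)0xAA);
    __m128i b_vals = _mm_set1_epi8((char)0x55);

    /* blend = (a AND NOT mask) OR (b AND mask); _mm_andnot_si128 complements its first argument. */
    __m128i blend = _mm_or_si128(_mm_andnot_si128(mask, a_vals),
                                 _mm_and_si128(mask, b_vals));

    uint8_t out[16];
    _mm_storeu_si128((__m128i *)out, blend);
    printf("0x%02X\n", out[0]);   /* prints 0xA5: high nibble from 0xAA, low nibble from 0x55 */
    return 0;
}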
Operation
PANDN (64-bit Operand)
DEST := NOT(DEST) AND SRC
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.WIG E0 /r VPAVGB xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2) | Average packed unsigned byte integers from xmm2, and xmm3/m128 with rounding and store to xmm1 under writemask k1.
EVEX.256.66.0F.WIG E0 /r VPAVGB ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2) | Average packed unsigned byte integers from ymm2, and ymm3/m256 with rounding and store to ymm1 under writemask k1.
EVEX.512.66.0F.WIG E0 /r VPAVGB zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1 (see note 2) | Average packed unsigned byte integers from zmm2, and zmm3/m512 with rounding and store to zmm1 under writemask k1.
EVEX.128.66.0F.WIG E3 /r VPAVGW xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2) | Average packed unsigned word integers from xmm2, xmm3/m128 with rounding to xmm1 under writemask k1.
EVEX.256.66.0F.WIG E3 /r VPAVGW ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2) | Average packed unsigned word integers from ymm2, ymm3/m256 with rounding to ymm1 under writemask k1.
EVEX.512.66.0F.WIG E3 /r VPAVGW zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1 (see note 2) | Average packed unsigned word integers from zmm2, zmm3/m512 with rounding to zmm1 under writemask k1.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2B, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD average of the packed unsigned integers from the source operand (second operand) and the
destination operand (first operand), and stores the results in the destination operand. For each corresponding pair
of data elements in the first and second operands, the elements are added together, a 1 is added to the temporary
sum, and that result is shifted right one bit position.
The (V)PAVGB instruction operates on packed unsigned bytes and the (V)PAVGW instruction operates on packed
unsigned words.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand can be an MMX technology register.
128-bit Legacy SSE version: The first source operand is an XMM register. The second operand can be an XMM
register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the
upper bits (MAXVL-1:128) of the corresponding register destination are unmodified.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register or a 512-bit memory location. The destination operand is a ZMM register.
VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register. The second source operand
is a YMM register or a 256-bit memory location. The destination operand is a YMM register.
VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand
is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-
1:128) of the corresponding register destination are zeroed.
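For illustration, a minimal C sketch of the rounding average using the SSE2 intrinsic that compilers map to (V)PAVGB; because the temporary sum is 9 bits wide, values near 255 do not overflow (assumes an SSE2-capable compiler; names are illustrative):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set1_epi8((char)250);
    __m128i b = _mm_set1_epi8((char)253);

    __m128i avg = _mm_avg_epu8(a, b);   /* PAVGB: (250 + 253 + 1) >> 1 = 252 */

    uint8_t out[16];
    _mm_storeu_si128((__m128i *)out, avg);
    printf("%u\n", out[0]);             /* prints 252 */
    return 0;
}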
Operation
PAVGB (With 64-bit Operands)
DEST[7:0] := (SRC[7:0] + DEST[7:0] + 1) >> 1; (* Temp sum before shifting is 9 bits *)
(* Repeat operation performed for bytes 2 through 7 *)
DEST[63:56] := (SRC[63:56] + DEST[63:56] + 1) >> 1;
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a carry-less multiplication of two quadwords, selected from the first source and second source operand
according to the value of the immediate byte. Bits 4 and 0 are used to select which 64-bit half of each operand to
use according to Table 4-13; other bits of the immediate byte are ignored.
The EVEX encoded form of this instruction does not support memory fault suppression.
The first source operand and the destination operand are the same and must be a ZMM/YMM/XMM register. The
second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. Bits (VL_MAX-
1:128) of the corresponding YMM destination register remain unchanged.
Compilers and assemblers may implement the following pseudo-op syntax to simplify programming and emit the
required encoding for imm8.
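For illustration, a minimal C sketch of a carry-less multiply using the _mm_clmulepi64_si128 intrinsic; imm8 = 00H selects the low quadword of each source. In GF(2)[x], (x + 1)*(x + 1) = x^2 + 1, so carry-less 3*3 yields 5 rather than 9 (assumes a compiler exposing the PCLMULQDQ intrinsics, e.g., built with -mpclmul on GCC/Clang; names are illustrative):

#include <emmintrin.h>
#include <wmmintrin.h>   /* PCLMULQDQ intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set_epi64x(0, 3);   /* low qword = 0b11, i.e., x + 1 */
    __m128i b = _mm_set_epi64x(0, 3);   /* low qword = 0b11, i.e., x + 1 */

    /* imm8 bit 0 selects the qword of the first source, bit 4 the qword of the second. */
    __m128i prod = _mm_clmulepi64_si128(a, b, 0x00);

    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(prod);
    printf("carry-less 3*3 = %llu\n", (unsigned long long)lo);   /* prints 5 */
    return 0;
}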
Operation
define PCLMUL128(X,Y): // helper function
FOR i := 0 to 63:
TMP [ i ] := X[ 0 ] and Y[ i ]
FOR j := 1 to i:
TMP [ i ] := TMP [ i ] xor (X[ j ] and Y[ i - j ])
DEST[ i ] := TMP[ i ]
FOR i := 64 to 126:
TMP [ i ] := 0
FOR j := i - 63 to 63:
TMP [ i ] := TMP [ i ] xor (X[ j ] and Y[ i - j ])
DEST[ i ] := TMP[ i ]
DEST[127] := 0;
RETURN DEST // 128b vector
Other Exceptions
See Table 2-21, “Type 4 Class Exception Conditions,” additionally:
#UD If VEX.L = 1.
EVEX-encoded: See Table 2-52, “Type E4NF Class Exception Conditions.”
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W0 76 /r VPCMPEQD k1 {k2}, xmm2, xmm3/m128/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see note 2) | Compare Equal between int32 vector xmm2 and int32 vector xmm3/m128/m32bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.256.66.0F.W0 76 /r VPCMPEQD k1 {k2}, ymm2, ymm3/m256/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see note 2) | Compare Equal between int32 vector ymm2 and int32 vector ymm3/m256/m32bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.512.66.0F.W0 76 /r VPCMPEQD k1 {k2}, zmm2, zmm3/m512/m32bcst | C | V/V | AVX512F OR AVX10.1 (see note 2) | Compare Equal between int32 vectors in zmm2 and zmm3/m512/m32bcst, and set destination k1 according to the comparison results under writemask k2.
EVEX.128.66.0F.WIG 74 /r VPCMPEQB k1 {k2}, xmm2, xmm3/m128 | D | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2) | Compare packed bytes in xmm3/m128 and xmm2 for equality and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare for equality of the packed bytes, words, or doublewords in the destination operand (first
operand) and the source operand (second operand). If a pair of data elements is equal, the corresponding data
element in the destination operand is set to all 1s; otherwise, it is set to all 0s.
The (V)PCMPEQB instruction compares the corresponding bytes in the destination and source operands; the
(V)PCMPEQW instruction compares the corresponding words in the destination and source operands; and the
(V)PCMPEQD instruction compares the corresponding doublewords in the destination and source operands.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand can be an MMX technology register.
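For illustration, the all-1s/all-0s element masks produced by (V)PCMPEQB combine naturally with PMOVMSKB to locate a byte within a 16-byte block; a minimal C sketch assuming an SSE2-capable compiler and a GCC/Clang builtin for the bit scan (names are illustrative):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    const char buf[17] = "find the comma,x";          /* 16 data bytes plus terminator */
    __m128i block  = _mm_loadu_si128((const __m128i *)buf);
    __m128i needle = _mm_set1_epi8(',');

    __m128i eq   = _mm_cmpeq_epi8(block, needle);     /* equal bytes become FFH, others 00H */
    int     mask = _mm_movemask_epi8(eq);             /* PMOVMSKB packs the byte sign bits  */

    if (mask != 0)
        printf("first ',' at offset %d\n", __builtin_ctz(mask));   /* prints 14 */
    return 0;
}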
Operation
PCMPEQB (With 64-bit Operands)
IF DEST[7:0] = SRC[7:0]
THEN DEST[7:0] := FFH;
ELSE DEST[7:0] := 0; FI;
(* Continue comparison of 2nd through 7th bytes in DEST and SRC *)
IF DEST[63:56] = SRC[63:56]
THEN DEST[63:56] := FFH;
ELSE DEST[63:56] := 0; FI;
VPCMPEQB (EVEX Encoded Versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k2[j] OR *no writemask*
THEN
/* signed comparison */
CMP := SRC1[i+7:i] == SRC2[i+7:i];
IF CMP = TRUE
THEN DEST[j] := 1;
ELSE DEST[j] := 0; FI;
ELSE DEST[j] := 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0
Flags Affected
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPCMPEQD, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPCMPEQB/W, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs an SIMD compare for equality of the packed quadwords in the destination operand (first operand) and the
source operand (second operand). If a pair of data elements is equal, the corresponding data element in the desti-
nation is set to all 1s; otherwise, it is set to 0s.
128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
first source and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM destination
register remain unchanged.
VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The
first source and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM register
are zeroed.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.
EVEX encoded VPCMPEQQ: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand (first operand) is a mask register updated
according to the writemask k2.
Flags Affected
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPCMPEQQ, see Table 2-51, “Type E4 Class Exception Conditions.”
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W0 66 /r VPCMPGTD k1 {k2}, xmm2, xmm3/m128/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see note 2) | Compare Greater between int32 vector xmm2 and int32 vector xmm3/m128/m32bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.256.66.0F.W0 66 /r VPCMPGTD k1 {k2}, ymm2, ymm3/m256/m32bcst | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (see note 2) | Compare Greater between int32 vector ymm2 and int32 vector ymm3/m256/m32bcst, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.512.66.0F.W0 66 /r VPCMPGTD k1 {k2}, zmm2, zmm3/m512/m32bcst | C | V/V | AVX512F OR AVX10.1 (see note 2) | Compare Greater between int32 elements in zmm2 and zmm3/m512/m32bcst, and set destination k1 according to the comparison results under writemask k2.
EVEX.128.66.0F.WIG 64 /r VPCMPGTB k1 {k2}, xmm2, xmm3/m128 | D | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2) | Compare packed signed byte integers in xmm2 and xmm3/m128 for greater than, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
EVEX.256.66.0F.WIG 64 /r VPCMPGTB k1 {k2}, ymm2, ymm3/m256 | D | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2) | Compare packed signed byte integers in ymm2 and ymm3/m256 for greater than, and set vector mask k1 to reflect the zero/nonzero status of each element of the result, under writemask.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs an SIMD signed compare for the greater value of the packed byte, word, or doubleword integers in the
destination operand (first operand) and the source operand (second operand). If a data element in the destination
operand is greater than the corresponding data element in the source operand, the corresponding data element in
the destination operand is set to all 1s; otherwise, it is set to all 0s.
The PCMPGTB instruction compares the corresponding signed byte integers in the destination and source oper-
ands; the PCMPGTW instruction compares the corresponding signed word integers in the destination and source
operands; and the PCMPGTD instruction compares the corresponding signed doubleword integers in the destina-
tion and source operands.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instructions: The source operand can be an MMX technology register or a 64-bit memory location. The
destination operand can be an MMX technology register.
128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
first source operand and destination operand are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM
destination register remain unchanged.
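For illustration, a minimal C sketch showing that the comparison is signed: the byte value 80H compares as -128, so it is not greater than 1 (assumes an SSE2-capable compiler; names are illustrative):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set1_epi8((char)0x80);   /* -128 as a signed byte (128 unsigned) */
    __m128i b = _mm_set1_epi8(1);

    __m128i gt   = _mm_cmpgt_epi8(a, b);     /* PCMPGTB: signed compare, result is all zeros */
    int     mask = _mm_movemask_epi8(gt);

    printf("signed 0x80 > 1 ? %s\n", mask ? "yes" : "no");   /* prints "no" */
    return 0;
}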
Operation
PCMPGTB (With 64-bit Operands)
IF DEST[7:0] > SRC[7:0]
THEN DEST[7:0] := FFH;
ELSE DEST[7:0] := 0; FI;
(* Continue comparison of 2nd through 7th bytes in DEST and SRC *)
IF DEST[63:56] > SRC[63:56]
THEN DEST[63:56] := FFH;
ELSE DEST[63:56] := 0; FI;
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPCMPGTD, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPCMPGTB/W, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs an SIMD signed compare for the packed quadwords in the destination operand (first operand) and the
source operand (second operand). If the data element in the first (destination) operand is greater than the
corresponding element in the second (source) operand, the corresponding data element in the destination is set
to all 1s; otherwise, it is set to 0s.
128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The
first source operand and destination operand are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM
destination register remain unchanged.
VEX.128 encoded version: The second source operand can be an XMM register or a 128-bit memory location. The
first source operand and destination operand are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM
register are zeroed.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM register
or a 256-bit memory location. The destination operand is a YMM register.
EVEX encoded VPCMPGTD/Q: The first source operand (second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand (first operand) is a mask register updated
according to the writemask k2.
Flags Affected
None.
NOTES:
1. In 64-bit mode, VEX.W1 is ignored for VPEXTRB (similar to legacy REX.W=1 prefix in PEXTRB).
2. VEX.W/EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Extract a byte/dword/qword integer value from the source XMM register at a byte/dword/qword offset determined
from imm8[3:0]. The destination can be a register or byte/dword/qword memory location. If the destination is a
register, the upper bits of the register are zero extended.
In legacy non-VEX encoded version and if the destination operand is a register, the default operand size in 64-bit
mode for PEXTRB/PEXTRD is 64 bits; the bits above the least significant byte/dword data are filled with zeros.
PEXTRQ is not encodable in non-64-bit modes and requires REX.W in 64-bit mode.
Note: In VEX.128 encoded versions, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the
instruction will #UD. In EVEX.128 encoded versions, EVEX.vvvv is reserved and must be 1111b, EVEX.L’L must be 0, otherwise the instruction will #UD.
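For illustration, a minimal C sketch using the SSE4.1 intrinsics that compilers map to PEXTRB and PEXTRD; the immediate selects the element, and the result is zero extended into the integer return value (assumes an SSE4.1-capable compiler; names are illustrative):

#include <smmintrin.h>   /* SSE4.1 intrinsics */
#include <stdio.h>

int main(void)
{
    __m128i v = _mm_setr_epi32(0x03020100, 0x07060504, 0x0B0A0908, 0x0F0E0D0C);

    int byte5  = _mm_extract_epi8(v, 5);    /* PEXTRB: byte selected by imm8[3:0] */
    int dword2 = _mm_extract_epi32(v, 2);   /* PEXTRD: dword selected by imm8[1:0] */

    printf("byte 5 = 0x%02X, dword 2 = 0x%08X\n", byte5, dword2);
    /* prints: byte 5 = 0x05, dword 2 = 0x0B0A0908 */
    return 0;
}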
Operation
CASE of
PEXTRB: SEL := COUNT[3:0];
TEMP := (Src >> SEL*8) AND FFH;
IF (DEST = Mem8)
THEN
Mem8 := TEMP[7:0];
ELSE IF (64-Bit Mode and 64-bit register selected)
THEN
R64[7:0] := TEMP[7:0];
r64[63:8] := ZERO_FILL; };
ELSE
R32[7:0] := TEMP[7:0];
r32[31:8] := ZERO_FILL; };
FI;
PEXTRD:SEL := COUNT[1:0];
TEMP := (Src >> SEL*32) AND FFFF_FFFFH;
DEST := TEMP;
PEXTRQ: SEL := COUNT[0];
TEMP := (Src >> SEL*64);
DEST := TEMP;
ESAC;
VPEXTRD/VPEXTRQ
IF (64-Bit Mode and 64-bit dest operand)
THEN
Src_Offset := imm8[0]
r64/m64 := (Src >> Src_Offset * 64)
ELSE
Src_Offset := imm8[1:0]
r32/m32 := ((Src >> Src_Offset *32) AND 0FFFFFFFFh);
FI
VPEXTRB ( dest=m8)
SRC_Offset := imm8[3:0]
Mem8 := (Src >> Src_Offset*8)
VPEXTRB ( dest=reg)
IF (64-Bit Mode )
THEN
SRC_Offset := imm8[3:0]
DEST[7:0] := ((Src >> Src_Offset*8) AND 0FFh)
DEST[63:8] := ZERO_FILL;
ELSE
SRC_Offset := imm8[3:0];
DEST[7:0] := ((Src >> Src_Offset*8) AND 0FFh);
DEST[31:8] := ZERO_FILL;
FI
Flags Affected
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”
Additionally:
#UD If VEX.L = 1 or EVEX.L’L > 0.
If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. In 64-bit mode, VEX.W1 is ignored for VPEXTRW (similar to legacy REX.W=1 prefix in PEXTRW).
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Copies the word in the source operand (second operand) specified by the count operand (third operand) to the
destination operand (first operand). The source operand can be an MMX technology register or an XMM register.
The destination operand can be the low word of a general-purpose register or a 16-bit memory address. The count
operand is an 8-bit immediate. When specifying a word location in an MMX technology register, the 2 least-signifi-
cant bits of the count operand specify the location; for an XMM register, the 3 least-significant bits specify the loca-
tion. The content of the destination register above bit 16 is cleared (set to all 0s).
In 64-bit mode, using a REX prefix in the form of REX.R permits this instruction to access additional registers
(XMM8-XMM15, R8-15). If the destination operand is a general-purpose register, the default operand size is 64-bits
in 64-bit mode.
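For illustration, a minimal C sketch using the SSE2 intrinsic that compilers map to PEXTRW; imm8 selects one of the eight words of the XMM source, and the word is zero extended into the integer result (assumes an SSE2-capable compiler; names are illustrative):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    __m128i v = _mm_setr_epi16(0x1111, 0x2222, 0x3333, 0x4444,
                               0x5555, 0x6666, 0x7777, (short)0x8888);

    int w6 = _mm_extract_epi16(v, 6);   /* PEXTRW: word 6, zero extended into an int */
    printf("word 6 = 0x%04X\n", w6);    /* prints 0x7777 */
    return 0;
}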
Operation
IF (DEST = Mem16)
THEN
SEL := COUNT[2:0];
TEMP := (Src >> SEL*16) AND FFFFH;
Mem16 := TEMP[15:0];
ELSE IF (64-Bit Mode and destination is a general-purpose register)
THEN
FOR (PEXTRW instruction with 64-bit source operand)
{ SEL := COUNT[1:0];
TEMP := (SRC >> (SEL ∗ 16)) AND FFFFH;
r64[15:0] := TEMP[15:0];
r64[63:16] := ZERO_FILL; };
FOR (PEXTRW instruction with 128-bit source operand)
{ SEL := COUNT[2:0];
TEMP := (SRC >> (SEL ∗ 16)) AND FFFFH;
r64[15:0] := TEMP[15:0];
r64[63:16] := ZERO_FILL; }
ELSE
FOR (PEXTRW instruction with 64-bit source operand)
{ SEL := COUNT[1:0];
TEMP := (SRC >> (SEL ∗ 16)) AND FFFFH;
r32[15:0] := TEMP[15:0];
r32[31:16] := ZERO_FILL; };
FOR (PEXTRW instruction with 128-bit source operand)
{ SEL := COUNT[2:0];
TEMP := (SRC >> (SEL ∗ 16)) AND FFFFH;
r32[15:0] := TEMP[15:0];
r32[31:16] := ZERO_FILL; };
FI;
FI;
VPEXTRW ( dest=m16)
SRC_Offset := imm8[2:0]
Mem16 := (Src >> Src_Offset*16)
VPEXTRW ( dest=reg)
IF (64-Bit Mode )
THEN
SRC_Offset := imm8[2:0]
DEST[15:0] := ((Src >> Src_Offset*16) AND 0FFFFh)
DEST[63:16] := ZERO_FILL;
ELSE
SRC_Offset := imm8[2:0]
DEST[15:0] := ((Src >> Src_Offset*16) AND 0FFFFh)
DEST[31:16] := ZERO_FILL;
FI
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”
Additionally:
#UD If VEX.L = 1 or EVEX.L’L > 0.
If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
NOTES:
1. In 64-bit mode, VEX.W1 is ignored for VPINSRB (similar to legacy REX.W=1 prefix with PINSRB).
2. VEX.W/EVEX.W in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Copies a byte/dword/qword from the source operand (second operand) and inserts it in the destination operand
(first operand) at the location specified with the count operand (third operand). (The other elements in the desti-
nation register are left untouched.) The source operand can be a general-purpose register or a memory location.
(When the source operand is a general-purpose register, PINSRB copies the low byte of the register.) The destination operand is an XMM register.
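For illustration, a minimal C sketch using the SSE4.1 intrinsics that compilers map to PINSRD and PINSRB; only the selected element of the destination is replaced (assumes an SSE4.1-capable compiler; names are illustrative):

#include <smmintrin.h>   /* SSE4.1 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    __m128i v = _mm_setzero_si128();

    v = _mm_insert_epi32(v, (int)0xDEADBEEF, 1);   /* PINSRD: replace dword 1 only */
    v = _mm_insert_epi8(v, 0x7F, 12);              /* PINSRB: replace byte 12 only */

    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, v);
    printf("%08X %08X %08X %08X\n", out[0], out[1], out[2], out[3]);
    /* prints: 00000000 DEADBEEF 00000000 0000007F */
    return 0;
}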
Operation
CASE OF
PINSRB: SEL := COUNT[3:0];
MASK := (0FFH << (SEL * 8));
TEMP := ((SRC[7:0] << (SEL * 8)) AND MASK);
PINSRD: SEL := COUNT[1:0];
MASK := (0FFFFFFFFH << (SEL * 32));
TEMP := ((SRC << (SEL * 32)) AND MASK);
PINSRQ: SEL := COUNT[0];
MASK := (0FFFFFFFFFFFFFFFFH << (SEL * 64));
TEMP := ((SRC << (SEL * 64)) AND MASK);
ESAC;
DEST := ((DEST AND NOT MASK) OR TEMP);
Flags Affected
None.
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
66 0F C4 /r ib PINSRW xmm, r32/m16, imm8 | A | V/V | SSE2 | Move the low word of r32 or from m16 into xmm at the word position specified by imm8.
VEX.128.66.0F.W0 C4 /r ib VPINSRW xmm1, xmm2, r32/m16, imm8 | B | V (see note 2)/V | AVX | Insert the word from r32/m16 at the offset indicated by imm8 into the value from xmm2 and store result in xmm1.
EVEX.128.66.0F.WIG C4 /r ib VPINSRW xmm1, xmm2, r32/m16, imm8 | C | V/V | AVX512BW OR AVX10.1 (see note 3) | Insert the word from r32/m16 at the offset indicated by imm8 into the value from xmm2 and store result in xmm1.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. In 64-bit mode, VEX.W1 is ignored for VPINSRW (similar to legacy REX.W=1 prefix in PINSRW).
3. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the
processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported
vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Three operand MMX and SSE instructions:
Copies a word from the source operand and inserts it in the destination operand at the location specified with the
count operand. (The other words in the destination register are left untouched.) The source operand can be a
general-purpose register or a 16-bit memory location. (When the source operand is a general-purpose register, the
low word of the register is copied.) The destination operand can be an MMX technology register or an XMM register.
The count operand is an 8-bit immediate. When specifying a word location in an MMX technology register, the 2
least-significant bits of the count operand specify the location; for an XMM register, the 3 least-significant bits
specify the location.
Bits (MAXVL-1:128) of the corresponding YMM destination register remain unchanged.
Four operand AVX and AVX-512 instructions:
Combines a word from the first source operand with the second source operand, and inserts it in the destination
operand at the location specified with the count operand. The second source operand can be a general-purpose
register or a 16-bit memory location. (When the source operand is a general-purpose register, the low word of the
register is copied.) The first source and destination operands are XMM registers. The count operand is an 8-bit
immediate. When specifying a word location, the 3 least-significant bits specify the location.
Bits (MAXVL-1:128) of the destination YMM register are zeroed. VEX.L/EVEX.L’L must be 0, otherwise the instruc-
tion will #UD.
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-22, “Type 5 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-59, “Type E9NF Class Exception Conditions.”
Additionally:
#UD If VEX.L = 1 or EVEX.L’L > 0.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
(V)PMADDUBSW multiplies vertically each unsigned byte of the destination operand (first operand) with the corre-
sponding signed byte of the source operand (second operand), producing intermediate signed 16-bit integers.
Each adjacent pair of signed words is added and the saturated result is packed to the destination operand. For
example, the lowest-order bytes (bits 7-0) in the source and destination operands are multiplied and the interme-
diate signed word result is added with the corresponding intermediate result from the 2nd lowest-order bytes (bits
15-8) of the operands; the sign-saturated result is stored in the lowest word of the destination register (15-0). The
same operation is performed on the other pairs of adjacent bytes. Both operands can be MMX registers or XMM
registers. When the source operand is a 128-bit memory operand, the operand must be aligned on a 16-byte
boundary or a general-protection exception (#GP) will be generated.
In 64-bit mode and not encoded with VEX/EVEX, use the REX prefix to access XMM8-XMM15.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.
VEX.128 and EVEX.128 encoded versions: The first source and destination operands are XMM registers. The
second source operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding
destination register are zeroed.
VEX.256 and EVEX.256 encoded versions: The second source operand can be a YMM register or a 256-bit memory
location. The first source and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding
ZMM register are zeroed.
EVEX.512 encoded version: The second source operand can be a ZMM register or a 512-bit memory location. The
first source and destination operands are ZMM registers.
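For illustration, a minimal C sketch using the SSSE3 intrinsic that compilers map to (V)PMADDUBSW; the first operand supplies the unsigned bytes and the second the signed bytes, and each adjacent product pair is summed into a signed word (assumes an SSSE3-capable compiler; names are illustrative):

#include <tmmintrin.h>   /* SSSE3 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    __m128i pixels  = _mm_set1_epi8((char)200);   /* unsigned bytes: sixteen 200s */
    __m128i weights = _mm_setr_epi8(3, -1, 3, -1, 3, -1, 3, -1,
                                    3, -1, 3, -1, 3, -1, 3, -1);   /* signed bytes */

    /* Each word of acc = 200*3 + 200*(-1) = 400, with signed saturation if needed. */
    __m128i acc = _mm_maddubs_epi16(pixels, weights);

    int16_t out[8];
    _mm_storeu_si128((__m128i *)out, acc);
    printf("%d\n", out[0]);   /* prints 400 */
    return 0;
}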
Operation
PMADDUBSW (With 64-bit Operands)
DEST[15:0] := SaturateToSignedWord(SRC[15:8]*DEST[15:8] + SRC[7:0]*DEST[7:0]);
DEST[31:16] := SaturateToSignedWord(SRC[31:24]*DEST[31:24] + SRC[23:16]*DEST[23:16]);
DEST[47:32] := SaturateToSignedWord(SRC[47:40]*DEST[47:40] + SRC[39:32]*DEST[39:32]);
DEST[63:48] := SaturateToSignedWord(SRC[63:56]*DEST[63:56] + SRC[55:48]*DEST[55:48]);
VPMADDUBSW (EVEX Encoded Versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := SaturateToSignedWord(SRC2[i+15:i+8]* SRC1[i+15:i+8] + SRC2[i+7:i]*SRC1[i+7:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] = 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalents
VPMADDUBSW __m512i _mm512_maddubs_epi16( __m512i a, __m512i b);
VPMADDUBSW __m512i _mm512_mask_maddubs_epi16(__m512i s, __mmask32 k, __m512i a, __m512i b);
VPMADDUBSW __m512i _mm512_maskz_maddubs_epi16( __mmask32 k, __m512i a, __m512i b);
VPMADDUBSW __m256i _mm256_mask_maddubs_epi16(__m256i s, __mmask16 k, __m256i a, __m256i b);
VPMADDUBSW __m256i _mm256_maskz_maddubs_epi16( __mmask16 k, __m256i a, __m256i b);
VPMADDUBSW __m128i _mm_mask_maddubs_epi16(__m128i s, __mmask8 k, __m128i a, __m128i b);
VPMADDUBSW __m128i _mm_maskz_maddubs_epi16( __mmask8 k, __m128i a, __m128i b);
PMADDUBSW __m64 _mm_maddubs_pi16 (__m64 a, __m64 b)
(V)PMADDUBSW __m128i _mm_maddubs_epi16 (__m128i a, __m128i b)
VPMADDUBSW __m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
PMADDWD—Multiply and Add Packed Integers
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
NP 0F F5 /r (see note 1) PMADDWD mm, mm/m64 | A | V/V | MMX | Multiply the packed words in mm by the packed words in mm/m64, add adjacent doubleword results, and store in mm.
66 0F F5 /r PMADDWD xmm1, xmm2/m128 | A | V/V | SSE2 | Multiply the packed word integers in xmm1 by the packed word integers in xmm2/m128, add adjacent doubleword results, and store in xmm1.
VEX.128.66.0F.WIG F5 /r VPMADDWD xmm1, xmm2, xmm3/m128 | B | V/V | AVX | Multiply the packed word integers in xmm2 by the packed word integers in xmm3/m128, add adjacent doubleword results, and store in xmm1.
VEX.256.66.0F.WIG F5 /r VPMADDWD ymm1, ymm2, ymm3/m256 | B | V/V | AVX2 | Multiply the packed word integers in ymm2 by the packed word integers in ymm3/m256, add adjacent doubleword results, and store in ymm1.
EVEX.128.66.0F.WIG F5 /r VPMADDWD xmm1 {k1}{z}, xmm2, xmm3/m128 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2) | Multiply the packed word integers in xmm2 by the packed word integers in xmm3/m128, add adjacent doubleword results, and store in xmm1 under writemask k1.
EVEX.256.66.0F.WIG F5 /r VPMADDWD ymm1 {k1}{z}, ymm2, ymm3/m256 | C | V/V | (AVX512VL AND AVX512BW) OR AVX10.1 (see note 2) | Multiply the packed word integers in ymm2 by the packed word integers in ymm3/m256, add adjacent doubleword results, and store in ymm1 under writemask k1.
EVEX.512.66.0F.WIG F5 /r VPMADDWD zmm1 {k1}{z}, zmm2, zmm3/m512 | C | V/V | AVX512BW OR AVX10.1 (see note 2) | Multiply the packed word integers in zmm2 by the packed word integers in zmm3/m512, add adjacent doubleword results, and store in zmm1 under writemask k1.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Multiplies the individual signed words of the destination operand (first operand) by the corresponding signed words
of the source operand (second operand), producing temporary signed, doubleword results. The adjacent double-
word results are then summed and stored in the destination operand. For example, the corresponding low-order
words (15-0) and (31-16) in the source and destination operands are multiplied by one another and the double-
word results are added together and stored in the low doubleword of the destination register (31-0). The same
operation is performed on the other pairs of words in the operands.
[Figure: PMADDWD execution model using 64-bit operands. The words of SRC (X3..X0) and DEST (Y3..Y0) are multiplied pairwise into temporaries (X3*Y3, X2*Y2, X1*Y1, X0*Y0), and adjacent products are summed into the doublewords of DEST.]
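For illustration, a minimal C sketch using the SSE2 intrinsic that compilers map to (V)PMADDWD; each doubleword of the result is the sum of two adjacent word products (assumes an SSE2-capable compiler; names are illustrative):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8);
    __m128i b = _mm_setr_epi16(10, 20, 30, 40, 50, 60, 70, 80);

    /* Each dword = a[2k]*b[2k] + a[2k+1]*b[2k+1] (signed products). */
    __m128i dw = _mm_madd_epi16(a, b);

    int32_t out[4];
    _mm_storeu_si128((__m128i *)out, dw);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);   /* 50 250 610 1130 */
    return 0;
}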
Operation
PMADDWD (With 64-bit Operands)
DEST[31:0] := (DEST[15:0] ∗ SRC[15:0]) + (DEST[31:16] ∗ SRC[31:16]);
DEST[63:32] := (DEST[47:32] ∗ SRC[47:32]) + (DEST[63:48] ∗ SRC[63:48]);
VPMADDWD (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := (SRC2[i+31:i+16]* SRC1[i+31:i+16]) + (SRC2[i+15:i]*SRC1[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] = 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed signed byte, word, dword or qword integers in the second source operand
and the first source operand and returns the maximum value for each pair of integers to the destination operand.
Legacy SSE version PMAXSW: The source operand can be an MMX technology register or a 64-bit memory location.
The destination operand can be an MMX technology register.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destina-
tion register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register are zeroed.
VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding destination
register are zeroed.
Operation
PMAXSW (64-bit Operands)
IF DEST[15:0] > SRC[15:0] THEN
DEST[15:0] := DEST[15:0];
ELSE
DEST[15:0] := SRC[15:0]; FI;
(* Repeat operation for 2nd and 3rd words in source and destination operands *)
IF DEST[63:48] > SRC[63:48] THEN
DEST[63:48] := DEST[63:48];
ELSE
DEST[63:48] := SRC[63:48]; FI;
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPMAXSD/Q, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPMAXSB/W, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed unsigned byte or word integers in the second source operand and the first
source operand and returns the maximum value for each pair of integers to the destination operand.
Legacy SSE version PMAXUB: The source operand can be an MMX technology register or a 64-bit memory location.
The destination operand can be an MMX technology register.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register are zeroed.
VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register; The second source operand is a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is conditionally updated
based on writemask k1.
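For illustration, a minimal C sketch contrasting the unsigned byte maximum (PMAXUB) with the signed byte maximum (PMAXSB, SSE4.1): the byte 80H is 128 unsigned but -128 signed, so the two operations pick different elements (assumes an SSE4.1-capable compiler; names are illustrative):

#include <smmintrin.h>   /* SSE4.1 intrinsics (for _mm_max_epi8) */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set1_epi8((char)0x80);   /* 128 unsigned, -128 signed */
    __m128i b = _mm_set1_epi8(0x10);         /*  16 under either view     */

    __m128i umax = _mm_max_epu8(a, b);       /* PMAXUB (unsigned): picks 80H */
    __m128i smax = _mm_max_epi8(a, b);       /* PMAXSB (signed):   picks 10H */

    uint8_t u[16], s[16];
    _mm_storeu_si128((__m128i *)u, umax);
    _mm_storeu_si128((__m128i *)s, smax);
    printf("unsigned max: 0x%02X  signed max: 0x%02X\n", u[0], s[0]);
    return 0;
}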
Operation
PMAXUB (64-bit Operands)
IF DEST[7:0] > SRC[7:0] THEN
DEST[7:0] := DEST[7:0];
ELSE
DEST[7:0] := SRC[7:0]; FI;
(* Repeat operation for 2nd through 7th bytes in source and destination operands *)
IF DEST[63:56] > SRC[63:56] THEN
DEST[63:56] := DEST[63:56];
ELSE
DEST[63:56] := SRC[63:56]; FI;
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed unsigned dword or qword integers in the second source operand and the
first source operand and returns the maximum value for each pair of integers to the destination operand.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.
Operation
PMAXUD (128-bit Legacy SSE Version)
IF DEST[31:0] >SRC[31:0] THEN
DEST[31:0] := DEST[31:0];
ELSE
DEST[31:0] := SRC[31:0]; FI;
(* Repeat operation for the 2nd and 3rd doublewords in the source and destination operands *)
IF DEST[127:96] >SRC[127:96] THEN
DEST[127:96] := DEST[127:96];
ELSE
DEST[127:96] := SRC[127:96]; FI;
DEST[MAXVL-1:128] (Unmodified)
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed signed byte, word, or dword integers in the second source operand and
the first source operand and returns the minimum value for each pair of integers to the destination operand.
Legacy SSE version PMINSW: The source operand can be an MMX technology register or a 64-bit memory location.
The destination operand can be an MMX technology register.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register are zeroed.
VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register; The second source operand is a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is conditionally updated
based on writemask k1.
Operation
PMINSW (64-bit Operands)
IF DEST[15:0] < SRC[15:0] THEN
DEST[15:0] := DEST[15:0];
ELSE
DEST[15:0] := SRC[15:0]; FI;
(* Repeat operation for 2nd and 3rd words in source and destination operands *)
IF DEST[63:48] < SRC[63:48] THEN
DEST[63:48] := DEST[63:48];
ELSE
DEST[63:48] := SRC[63:48]; FI;
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#MF (64-bit operations only) If there is a pending x87 FPU exception.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed signed dword or qword integers in the second source operand and the first
source operand and returns the minimum value for each pair of integers to the destination operand.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.
Operation
PMINSD (128-bit Legacy SSE Version)
IF DEST[31:0] < SRC[31:0] THEN
DEST[31:0] := DEST[31:0];
ELSE
DEST[31:0] := SRC[31:0]; FI;
(* Repeat operation for the 2nd and 3rd doublewords in the source and destination operands *)
IF DEST[127:96] < SRC[127:96] THEN
DEST[127:96] := DEST[127:96];
ELSE
DEST[127:96] := SRC[127:96]; FI;
DEST[MAXVL-1:128] (Unmodified)
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed unsigned byte or word integers in the second source operand and the first
source operand and returns the minimum value for each pair of integers to the destination operand.
Legacy SSE version PMINUB: The source operand can be an MMX technology register or a 64-bit memory location.
The destination operand can be an MMX technology register.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register are zeroed.
VEX.256 encoded version: The second source operand can be a YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register; The second source operand is a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is conditionally updated
based on writemask k1.
Operation
PMINUB (64-bit Operands)
IF DEST[7:0] < SRC[7:0] THEN
DEST[7:0] := DEST[7:0];
ELSE
DEST[7:0] := SRC[7:0]; FI;
(* Repeat operation for 2nd through 7th bytes in source and destination operands *)
IF DEST[63:56] < SRC[63:56] THEN
DEST[63:56] := DEST[63:56];
ELSE
DEST[63:56] := SRC[63:56]; FI;
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed unsigned dword/qword integers in the second source operand and the first
source operand and returns the minimum value for each pair of integers to the destination operand.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination
register remain unchanged.
Operation
PMINUD (128-bit Legacy SSE Version)
PMINUD instruction for 128-bit operands:
IF DEST[31:0] < SRC[31:0] THEN
DEST[31:0] := DEST[31:0];
ELSE
DEST[31:0] := SRC[31:0]; FI;
(* Repeat operation for the 2nd and 3rd doublewords in the source and destination operands *)
IF DEST[127:96] < SRC[127:96] THEN
DEST[127:96] := DEST[127:96];
ELSE
DEST[127:96] := SRC[127:96]; FI;
DEST[MAXVL-1:128] (Unmodified)
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Legacy and VEX encoded versions: Packed byte, word, or dword integers in the low bytes of the source operand
(second operand) are sign extended to word, dword, or quadword integers and stored as packed signed integers in
the destination operand.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
VEX.128 and EVEX.128 encoded versions: Bits (MAXVL-1:128) of the corresponding destination register are
zeroed.
VEX.256 and EVEX.256 encoded versions: Bits (MAXVL-1:256) of the corresponding destination register are
zeroed.
EVEX encoded versions: Packed byte, word or dword integers starting from the low bytes of the source operand
(second operand) are sign extended to word, dword or quadword integers and stored to the destination operand
under the writemask. The destination register is XMM, YMM or ZMM Register.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise the instruction will #UD.
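For illustration, a minimal C sketch contrasting sign extension (PMOVSXBW) with zero extension (PMOVZXBW, described in the following instruction page): the byte F0H becomes -16 when sign extended and 240 when zero extended (assumes an SSE4.1-capable compiler; names are illustrative):

#include <smmintrin.h>   /* SSE4.1 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    __m128i src = _mm_set1_epi8((char)0xF0);   /* each byte: -16 signed, 240 unsigned */

    __m128i sx = _mm_cvtepi8_epi16(src);       /* PMOVSXBW: sign extend low 8 bytes */
    __m128i zx = _mm_cvtepu8_epi16(src);       /* PMOVZXBW: zero extend low 8 bytes */

    int16_t s[8], z[8];
    _mm_storeu_si128((__m128i *)s, sx);
    _mm_storeu_si128((__m128i *)z, zx);
    printf("sign-extended: %d  zero-extended: %d\n", s[0], z[0]);   /* -16  240 */
    return 0;
}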
Operation
Packed_Sign_Extend_BYTE_to_WORD(DEST, SRC)
DEST[15:0] := SignExtend(SRC[7:0]);
DEST[31:16] := SignExtend(SRC[15:8]);
DEST[47:32] := SignExtend(SRC[23:16]);
DEST[63:48] := SignExtend(SRC[31:24]);
DEST[79:64] := SignExtend(SRC[39:32]);
DEST[95:80] := SignExtend(SRC[47:40]);
DEST[111:96] := SignExtend(SRC[55:48]);
DEST[127:112] := SignExtend(SRC[63:56]);
Packed_Sign_Extend_BYTE_to_QWORD(DEST, SRC)
DEST[63:0] := SignExtend(SRC[7:0]);
DEST[127:64] := SignExtend(SRC[15:8]);
Packed_Sign_Extend_WORD_to_DWORD(DEST, SRC)
DEST[31:0] := SignExtend(SRC[15:0]);
DEST[63:32] := SignExtend(SRC[31:16]);
DEST[95:64] := SignExtend(SRC[47:32]);
DEST[127:96] := SignExtend(SRC[63:48]);
Packed_Sign_Extend_WORD_to_QWORD(DEST, SRC)
DEST[63:0] := SignExtend(SRC[15:0]);
DEST[127:64] := SignExtend(SRC[31:16]);
Packed_Sign_Extend_DWORD_to_QWORD(DEST, SRC)
DEST[63:0] := SignExtend(SRC[31:0]);
DEST[127:64] := SignExtend(SRC[63:32]);
PMOVSXBW
Packed_Sign_Extend_BYTE_to_WORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)
PMOVSXBD
Packed_Sign_Extend_BYTE_to_DWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)
PMOVSXBQ
Packed_Sign_Extend_BYTE_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)
PMOVSXWD
Packed_Sign_Extend_WORD_to_DWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)
PMOVSXWQ
Packed_Sign_Extend_WORD_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)
PMOVSXDQ
Packed_Sign_Extend_DWORD_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Legacy, VEX, and EVEX encoded versions: Packed byte, word, or dword integers starting from the low bytes of the
source operand (second operand) are zero extended to word, dword, or quadword integers and stored in the
destination operand.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
VEX.128 encoded version: Bits (MAXVL-1:128) of the corresponding destination register are zeroed.
VEX.256 encoded version: Bits (MAXVL-1:256) of the corresponding destination register are zeroed.
EVEX encoded versions: Packed dword integers starting from the low bytes of the source operand (second
operand) are zero extended to quadword integers and stored to the destination operand under the writemask. The
destination register is XMM, YMM or ZMM Register.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
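An illustrative, non-normative C sketch of the zero-extending counterpart, assuming SSE4.1; _mm_cvtepu8_epi16 compiles to PMOVZXBW, so byte value FFH widens to 255 rather than -1:

#include <smmintrin.h>   /* SSE4.1 */

/* Zero extend the low 8 unsigned bytes of src to 8 unsigned words. */
static __m128i widen_u8_to_u16(__m128i src)
{
    return _mm_cvtepu8_epi16(src);   /* PMOVZXBW xmm, xmm; 0xFF -> 0x00FF (255) */
}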
Operation
Packed_Zero_Extend_BYTE_to_WORD(DEST, SRC)
DEST[15:0] := ZeroExtend(SRC[7:0]);
DEST[31:16] := ZeroExtend(SRC[15:8]);
DEST[47:32] := ZeroExtend(SRC[23:16]);
DEST[63:48] := ZeroExtend(SRC[31:24]);
DEST[79:64] := ZeroExtend(SRC[39:32]);
DEST[95:80] := ZeroExtend(SRC[47:40]);
DEST[111:96] := ZeroExtend(SRC[55:48]);
DEST[127:112] := ZeroExtend(SRC[63:56]);
Packed_Zero_Extend_BYTE_to_QWORD(DEST, SRC)
DEST[63:0] := ZeroExtend(SRC[7:0]);
DEST[127:64] := ZeroExtend(SRC[15:8]);
Packed_Zero_Extend_WORD_to_DWORD(DEST, SRC)
DEST[31:0] := ZeroExtend(SRC[15:0]);
DEST[63:32] := ZeroExtend(SRC[31:16]);
DEST[95:64] := ZeroExtend(SRC[47:32]);
DEST[127:96] := ZeroExtend(SRC[63:48]);
Packed_Zero_Extend_WORD_to_QWORD(DEST, SRC)
DEST[63:0] := ZeroExtend(SRC[15:0]);
DEST[127:64] := ZeroExtend(SRC[31:16]);
Packed_Zero_Extend_DWORD_to_QWORD(DEST, SRC)
DEST[63:0] := ZeroExtend(SRC[31:0]);
DEST[127:64] := ZeroExtend(SRC[63:32]);
PMOVZXBW
Packed_Zero_Extend_BYTE_to_WORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)
PMOVZXBD
Packed_Zero_Extend_BYTE_to_DWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)
PMOVZXBQ
Packed_Zero_Extend_BYTE_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)
PMOVZXWD
Packed_Zero_Extend_WORD_to_DWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)
PMOVZXWQ
Packed_Zero_Extend_WORD_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)
PMOVZXDQ
Packed_Zero_Extend_DWORD_to_QWORD(DEST[127:0], SRC[127:0])
DEST[MAXVL-1:128] (Unmodified)
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Multiplies packed signed doubleword integers in the even-numbered (zero-based reference) elements of the first
source operand with the packed signed doubleword integers in the corresponding elements of the second source
operand and stores packed signed quadword results in the destination operand.
128-bit Legacy SSE version: The input signed doubleword integers are taken from the even-numbered elements of
the source operands, i.e., the first (low) and third doubleword element. For 128-bit memory operands, 128 bits are
fetched from memory, but only the first and third doublewords are used in the computation. The first source
operand and the destination XMM operand are the same. The second source operand can be an XMM register or 128-
bit memory location. Bits (MAXVL-1:128) of the corresponding destination register remain unchanged.
VEX.128 encoded version: The input signed doubleword integers are taken from the even-numbered elements of
the source operands, i.e., the first (low) and third doubleword element. For 128-bit memory operands, 128 bits are
fetched from memory, but only the first and third doublewords are used in the computation. The first source
operand and the destination operand are XMM registers. The second source operand can be an XMM register or
128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination register are zeroed.
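An illustrative, non-normative C sketch of the 128-bit form, assuming SSE4.1; _mm_mul_epi32 compiles to PMULDQ and multiplies the first and third signed doublewords into two signed quadwords:

#include <smmintrin.h>   /* SSE4.1 */
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_setr_epi32(-2, 999, 3, 999);   /* elements 1 and 3 are ignored */
    __m128i b = _mm_setr_epi32( 5, 999, 7, 999);
    __m128i p = _mm_mul_epi32(a, b);               /* PMULDQ xmm, xmm */

    long long out[2];
    _mm_storeu_si128((__m128i *)out, p);
    printf("%lld %lld\n", out[0], out[1]);         /* prints: -10 21 */
    return 0;
}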
Operation
VPMULDQ (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := SignExtend64( SRC1[i+31:i]) * SignExtend64( SRC2[31:0])
ELSE DEST[i+63:i] := SignExtend64( SRC1[i+31:i]) * SignExtend64( SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
PMULHRSW multiplies vertically each signed 16-bit integer from the destination operand (first operand) with the
corresponding signed 16-bit integer of the source operand (second operand), producing intermediate, signed 32-
bit integers. Each intermediate 32-bit integer is truncated to the 18 most significant bits. Rounding is always
performed by adding 1 to the least significant bit of the 18-bit intermediate result. The final result is obtained by
selecting the 16 bits immediately to the right of the most significant bit of each 18-bit intermediate result and
packed to the destination operand.
When the source operand is a 128-bit memory operand, the operand must be aligned on a 16-byte boundary or a
general-protection exception (#GP) will be generated.
In 64-bit mode and not encoded with VEX/EVEX, use the REX prefix to access XMM8-XMM15 registers.
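An illustrative, non-normative C sketch pairing a scalar reference of the rounding described above with the SSSE3 intrinsic that compiles to PMULHRSW; it assumes the compiler performs arithmetic right shifts on negative values, as mainstream compilers do:

#include <tmmintrin.h>   /* SSSE3 */
#include <stdint.h>
#include <stdio.h>

static int16_t mulhrs_scalar(int16_t a, int16_t b)
{
    int32_t t = ((int32_t)a * (int32_t)b) >> 14;   /* keep the 18 most significant bits */
    return (int16_t)((t + 1) >> 1);                /* add the rounding bit, take bits [16:1] */
}

int main(void)
{
    __m128i x = _mm_set1_epi16(12345);
    __m128i y = _mm_set1_epi16(-23456);
    __m128i r = _mm_mulhrs_epi16(x, y);            /* PMULHRSW xmm, xmm */

    int16_t out[8];
    _mm_storeu_si128((__m128i *)out, r);
    printf("intrinsic=%d scalar=%d\n", out[0], mulhrs_scalar(12345, -23456));
    return 0;
}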
Operation
PMULHRSW (With 64-bit Operands)
temp0[31:0] = INT32 ((DEST[15:0] * SRC[15:0]) >>14) + 1;
temp1[31:0] = INT32 ((DEST[31:16] * SRC[31:16]) >>14) + 1;
temp2[31:0] = INT32 ((DEST[47:32] * SRC[47:32]) >> 14) + 1;
temp3[31:0] = INT32 ((DEST[63:48] * SRC[63:48]) >> 14) + 1;
DEST[15:0] = temp0[16:1];
DEST[31:16] = temp1[16:1];
DEST[47:32] = temp2[16:1];
DEST[63:48] = temp3[16:1];
DEST[15:0] := temp0[16:1]
DEST[31:16] := temp1[16:1]
DEST[47:32] := temp2[16:1]
DEST[63:48] := temp3[16:1]
DEST[79:64] := temp4[16:1]
DEST[95:80] := temp5[16:1]
DEST[111:96] := temp6[16:1]
DEST[127:112] := temp7[16:1]
DEST[143:128] := temp8[16:1]
DEST[159:144] := temp9[16:1]
DEST[175:160] := temp10[16:1]
DEST[191:176] := temp11[16:1]
DEST[207:192] := temp12[16:1]
DEST[223:208] := temp13[16:1]
DEST[239:224] := temp14[16:1]
DEST[255:240] := temp15[16:1]
DEST[MAXVL-1:256] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD unsigned multiply of the packed unsigned word integers in the destination operand (first operand)
and the source operand (second operand), and stores the high 16 bits of each 32-bit intermediate result in the
destination operand. (Figure 4-12 shows this operation when using 64-bit operands.)
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The source operand can be an MMX technology register or a 64-bit memory
location. The destination operand is an MMX technology register.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destina-
tion register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are
zeroed. VEX.L must be 0, otherwise the instruction will #UD.
VEX.256 encoded version: The second source operand can be an YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM
register conditionally updated with writemask k1.
Figure 4-12. PMULHUW and PMULHW Instruction Operation Using 64-bit Operands
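An illustrative, non-normative C sketch of the 128-bit form, assuming SSE2; _mm_mulhi_epu16 compiles to PMULHUW, and the scalar helper shows the per-element computation:

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Packed form: high 16 bits of each unsigned 32-bit product. */
static __m128i mulhi_u16(__m128i a, __m128i b)
{
    return _mm_mulhi_epu16(a, b);   /* PMULHUW xmm, xmm */
}

/* Scalar reference for one element. */
static uint16_t mulhi_u16_scalar(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}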
Operation
PMULHUW (With 64-bit Operands)
TEMP0[31:0] := DEST[15:0] ∗ SRC[15:0]; (* Unsigned multiplication *)
TEMP1[31:0] := DEST[31:16] ∗ SRC[31:16];
TEMP2[31:0] := DEST[47:32] ∗ SRC[47:32];
TEMP3[31:0] := DEST[63:48] ∗ SRC[63:48];
DEST[15:0] := TEMP0[31:16];
DEST[31:16] := TEMP1[31:16];
DEST[47:32] := TEMP2[31:16];
DEST[63:48] := TEMP3[31:16];
DEST[127:112] := TEMP7[31:16];
DEST[MAXVL-1:256] := 0
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
PMULHW—Multiply Packed Signed Integers and Store High Result
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F E5 /r1 A V/V MMX Multiply the packed signed word integers in mm1
PMULHW mm, mm/m64 register and mm2/m64, and store the high 16
bits of the results in mm1.
66 0F E5 /r A V/V SSE2 Multiply the packed signed word integers in
PMULHW xmm1, xmm2/m128 xmm1 and xmm2/m128, and store the high 16
bits of the results in xmm1.
VEX.128.66.0F.WIG E5 /r B V/V AVX Multiply the packed signed word integers in
VPMULHW xmm1, xmm2, xmm3/m128 xmm2 and xmm3/m128, and store the high 16
bits of the results in xmm1.
VEX.256.66.0F.WIG E5 /r B V/V AVX2 Multiply the packed signed word integers in
VPMULHW ymm1, ymm2, ymm3/m256 ymm2 and ymm3/m256, and store the high 16
bits of the results in ymm1.
EVEX.128.66.0F.WIG E5 /r C V/V (AVX512VL AND Multiply the packed signed word integers in
VPMULHW xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm2 and xmm3/m128, and store the high 16
xmm3/m128 AVX10.12 bits of the results in xmm1 under writemask k1.
EVEX.256.66.0F.WIG E5 /r C V/V (AVX512VL AND Multiply the packed signed word integers in
VPMULHW ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm2 and ymm3/m256, and store the high 16
ymm3/m256 AVX10.12 bits of the results in ymm1 under writemask k1.
EVEX.512.66.0F.WIG E5 /r C V/V AVX512BW Multiply the packed signed word integers in
VPMULHW zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm2 and zmm3/m512, and store the high 16
zmm3/m512 bits of the results in zmm1 under writemask k1.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD signed multiply of the packed signed word integers in the destination operand (first operand) and
the source operand (second operand), and stores the high 16 bits of each intermediate 32-bit result in the destina-
tion operand. (Figure 4-12 shows this operation when using 64-bit operands.)
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The source operand can be an MMX technology register or a 64-bit memory
location. The destination operand is an MMX technology register.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destina-
tion register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are
zeroed. VEX.L must be 0, otherwise the instruction will #UD.
VEX.256 encoded version: The second source operand can be an YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM
register conditionally updated with writemask k1.
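An illustrative, non-normative C sketch of the signed high-multiply, assuming SSE2; _mm_mulhi_epi16 compiles to PMULHW:

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Scalar reference: high 16 bits of the signed 32-bit product. */
static int16_t mulhi_s16_scalar(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * (int32_t)b) >> 16);
}

/* Packed form for one 128-bit register. */
static __m128i mulhi_s16(__m128i a, __m128i b)
{
    return _mm_mulhi_epi16(a, b);   /* PMULHW xmm, xmm */
}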
Operation
PMULHW (With 64-bit Operands)
TEMP0[31:0] := DEST[15:0] ∗ SRC[15:0]; (* Signed multiplication *)
TEMP1[31:0] := DEST[31:16] ∗ SRC[31:16];
TEMP2[31:0] := DEST[47:32] ∗ SRC[47:32];
TEMP3[31:0] := DEST[63:48] ∗ SRC[63:48];
DEST[15:0] := TEMP0[31:16];
DEST[31:16] := TEMP1[31:16];
DEST[47:32] := TEMP2[31:16];
DEST[63:48] := TEMP3[31:16];
DEST[63:48] := TEMP3[31:16]
DEST[79:64] := TEMP4[31:16]
DEST[95:80] := TEMP5[31:16]
DEST[111:96] := TEMP6[31:16]
DEST[127:112] := TEMP7[31:16]
DEST[MAXVL-1:128] := 0
PMULHW (EVEX Encoded Versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN
temp[31:0] := SRC1[i+15:i] * SRC2[i+15:i]
DEST[i+15:i] := temp[31:16]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
PMULLD/PMULLQ—Multiply Packed Integers and Store Low Result
Opcode/ Op/En 64/32 bit CPUID Feature Description
Instruction Mode Flag
Support
66 0F 38 40 /r A V/V SSE4_1 Multiply the packed dword signed integers in xmm1 and
PMULLD xmm1, xmm2/m128 xmm2/m128 and store the low 32 bits of each product
in xmm1.
VEX.128.66.0F38.WIG 40 /r B V/V AVX Multiply the packed dword signed integers in xmm2 and
VPMULLD xmm1, xmm2, xmm3/m128 and store the low 32 bits of each product
xmm3/m128 in xmm1.
VEX.256.66.0F38.WIG 40 /r B V/V AVX2 Multiply the packed dword signed integers in ymm2 and
VPMULLD ymm1, ymm2, ymm3/m256 and store the low 32 bits of each product
ymm3/m256 in ymm1.
EVEX.128.66.0F38.W0 40 /r C V/V (AVX512VL AND Multiply the packed dword signed integers in xmm2 and
VPMULLD xmm1 {k1}{z}, xmm2, AVX512F) OR xmm3/m128/m32bcst and store the low 32 bits of
xmm3/m128/m32bcst AVX10.11 each product in xmm1 under writemask k1.
EVEX.256.66.0F38.W0 40 /r C V/V (AVX512VL AND Multiply the packed dword signed integers in ymm2 and
VPMULLD ymm1 {k1}{z}, ymm2, AVX512F) OR ymm3/m256/m32bcst and store the low 32 bits of
ymm3/m256/m32bcst AVX10.11 each product in ymm1 under writemask k1.
EVEX.512.66.0F38.W0 40 /r C V/V AVX512F Multiply the packed dword signed integers in zmm2 and
VPMULLD zmm1 {k1}{z}, zmm2, OR AVX10.11 zmm3/m512/m32bcst and store the low 32 bits of
zmm3/m512/m32bcst each product in zmm1 under writemask k1.
EVEX.128.66.0F38.W1 40 /r C V/V (AVX512VL AND Multiply the packed qword signed integers in xmm2 and
VPMULLQ xmm1 {k1}{z}, xmm2, AVX512DQ) OR xmm3/m128/m64bcst and store the low 64 bits of
xmm3/m128/m64bcst AVX10.11 each product in xmm1 under writemask k1.
EVEX.256.66.0F38.W1 40 /r C V/V (AVX512VL AND Multiply the packed qword signed integers in ymm2 and
VPMULLQ ymm1 {k1}{z}, ymm2, AVX512DQ) OR ymm3/m256/m64bcst and store the low 64 bits of
ymm3/m256/m64bcst AVX10.11 each product in ymm1 under writemask k1.
EVEX.512.66.0F38.W1 40 /r C V/V AVX512DQ Multiply the packed qword signed integers in zmm2 and
VPMULLQ zmm1 {k1}{z}, zmm2, OR AVX10.11 zmm3/m512/m64bcst and store the low 64 bits of
zmm3/m512/m64bcst each product in zmm1 under writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD signed multiply of the packed signed dword/qword integers from each element of the first source
operand with the corresponding element in the second source operand. The low 32/64 bits of each 64/128-bit
intermediate results are stored to the destination operand.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding ZMM destina-
tion register remain unchanged.
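An illustrative, non-normative C sketch of the dword form, assuming SSE4.1; because only the low 32 bits of each product are kept, _mm_mullo_epi32 (PMULLD) behaves as an ordinary wrapping 32-bit multiply per element:

#include <smmintrin.h>   /* SSE4.1 */

/* Low 32 bits of each 64-bit product; equivalent to a wrapping 32-bit multiply. */
static __m128i mullo_s32(__m128i a, __m128i b)
{
    return _mm_mullo_epi32(a, b);   /* PMULLD xmm, xmm */
}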
Operation
VPMULLQ (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN Temp[127:0] := SRC1[i+63:i] * SRC2[63:0]
ELSE Temp[127:0] := SRC1[i+63:i] * SRC2[i+63:i]
FI;
DEST[i+63:i] := Temp[63:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
DEST[31:0] := Temp0[31:0]
DEST[63:32] := Temp1[31:0]
DEST[95:64] := Temp2[31:0]
DEST[127:96] := Temp3[31:0]
DEST[159:128] := Temp4[31:0]
DEST[191:160] := Temp5[31:0]
DEST[223:192] := Temp6[31:0]
DEST[255:224] := Temp7[31:0]
DEST[MAXVL-1:256] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD signed multiply of the packed signed word integers in the destination operand (first operand) and
the source operand (second operand), and stores the low 16 bits of each intermediate 32-bit result in the destina-
tion operand. (Figure 4-12 shows this operation when using 64-bit operands.)
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The source operand can be an MMX technology register or a 64-bit memory
location. The destination operand is an MMX technology register.
128-bit Legacy SSE version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destina-
tion register remain unchanged.
VEX.128 encoded version: The first source and destination operands are XMM registers. The second source
operand is an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are
zeroed. VEX.L must be 0, otherwise the instruction will #UD.
VEX.256 encoded version: The second source operand can be an YMM register or a 256-bit memory location. The
first source and destination operands are YMM registers.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand is a
ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination operand is conditionally updated
based on writemask k1.
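An illustrative, non-normative C sketch of the 128-bit form, assuming SSE2; _mm_mullo_epi16 compiles to PMULLW and keeps the low 16 bits of each product:

#include <emmintrin.h>   /* SSE2 */

/* Low 16 bits of each 32-bit product; equivalent to a wrapping 16-bit multiply. */
static __m128i mullo_s16(__m128i a, __m128i b)
{
    return _mm_mullo_epi16(a, b);   /* PMULLW xmm, xmm */
}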
Operation
PMULLW (With 64-bit Operands)
TEMP0[31:0] := DEST[15:0] ∗ SRC[15:0]; (* Signed multiplication *)
TEMP1[31:0] := DEST[31:16] ∗ SRC[31:16];
TEMP2[31:0] := DEST[47:32] ∗ SRC[47:32];
TEMP3[31:0] := DEST[63:48] ∗ SRC[63:48];
DEST[15:0] := TEMP0[15:0];
DEST[31:16] := TEMP1[15:0];
DEST[47:32] := TEMP2[15:0];
DEST[63:48] := TEMP3[15:0];
VPMULLW (VEX.128 Encoded Version)
Temp0[31:0] := SRC1[15:0] * SRC2[15:0]
Temp1[31:0] := SRC1[31:16] * SRC2[31:16]
Temp2[31:0] := SRC1[47:32] * SRC2[47:32]
Temp3[31:0] := SRC1[63:48] * SRC2[63:48]
Temp4[31:0] := SRC1[79:64] * SRC2[79:64]
Temp5[31:0] := SRC1[95:80] * SRC2[95:80]
Temp6[31:0] := SRC1[111:96] * SRC2[111:96]
Temp7[31:0] := SRC1[127:112] * SRC2[127:112]
DEST[15:0] := Temp0[15:0]
DEST[31:16] := Temp1[15:0]
DEST[47:32] := Temp2[15:0]
DEST[63:48] := Temp3[15:0]
DEST[79:64] := Temp4[15:0]
DEST[95:80] := Temp5[15:0]
DEST[111:96] := Temp6[15:0]
DEST[127:112] := Temp7[15:0]
DEST[MAXVL-1:128] := 0
Flags Affected
None.
SIMD Floating-Point Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
PMULUDQ—Multiply Packed Unsigned Doubleword Integers
Opcode/ Op/ 64/32 bit CPUID Feature Description
Instruction En Mode Flag
Support
NP 0F F4 /r1 A V/V SSE2 Multiply unsigned doubleword integer in mm1 by
PMULUDQ mm1, mm2/m64 unsigned doubleword integer in mm2/m64, and
store the quadword result in mm1.
66 0F F4 /r A V/V SSE2 Multiply packed unsigned doubleword integers in
PMULUDQ xmm1, xmm2/m128 xmm1 by packed unsigned doubleword integers
in xmm2/m128, and store the quadword results
in xmm1.
VEX.128.66.0F.WIG F4 /r B V/V AVX Multiply packed unsigned doubleword integers in
VPMULUDQ xmm1, xmm2, xmm3/m128 xmm2 by packed unsigned doubleword integers
in xmm3/m128, and store the quadword results
in xmm1.
VEX.256.66.0F.WIG F4 /r B V/V AVX2 Multiply packed unsigned doubleword integers in
VPMULUDQ ymm1, ymm2, ymm3/m256 ymm2 by packed unsigned doubleword integers
in ymm3/m256, and store the quadword results
in ymm1.
EVEX.128.66.0F.W1 F4 /r C V/V (AVX512VL AND Multiply packed unsigned doubleword integers in
VPMULUDQ xmm1 {k1}{z}, xmm2, AVX512F) OR xmm2 by packed unsigned doubleword integers
xmm3/m128/m64bcst AVX10.12 in xmm3/m128/m64bcst, and store the
quadword results in xmm1 under writemask k1.
EVEX.256.66.0F.W1 F4 /r C V/V (AVX512VL AND Multiply packed unsigned doubleword integers in
VPMULUDQ ymm1 {k1}{z}, ymm2, AVX512F) OR ymm2 by packed unsigned doubleword integers
ymm3/m256/m64bcst AVX10.12 in ymm3/m256/m64bcst, and store the
quadword results in ymm1 under writemask k1.
EVEX.512.66.0F.W1 F4 /r C V/V AVX512F Multiply packed unsigned doubleword integers in
VPMULUDQ zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm2 by packed unsigned doubleword integers
zmm3/m512/m64bcst in zmm3/m512/m64bcst, and store the
quadword results in zmm1 under writemask k1.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Multiplies the first operand (destination operand) by the second operand (source operand) and stores the result in
the destination operand.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
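An illustrative, non-normative C sketch of the 128-bit form, assuming SSE2; _mm_mul_epu32 compiles to PMULUDQ and multiplies the first and third unsigned doublewords into two unsigned quadwords:

#include <emmintrin.h>   /* SSE2 */
#include <stdio.h>

int main(void)
{
    /* Only unsigned dwords 0 and 2 of each source are used; bit pattern -1 is 4294967295 unsigned. */
    __m128i a = _mm_setr_epi32(-1, 0, 10, 0);
    __m128i b = _mm_setr_epi32( 2, 0,  7, 0);
    __m128i p = _mm_mul_epu32(a, b);                  /* PMULUDQ xmm, xmm */

    unsigned long long out[2];
    _mm_storeu_si128((__m128i *)out, p);
    printf("%llu %llu\n", out[0], out[1]);            /* prints: 8589934590 70 */
    return 0;
}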
Operation
PMULUDQ (With 64-Bit Operands)
DEST[63:0] := DEST[31:0] ∗ SRC[31:0];
Flags Affected
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a bitwise logical OR operation on the source operand (second operand) and the destination operand (first
operand) and stores the result in the destination operand. Each bit of the result is set to 1 if either or both of the
corresponding bits of the first and second operands are 1; otherwise, it is set to 0.
Operation
POR (64-bit Operand)
DEST := DEST OR SRC
Flags Affected
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
IF VL >= 256
(* Repeat operation for bytes 16 through 31*)
TEMP31 := ABS(SRC1[255:248] - SRC2[255:248])
DEST[143:128] := SUM(TEMP16:TEMP23)
DEST[191:144] := 000000000000H
DEST[207:192] := SUM(TEMP24:TEMP31)
DEST[255:208] := 000000000000H
FI;
IF VL >= 512
(* Repeat operation for bytes 32 through 63*)
TEMP63 := ABS(SRC1[511:504] - SRC2[511:504])
DEST[271:256] := SUM(TEMP32:TEMP39)
DEST[319:272] := 000000000000H
DEST[335:320] := SUM(TEMP40:TEMP47)
DEST[383:336] := 000000000000H
DEST[399:384] := SUM(TEMP48:TEMP55)
DEST[447:400] := 000000000000H
DEST[463:448] := SUM(TEMP56:TEMP63)
DEST[511:464] := 000000000000H
FI;
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
PSHUFB performs in-place shuffles of bytes in the destination operand (the first operand) according to the shuffle
control mask in the source operand (the second operand). The instruction permutes the data in the destination
operand, leaving the shuffle mask unaffected. If the most significant bit (bit[7]) of each byte of the shuffle control
mask is set, then constant zero is written in the result byte. Each byte in the shuffle control mask forms an index
to permute the corresponding byte in the destination operand. The value of each index is the least significant 4 bits
(128-bit operation) or 3 bits (64-bit operation) of the shuffle control byte. When the source operand is a 128-bit
memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will
be generated.
In 64-bit mode and not encoded with VEX/EVEX, use the REX prefix to access XMM8-XMM15 registers.
Legacy SSE version 64-bit operand: Both operands can be MMX registers.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-
1:128) of the corresponding YMM destination register remain unchanged.
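An illustrative, non-normative C sketch, assuming SSSE3; _mm_shuffle_epi8 compiles to PSHUFB. The example reverses the byte order and forces one result byte to zero by setting bit 7 of its index:

#include <tmmintrin.h>   /* SSSE3 */
#include <stdio.h>

int main(void)
{
    __m128i data = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
    /* Index 80H has bit 7 set, so that result byte is forced to zero;
       the remaining indices reverse the byte order. */
    __m128i mask = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                 7, 6, 5, 4, 3, 2, 1, (char)0x80);
    __m128i r = _mm_shuffle_epi8(data, mask);        /* PSHUFB xmm, xmm */

    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, r);
    for (int i = 0; i < 16; i++)
        printf("%u ", out[i]);                       /* 15 14 ... 2 1, then 0 (masked byte) */
    printf("\n");
    return 0;
}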
Operation
PSHUFB (With 64-bit Operands)
TEMP := DEST
for i = 0 to 7 {
if (SRC[(i * 8)+7] = 1 ) then
DEST[(i*8)+7...(i*8)+0] := 0;
else
index[2..0] := SRC[(i*8)+2 .. (i*8)+0];
DEST[(i*8)+7...(i*8)+0] := TEMP[(index*8+7)..(index*8+0)];
endif;
}
PSHUFB (With 128-bit Operands)
TEMP := DEST
for i = 0 to 15 {
if (SRC[(i * 8)+7] = 1 ) then
DEST[(i*8)+7..(i*8)+0] := 0;
else
index[3..0] := SRC[(i*8)+3 .. (i*8)+0];
DEST[(i*8)+7..(i*8)+0] := TEMP[(index*8+7)..(index*8+0)];
endif
}
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Copies doublewords from source operand (second operand) and inserts them in the destination operand (first
operand) at the locations selected with the order operand (third operand). Figure 4-16 shows the operation of the
256-bit VPSHUFD instruction and the encoding of the order operand. Each 2-bit field in the order operand selects
the contents of one doubleword location within a 128-bit lane and copies it to the target element in the destination
operand. For example, bits 0 and 1 of the order operand target the first doubleword element in the low and high
128-bit lanes of the destination operand for 256-bit VPSHUFD. The encoded value of bits 1:0 of the order operand
(see the field encoding in Figure 4-16) determines which doubleword element (from the respective 128-bit lane) of
the source operand will be copied to doubleword 0 of the destination operand.
For 128-bit operation, only the low 128-bit lane is operative. The source operand can be an XMM register or a
128-bit memory location. The destination operand is an XMM register. The order operand is an 8-bit immediate.
Note that this instruction permits a doubleword in the source operand to be copied to more than one doubleword
location in the destination operand.
Figure 4-16. 256-bit VPSHUFD Instruction Operation
The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM
register. The order operand is an 8-bit immediate. Note that this instruction permits a doubleword in the source
operand to be copied to more than one doubleword location in the destination operand.
In 64-bit mode and not encoded in VEX/EVEX, using REX.R permits this instruction to access XMM8-XMM15.
128-bit Legacy SSE version: Bits (MAXVL-1:128) of the corresponding YMM destination register remain
unchanged.
VEX.128 encoded version: The source operand can be an XMM register or a 128-bit memory location. The destina-
tion operand is an XMM register. Bits (MAXVL-1:128) of the corresponding ZMM register are zeroed.
VEX.256 encoded version: The source operand can be an YMM register or a 256-bit memory location. The destina-
tion operand is an YMM register. Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed. Bits (255:128)
of the destination store the shuffled results of the upper 16 bytes of the source operand using the immediate byte
as the order operand.
EVEX encoded version: The source operand can be an ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a
ZMM/YMM/XMM register updated according to the writemask.
Each 128-bit lane of the destination stores the shuffled results of the respective lane of the source operand using
the immediate byte as the order operand.
Note: EVEX.vvvv and VEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
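An illustrative, non-normative C sketch, assuming SSE2; _mm_shuffle_epi32 compiles to PSHUFD, and the _MM_SHUFFLE macro builds the 8-bit order operand from four 2-bit selectors:

#include <emmintrin.h>   /* SSE2 */
#include <stdio.h>

int main(void)
{
    __m128i v = _mm_setr_epi32(10, 20, 30, 40);
    /* Order operand F0H: destination dwords 0,1 take source dword 0; dwords 2,3 take source dword 3. */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 0, 0));  /* PSHUFD xmm, xmm, imm8 */

    int out[4];
    _mm_storeu_si128((__m128i *)out, r);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);    /* prints: 10 10 40 40 */
    return 0;
}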
Operation
PSHUFD (128-bit Legacy SSE Version)
DEST[31:0] := (SRC >> (ORDER[1:0] * 32))[31:0];
DEST[63:32] := (SRC >> (ORDER[3:2] * 32))[31:0];
DEST[95:64] := (SRC >> (ORDER[5:4] * 32))[31:0];
DEST[127:96] := (SRC >> (ORDER[7:6] * 32))[31:0];
DEST[MAXVL-1:128] (Unmodified)
Flags Affected
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv ≠ 1111B or EVEX.vvvv ≠ 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Copies words from the high quadword of a 128-bit lane of the source operand and inserts them in the high quad-
word of the destination operand at word locations (of the respective lane) selected with the immediate operand.
This 256-bit operation is similar to the in-lane operation used by the 256-bit VPSHUFD instruction, which is illus-
trated in Figure 4-16. For 128-bit operation, only the low 128-bit lane is operative. Each 2-bit field in the immediate
operand selects the contents of one word location in the high quadword of the destination operand. The binary
encodings of the immediate operand fields select words (0, 1, 2, or 3) from the high quadword of the source
operand to be copied to the destination operand. The low quadword of the source operand is copied to the low
quadword of the destination operand, for each 128-bit lane.
Note that this instruction permits a word in the high quadword of the source operand to be copied to more than one
word location in the high quadword of the destination operand.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
128-bit Legacy SSE version: The destination operand is an XMM register. The source operand can be an XMM
register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destination register remain
unchanged.
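An illustrative, non-normative C sketch, assuming SSE2; _mm_shufflehi_epi16 compiles to PSHUFHW and leaves the low quadword unchanged:

#include <emmintrin.h>   /* SSE2 */

/* Reverse the four words of the high quadword; the low quadword passes through
   unchanged (PSHUFHW xmm, xmm, imm8). */
static __m128i reverse_high_words(__m128i v)
{
    return _mm_shufflehi_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));
}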
Operation
PSHUFHW (128-bit Legacy SSE Version)
DEST[63:0] := SRC[63:0]
DEST[79:64] := (SRC >> (imm[1:0] *16))[79:64]
DEST[95:80] := (SRC >> (imm[3:2] * 16))[79:64]
DEST[111:96] := (SRC >> (imm[5:4] * 16))[79:64]
DEST[127:112] := (SRC >> (imm[7:6] * 16))[79:64]
DEST[MAXVL-1:128] (Unmodified)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i];
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B, or EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Copies words from the low quadword of a 128-bit lane of the source operand and inserts them in the low quadword
of the destination operand at word locations (of the respective lane) selected with the immediate operand. The
256-bit operation is similar to the in-lane operation used by the 256-bit VPSHUFD instruction, which is illustrated
in Figure 4-16. For 128-bit operation, only the low 128-bit lane is operative. Each 2-bit field in the immediate
operand selects the contents of one word location in the low quadword of the destination operand. The binary
encodings of the immediate operand fields select words (0, 1, 2 or 3) from the low quadword of the source operand
to be copied to the destination operand. The high quadword of the source operand is copied to the high quadword
of the destination operand, for each 128-bit lane.
Note that this instruction permits a word in the low quadword of the source operand to be copied to more than one
word location in the low quadword of the destination operand.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
128-bit Legacy SSE version: The destination operand is an XMM register. The source operand can be an XMM
register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding YMM destination register remain
unchanged.
VEX.128 encoded version: The destination operand is an XMM register. The source operand can be an XMM register
or a 128-bit memory location. Bits (MAXVL-1:128) of the destination YMM register are zeroed.
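An illustrative, non-normative C sketch, assuming SSE2; _mm_shufflelo_epi16 compiles to PSHUFLW and leaves the high quadword unchanged:

#include <emmintrin.h>   /* SSE2 */

/* Broadcast word 0 across the low quadword; the high quadword is copied
   unchanged (PSHUFLW xmm, xmm, imm8). */
static __m128i splat_low_word0(__m128i v)
{
    return _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 0, 0, 0));
}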
Operation
PSHUFLW (128-bit Legacy SSE Version)
DEST[15:0] := (SRC >> (imm[1:0] *16))[15:0]
DEST[31:16] := (SRC >> (imm[3:2] * 16))[15:0]
DEST[47:32] := (SRC >> (imm[5:4] * 16))[15:0]
DEST[63:48] := (SRC >> (imm[7:6] * 16))[15:0]
DEST[127:64] := SRC[127:64]
DEST[MAXVL-1:128] (Unmodified)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i];
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If VEX.vvvv != 1111B, or EVEX.vvvv != 1111B.
VEX.128.66.0F.WIG 73 /7 ib B V/V AVX Shift xmm2 left by imm8 bytes while shifting
VPSLLDQ xmm1, xmm2, imm8 in 0s and store result in xmm1.
VEX.256.66.0F.WIG 73 /7 ib B V/V AVX2 Shift ymm2 left by imm8 bytes while shifting
VPSLLDQ ymm1, ymm2, imm8 in 0s and store result in ymm1.
EVEX.128.66.0F.WIG 73 /7 ib C V/V (AVX512VL AND Shift xmm2/m128 left by imm8 bytes while
VPSLLDQ xmm1,xmm2/ m128, imm8 AVX512BW) OR shifting in 0s and store result in xmm1.
AVX10.11
EVEX.256.66.0F.WIG 73 /7 ib C V/V (AVX512VL AND Shift ymm2/m256 left by imm8 bytes while
VPSLLDQ ymm1, ymm2/m256, imm8 AVX512BW) OR shifting in 0s and store result in ymm1.
AVX10.11
EVEX.512.66.0F.WIG 73 /7 ib C V/V AVX512BW Shift zmm2/m512 left by imm8 bytes while
VPSLLDQ zmm1, zmm2/m512, imm8 OR AVX10.11 shifting in 0s and store result in zmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Shifts the destination operand (first operand) to the left by the number of bytes specified in the count operand
(second operand). The empty low-order bytes are cleared (set to all 0s). If the value specified by the count operand
is greater than 15, the destination operand is set to all 0s. The count operand is an 8-bit immediate.
128-bit Legacy SSE version: The source and destination operands are the same. Bits (MAXVL-1:128) of the corre-
sponding YMM destination register remain unchanged.
VEX.128 encoded version: The source and destination operands are XMM registers. Bits (MAXVL-1:128) of the
destination YMM register are zeroed.
VEX.256 encoded version: The source operand is YMM register. The destination operand is an YMM register. Bits
(MAXVL-1:256) of the corresponding ZMM register are zeroed. The count operand applies to both the low and high
128-bit lanes.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The destination operand is a ZMM/YMM/XMM register. The count operand applies to each 128-bit lane.
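An illustrative, non-normative C sketch, assuming SSE2; _mm_slli_si128 compiles to PSLLDQ and requires a compile-time-constant byte count:

#include <emmintrin.h>   /* SSE2 */

/* Shift the whole 128-bit value left by 3 bytes, clearing the low-order bytes
   (PSLLDQ xmm, imm8). */
static __m128i shift_left_3_bytes(__m128i v)
{
    return _mm_slli_si128(v, 3);
}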
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-24, “Type 7 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
VEX.128.66.0F.WIG 71 /6 ib D V/V AVX Shift words in xmm2 left by imm8 while shifting
VPSLLW xmm1, xmm2, imm8 in 0s.
VEX.256.66.0F.WIG 71 /6 ib D V/V AVX2 Shift words in ymm2 left by imm8 while shifting
VPSLLW ymm1, ymm2, imm8 in 0s.
EVEX.128.66.0F.WIG F1 /r G V/V (AVX512VL AND Shift words in xmm2 left by amount specified in
VPSLLW xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 while shifting in 0s using
xmm3/m128 AVX10.12 writemask k1.
EVEX.256.66.0F.WIG F1 /r G V/V (AVX512VL AND Shift words in ymm2 left by amount specified in
VPSLLW ymm1 {k1}{z}, ymm2, AVX512BW) OR xmm3/m128 while shifting in 0s using
xmm3/m128 AVX10.12 writemask k1.
EVEX.512.66.0F.WIG F1 /r G V/V AVX512BW Shift words in zmm2 left by amount specified in
VPSLLW zmm1 {k1}{z}, zmm2, OR AVX10.12 xmm3/m128 while shifting in 0s using
xmm3/m128 writemask k1.
EVEX.128.66.0F.WIG 71 /6 ib E V/V (AVX512VL AND Shift words in xmm2/m128 left by imm8 while
VPSLLW xmm1 {k1}{z}, xmm2/m128, AVX512BW) OR shifting in 0s using writemask k1.
imm8 AVX10.12
EVEX.256.66.0F.WIG 71 /6 ib E V/V (AVX512VL AND Shift words in ymm2/m256 left by imm8 while
VPSLLW ymm1 {k1}{z}, ymm2/m256, AVX512BW) OR shifting in 0s using writemask k1.
imm8 AVX10.12
EVEX.512.66.0F.WIG 71 /6 ib E V/V AVX512BW Shift words in zmm2/m512 left by imm8 while
VPSLLW zmm1 {k1}{z}, zmm2/m512, OR AVX10.12 shifting in 0 using writemask k1.
imm8
EVEX.128.66.0F.W0 F2 /r G V/V (AVX512VL AND Shift doublewords in xmm2 left by amount
VPSLLD xmm1 {k1}{z}, xmm2, AVX512F) OR specified in xmm3/m128 while shifting in 0s
xmm3/m128 AVX10.12 under writemask k1.
EVEX.256.66.0F.W0 F2 /r G V/V (AVX512VL AND Shift doublewords in ymm2 left by amount
VPSLLD ymm1 {k1}{z}, ymm2, AVX512F) OR specified in xmm3/m128 while shifting in 0s
xmm3/m128 AVX10.12 under writemask k1.
EVEX.512.66.0F.W0 F2 /r G V/V AVX512F Shift doublewords in zmm2 left by amount
VPSLLD zmm1 {k1}{z}, zmm2, OR AVX10.12 specified in xmm3/m128 while shifting in 0s
xmm3/m128 under writemask k1.
EVEX.128.66.0F.W0 72 /6 ib F V/V (AVX512VL AND Shift doublewords in xmm2/m128/m32bcst left
VPSLLD xmm1 {k1}{z}, AVX512F) OR by imm8 while shifting in 0s using writemask k1.
xmm2/m128/m32bcst, imm8 AVX10.12
EVEX.256.66.0F.W0 72 /6 ib F V/V (AVX512VL AND Shift doublewords in ymm2/m256/m32bcst left
VPSLLD ymm1 {k1}{z}, AVX512F) OR by imm8 while shifting in 0s using writemask k1.
ymm2/m256/m32bcst, imm8 AVX10.12
EVEX.512.66.0F.W0 72 /6 ib F V/V AVX512F Shift doublewords in zmm2/m512/m32bcst left
VPSLLD zmm1 {k1}{z}, OR AVX10.12 by imm8 while shifting in 0s using writemask k1.
zmm2/m512/m32bcst, imm8
EVEX.128.66.0F.W1 F3 /r G V/V (AVX512VL AND Shift quadwords in xmm2 left by amount
VPSLLQ xmm1 {k1}{z}, xmm2, AVX512F) OR specified in xmm3/m128 while shifting in 0s
xmm3/m128 AVX10.12 using writemask k1.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Shifts the bits in the individual data elements (words, doublewords, or quadword) in the destination operand (first
operand) to the left by the number of bits specified in the count operand (second operand). As the bits in the data
elements are shifted left, the empty low-order bits are cleared (set to 0). If the value specified by the count
operand is greater than 15 (for words), 31 (for doublewords), or 63 (for a quadword), then the destination operand
is set to all 0s. Figure 4-17 gives an example of shifting words in a 64-bit operand.
Figure 4-17. PSLLW, PSLLD, and PSLLQ Instruction Operation Using 64-bit Operand
The (V)PSLLW instruction shifts each of the words in the destination operand to the left by the number of bits spec-
ified in the count operand; the (V)PSLLD instruction shifts each of the doublewords in the destination operand; and
the (V)PSLLQ instruction shifts the quadword (or quadwords) in the destination operand.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instructions 64-bit operand: The destination operand is an MMX technology register; the count
operand can be either an MMX technology register or an 64-bit memory location.
128-bit Legacy SSE version: The destination and first source operands are XMM registers. Bits (MAXVL-1:128) of
the corresponding YMM destination register remain unchanged. The count operand can be either an XMM register
or a 128-bit memory location or an 8-bit immediate. If the count operand is a memory address, 128 bits are loaded
but the upper 64 bits are ignored.
VEX.128 encoded version: The destination and first source operands are XMM registers. Bits (MAXVL-1:128) of the
destination YMM register are zeroed. The count operand can be either an XMM register or a 128-bit memory loca-
tion or an 8-bit immediate. If the count operand is a memory address, 128 bits are loaded but the upper 64 bits are
ignored.
VEX.256 encoded version: The destination operand is a YMM register. The source operand is a YMM register or a
memory location. The count operand can come either from an XMM register or a memory location or an 8-bit
immediate. Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.
EVEX encoded versions: The destination operand is a ZMM register updated according to the writemask. The count
operand is either an 8-bit immediate (the immediate count version) or an 8-bit value from an XMM register or a
memory location (the variable count version). For the immediate count version, the source operand (the second
operand) can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32/64-bit
memory location. For the variable count version, the first source operand (the second operand) is a ZMM register,
the second source operand (the third operand, 8-bit variable count) can be an XMM register or a memory location.
Note: In VEX/EVEX encoded versions of shifts with an immediate count, vvvv of VEX/EVEX encodes the destination
register, and VEX.B/EVEX.B + ModRM.r/m encodes the source register.
Note: For shifts with an immediate count (VEX.128.66.0F 71-73 /6, or EVEX.128.66.0F 71-73 /6),
VEX.vvvv/EVEX.vvvv encodes the destination register.
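An illustrative, non-normative C sketch of the word forms, assuming SSE2; _mm_slli_epi16 uses the immediate-count encoding, and _mm_sll_epi16 takes the count from the low quadword of an XMM register:

#include <emmintrin.h>   /* SSE2 */

/* Immediate-count form: shift each word left by 3 bits (PSLLW xmm, imm8). */
static __m128i shl_words_by_3(__m128i v)
{
    return _mm_slli_epi16(v, 3);
}

/* Variable-count form: the count comes from the low 64 bits of an XMM register;
   counts above 15 zero every word (PSLLW xmm, xmm/m128). */
static __m128i shl_words_by_n(__m128i v, unsigned n)
{
    return _mm_sll_epi16(v, _mm_cvtsi32_si128((int)n));
}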
Operation
PSLLW (With 64-bit Operand)
IF (COUNT > 15)
THEN
DEST[63:0] := 0000000000000000H;
ELSE
DEST[15:0] := ZeroExtend(DEST[15:0] << COUNT);
(* Repeat shift operation for 2nd and 3rd words *)
DEST[63:48] := ZeroExtend(DEST[63:48] << COUNT);
FI;
PSLLD (With 64-bit Operand)
IF (COUNT > 31)
THEN
DEST[63:0] := 0000000000000000H;
ELSE
DEST[31:0] := ZeroExtend(DEST[31:0] << COUNT);
DEST[63:32] := ZeroExtend(DEST[63:32] << COUNT);
FI;
LOGICAL_LEFT_SHIFT_WORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 15)
THEN
DEST[127:0] := 00000000000000000000000000000000H
ELSE
DEST[15:0] := ZeroExtend(SRC[15:0] << COUNT);
(* Repeat shift operation for 2nd through 7th words *)
DEST[127:112] := ZeroExtend(SRC[127:112] << COUNT);
FI;
LOGICAL_LEFT_SHIFT_DWORDS1(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
DEST[31:0] := 0
ELSE
DEST[31:0] := ZeroExtend(SRC[31:0] << COUNT);
FI;
LOGICAL_LEFT_SHIFT_DWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
DEST[127:0] := 00000000000000000000000000000000H
ELSE
DEST[31:0] := ZeroExtend(SRC[31:0] << COUNT);
(* Repeat shift operation for 2nd through 3rd dwords *)
DEST[127:96] := ZeroExtend(SRC[127:96] << COUNT);
FI;
LOGICAL_LEFT_SHIFT_QWORDS1(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
DEST[63:0] := 0
ELSE
DEST[63:0] := ZeroExtend(SRC[63:0] << COUNT);
FI;
LOGICAL_LEFT_SHIFT_DWORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
DEST[127:0] := 00000000000000000000000000000000H
DEST[255:128] := 00000000000000000000000000000000H
ELSE
DEST[31:0] := ZeroExtend(SRC[31:0] << COUNT);
(* Repeat shift operation for 2nd through 7th dwords *)
DEST[255:224] := ZeroExtend(SRC[255:224] << COUNT);
FI;
LOGICAL_LEFT_SHIFT_QWORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
DEST[127:0] := 00000000000000000000000000000000H
DEST[255:128] := 00000000000000000000000000000000H
ELSE
DEST[63:0] := ZeroExtend(SRC[63:0] << COUNT);
DEST[127:64] := ZeroExtend(SRC[127:64] << COUNT)
DEST[191:128] := ZeroExtend(SRC[191:128] << COUNT);
DEST[255:192] := ZeroExtend(SRC[255:192] << COUNT);
FI;
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
• VEX-encoded instructions:
— Syntax with RM/RVM operand encoding (A/C in the operand encoding table), see Table 2-21, “Type 4 Class
Exception Conditions.”
— Syntax with MI/VMI operand encoding (B/D in the operand encoding table), see Table 2-24, “Type 7 Class
Exception Conditions.”
• EVEX-encoded VPSLLW (E in the operand encoding table), see Exceptions Type E4NF.nb in Table 2-52, “Type
E4NF Class Exception Conditions.”
• EVEX-encoded VPSLLD/Q:
— Syntax with Mem128 tuple type (G in the operand encoding table), see Exceptions Type E4NF.nb in
Table 2-52, “Type E4NF Class Exception Conditions.”
— Syntax with Full tuple type (F in the operand encoding table), see Table 2-51, “Type E4 Class Exception
Conditions.”
EVEX.128.66.0F.WIG E1 /r G V/V (AVX512VL AND Shift words in xmm2 right by amount specified
VPSRAW xmm1 {k1}{z}, xmm2, AVX512BW) OR in xmm3/m128 while shifting in sign bits using
xmm3/m128 AVX10.12 writemask k1.
EVEX.256.66.0F.WIG E1 /r G V/V (AVX512VL AND Shift words in ymm2 right by amount specified
VPSRAW ymm1 {k1}{z}, ymm2, AVX512BW) OR in xmm3/m128 while shifting in sign bits using
xmm3/m128 AVX10.12 writemask k1.
EVEX.512.66.0F.WIG E1 /r G V/V AVX512BW Shift words in zmm2 right by amount specified
VPSRAW zmm1 {k1}{z}, zmm2, OR AVX10.12 in xmm3/m128 while shifting in sign bits using
xmm3/m128 writemask k1.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
Description
Shifts the bits in the individual data elements (words, doublewords or quadwords) in the destination operand (first
operand) to the right by the number of bits specified in the count operand (second operand). As the bits in the data
elements are shifted right, the empty high-order bits are filled with the initial value of the sign bit of the data
element. If the value specified by the count operand is greater than 15 (for words), 31 (for doublewords), or 63 (for
quadwords), each destination data element is filled with the initial value of the sign bit of the element. (Figure 4-18
gives an example of shifting words in a 64-bit operand.)
Figure 4-18. PSRAW and PSRAD Instruction Operation Using a 64-bit Operand
Note that only the low 64 bits of a 128-bit count operand are checked to compute the count. If the second source
operand is a memory address, 128 bits are loaded.
The (V)PSRAW instruction shifts each of the words in the destination operand to the right by the number of bits
specified in the count operand, and the (V)PSRAD instruction shifts each of the doublewords in the destination
operand.
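As a non-normative illustration, the C fragment below (assuming a compiler that provides <immintrin.h>; the wrapper name is illustrative) uses the intrinsic form of (V)PSRAW with the count taken from the low quadword of an XMM register. Sign bits are shifted in, so negative word elements remain negative.
#include <immintrin.h>
/* Arithmetic right shift of eight packed words by 3; for example, a word
   holding -8 (FFF8H) becomes -1 (FFFFH). */
__m128i sra_words_by_3(__m128i v)
{
    __m128i count = _mm_cvtsi32_si128(3);   /* count in the low 64 bits */
    return _mm_sra_epi16(v, count);
}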
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instructions 64-bit operand: The destination operand is an MMX technology register; the count
operand can be either an MMX technology register or a 64-bit memory location.
128-bit Legacy SSE version: The destination and first source operands are XMM registers. Bits (MAXVL-1:128) of
the corresponding YMM destination register remain unchanged. The count operand can be either an XMM register
or a 128-bit memory location or an 8-bit immediate. If the count operand is a memory address, 128 bits are loaded
but the upper 64 bits are ignored.
VEX.128 encoded version: The destination and first source operands are XMM registers. Bits (MAXVL-1:128) of the
destination YMM register are zeroed. The count operand can be either an XMM register or a 128-bit memory location.
Operation
PSRAW (With 64-bit Operand)
IF (COUNT > 15)
THEN COUNT := 16;
FI;
DEST[15:0] := SignExtend(DEST[15:0] >> COUNT);
(* Repeat shift operation for 2nd and 3rd words *)
DEST[63:48] := SignExtend(DEST[63:48] >> COUNT);
ARITHMETIC_RIGHT_SHIFT_DWORDS1(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
DEST[31:0] := SignBit
ELSE
DEST[31:0] := SignExtend(SRC[31:0] >> COUNT);
FI;
ARITHMETIC_RIGHT_SHIFT_QWORDS1(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
DEST[63:0] := SignBit
ELSE
DEST[63:0] := SignExtend(SRC[63:0] >> COUNT);
FI;
ARITHMETIC_RIGHT_SHIFT_WORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 15)
THEN COUNT := 16;
FI;
DEST[15:0] := SignExtend(SRC[15:0] >> COUNT);
(* Repeat shift operation for 2nd through 15th words *)
DEST[255:240] := SignExtend(SRC[255:240] >> COUNT);
ARITHMETIC_RIGHT_SHIFT_DWORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN COUNT := 32;
FI;
DEST[31:0] := SignExtend(SRC[31:0] >> COUNT);
(* Repeat shift operation for 2nd through 7th doublewords *)
DEST[255:224] := SignExtend(SRC[255:224] >> COUNT);
ARITHMETIC_RIGHT_SHIFT_WORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 15)
THEN COUNT := 16;
FI;
DEST[15:0] := SignExtend(SRC[15:0] >> COUNT);
(* Repeat shift operation for 2nd through 7th words *)
DEST[127:112] := SignExtend(SRC[127:112] >> COUNT);
ARITHMETIC_RIGHT_SHIFT_DWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN COUNT := 32;
FI;
DEST[31:0] := SignExtend(SRC[31:0] >> COUNT);
(* Repeat shift operation for 2nd through 3rd doublewords *)
DEST[127:96] := SignExtend(SRC[127:96] >> COUNT);
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO 7
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
• VEX-encoded instructions:
— Syntax with RM/RVM operand encoding (A/C in the operand encoding table), see Table 2-21, “Type 4 Class
Exception Conditions.”
— Syntax with MI/VMI operand encoding (B/D in the operand encoding table), see Table 2-24, “Type 7 Class
Exception Conditions.”
• EVEX-encoded VPSRAW (E in the operand encoding table), see Exceptions Type E4NF.nb in Table 2-52, “Type
E4NF Class Exception Conditions.”
• EVEX-encoded VPSRAD/Q:
— Syntax with Mem128 tuple type (G in the operand encoding table), see Exceptions Type E4NF.nb in
Table 2-52, “Type E4NF Class Exception Conditions.”
— Syntax with Full tuple type (F in the operand encoding table), see Table 2-51, “Type E4 Class Exception
Conditions.”
VEX.256.66.0F.WIG 73 /3 ib B V/V AVX2 Shift ymm1 right by imm8 bytes while shifting in
VPSRLDQ ymm1, ymm2, imm8 0s.
EVEX.128.66.0F.WIG 73 /3 ib C V/V (AVX512VL AND Shift xmm2/m128 right by imm8 bytes while
VPSRLDQ xmm1, xmm2/m128, imm8 AVX512BW) OR shifting in 0s and store result in xmm1.
AVX10.11
EVEX.256.66.0F.WIG 73 /3 ib C V/V (AVX512VL AND Shift ymm2/m256 right by imm8 bytes while
VPSRLDQ ymm1, ymm2/m256, imm8 AVX512BW) OR shifting in 0s and store result in ymm1.
AVX10.11
EVEX.512.66.0F.WIG 73 /3 ib C V/V AVX512BW Shift zmm2/m512 right by imm8 bytes while
VPSRLDQ zmm1, zmm2/m512, imm8 OR AVX10.11 shifting in 0s and store result in zmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Shifts the destination operand (first operand) to the right by the number of bytes specified in the count operand
(second operand). The empty high-order bytes are cleared (set to all 0s). If the value specified by the count
operand is greater than 15, the destination operand is set to all 0s. The count operand is an 8-bit immediate.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
128-bit Legacy SSE version: The source and destination operands are the same. Bits (MAXVL-1:128) of the corre-
sponding YMM destination register remain unchanged.
VEX.128 encoded version: The source and destination operands are XMM registers. Bits (MAXVL-1:128) of the
destination YMM register are zeroed.
VEX.256 encoded version: The source operand is a YMM register. The destination operand is a YMM register. Bits
(MAXVL-1:256) of the corresponding ZMM register are zeroed. The count operand applies to both the low and high
128-bit lanes.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The destination operand is a ZMM/YMM/XMM register. The count operand applies to each 128-bit lane.
Note: VEX.vvvv/EVEX.vvvv encodes the destination register.
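As a non-normative illustration, the C fragment below (assuming <immintrin.h>; the wrapper name is illustrative) uses the intrinsic form of PSRLDQ with an immediate byte count; the vacated high-order bytes become zero.
#include <immintrin.h>
/* Shift the whole 128-bit value right by 4 bytes: bytes 4..15 of v move to
   bytes 0..11, and bytes 12..15 of the result are cleared. */
__m128i drop_low_four_bytes(__m128i v)
{
    return _mm_srli_si128(v, 4);
}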
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-24, “Type 7 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
EVEX.128.66.0F.WIG D1 /r G V/V (AVX512VL AND Shift words in xmm2 right by amount specified
VPSRLW xmm1 {k1}{z}, xmm2, xmm3/m128 AVX512BW) OR in xmm3/m128 while shifting in 0s using
AVX10.12 writemask k1.
EVEX.256.66.0F.WIG D1 /r G V/V (AVX512VL AND Shift words in ymm2 right by amount specified
VPSRLW ymm1 {k1}{z}, ymm2, xmm3/m128 AVX512BW) OR in xmm3/m128 while shifting in 0s using
AVX10.12 writemask k1.
EVEX.512.66.0F.WIG D1 /r G V/V AVX512BW OR Shift words in zmm2 right by amount specified
VPSRLW zmm1 {k1}{z}, zmm2, xmm3/m128 AVX10.12 in xmm3/m128 while shifting in 0s using
writemask k1.
EVEX.128.66.0F.WIG 71 /2 ib E V/V (AVX512VL AND Shift words in xmm2/m128 right by imm8
VPSRLW xmm1 {k1}{z}, xmm2/m128, imm8 AVX512BW) OR while shifting in 0s using writemask k1.
AVX10.12
EVEX.256.66.0F.WIG 71 /2 ib E V/V (AVX512VL AND Shift words in ymm2/m256 right by imm8
VPSRLW ymm1 {k1}{z}, ymm2/m256, imm8 AVX512BW) OR while shifting in 0s using writemask k1.
AVX10.12
EVEX.512.66.0F.WIG 71 /2 ib E V/V AVX512BW OR Shift words in zmm2/m512 right by imm8
VPSRLW zmm1 {k1}{z}, zmm2/m512, imm8 AVX10.12 while shifting in 0s using writemask k1.
EVEX.128.66.0F.W0 D2 /r G V/V (AVX512VL AND Shift doublewords in xmm2 right by amount
VPSRLD xmm1 {k1}{z}, xmm2, xmm3/m128 AVX512F) OR specified in xmm3/m128 while shifting in 0s
AVX10.12 using writemask k1.
EVEX.256.66.0F.W0 D2 /r G V/V (AVX512VL AND Shift doublewords in ymm2 right by amount
VPSRLD ymm1 {k1}{z}, ymm2, xmm3/m128 AVX512F) OR specified in xmm3/m128 while shifting in 0s
AVX10.12 using writemask k1.
EVEX.512.66.0F.W0 D2 /r G V/V AVX512F Shift doublewords in zmm2 right by amount
VPSRLD zmm1 {k1}{z}, zmm2, xmm3/m128 OR AVX10.12 specified in xmm3/m128 while shifting in 0s
using writemask k1.
EVEX.128.66.0F.W0 72 /2 ib F V/V (AVX512VL AND Shift doublewords in xmm2/m128/m32bcst
VPSRLD xmm1 {k1}{z}, AVX512F) OR right by imm8 while shifting in 0s using
xmm2/m128/m32bcst, imm8 AVX10.12 writemask k1.
EVEX.256.66.0F.W0 72 /2 ib F V/V (AVX512VL AND Shift doublewords in ymm2/m256/m32bcst
VPSRLD ymm1 {k1}{z}, AVX512F) OR right by imm8 while shifting in 0s using
ymm2/m256/m32bcst, imm8 AVX10.12 writemask k1.
EVEX.512.66.0F.W0 72 /2 ib F V/V AVX512F Shift doublewords in zmm2/m512/m32bcst
VPSRLD zmm1 {k1}{z}, OR AVX10.12 right by imm8 while shifting in 0s using
zmm2/m512/m32bcst, imm8 writemask k1.
EVEX.128.66.0F.W1 D3 /r G V/V (AVX512VL AND Shift quadwords in xmm2 right by amount
VPSRLQ xmm1 {k1}{z}, xmm2, xmm3/m128 AVX512F) OR specified in xmm3/m128 while shifting in 0s
AVX10.12 using writemask k1.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Shifts the bits in the individual data elements (words, doublewords, or quadword) in the destination operand (first
operand) to the right by the number of bits specified in the count operand (second operand). As the bits in the data
elements are shifted right, the empty high-order bits are cleared (set to 0). If the value specified by the count
operand is greater than 15 (for words), 31 (for doublewords), or 63 (for a quadword), then the destination operand
is set to all 0s. Figure 4-19 gives an example of shifting words in a 64-bit operand.
Note that only the low 64 bits of a 128-bit count operand are checked to compute the count.
Figure 4-19. PSRLW, PSRLD, and PSRLQ Instruction Operation Using 64-bit Operand
The (V)PSRLW instruction shifts each of the words in the destination operand to the right by the number of bits
specified in the count operand; the (V)PSRLD instruction shifts each of the doublewords in the destination operand;
and the PSRLQ instruction shifts the quadword (or quadwords) in the destination operand.
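As a non-normative illustration, the C fragment below (assuming <immintrin.h>; the wrapper name is illustrative) uses the intrinsic form of (V)PSRLD with the shift count supplied in the low quadword of an XMM register; a count above 31 clears every doubleword element.
#include <immintrin.h>
/* Logical right shift of four packed doublewords; zeros are shifted in. */
__m128i srl_dwords(__m128i v, unsigned int count)
{
    return _mm_srl_epi32(v, _mm_cvtsi32_si128((int)count));
}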
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE instruction 64-bit operand: The destination operand is an MMX technology register; the count operand
can be either an MMX technology register or a 64-bit memory location.
128-bit Legacy SSE version: The destination operand is an XMM register; the count operand can be either an XMM
register or a 128-bit memory location, or an 8-bit immediate. If the count operand is a memory address, 128 bits
are loaded but the upper 64 bits are ignored. Bits (MAXVL-1:128) of the corresponding YMM destination register
remain unchanged.
VEX.128 encoded version: The destination operand is an XMM register; the count operand can be either an XMM
register or a 128-bit memory location, or an 8-bit immediate. If the count operand is a memory address, 128 bits
are loaded but the upper 64 bits are ignored. Bits (MAXVL-1:128) of the destination YMM register are zeroed.
VEX.256 encoded version: The destination operand is a YMM register. The source operand is a YMM register or a
memory location. The count operand can come either from an XMM register or a memory location or an 8-bit
immediate. Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.
EVEX encoded versions: The destination operand is a ZMM register updated according to the writemask. The count
operand is either an 8-bit immediate (the immediate count version) or an 8-bit value from an XMM register or a
memory location (the variable count version). For the immediate count version, the source operand (the second
operand) can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32/64-bit
memory location. For the variable count version, the first source operand (the second operand) is a ZMM register,
the second source operand (the third operand, 8-bit variable count) can be an XMM register or a memory location.
Note: In VEX/EVEX encoded versions of shifts with an immediate count, vvvv of VEX/EVEX encodes the destination
register, and VEX.B/EVEX.B + ModRM.r/m encodes the source register.
Note: For shifts with an immediate count (VEX.128.66.0F 71-73 /2, or EVEX.128.66.0F 71-73 /2),
VEX.vvvv/EVEX.vvvv encodes the destination register.
Operation
PSRLW (With 64-bit Operand)
IF (COUNT > 15)
THEN
DEST[63:0] := 0000000000000000H
ELSE
DEST[15:0] := ZeroExtend(DEST[15:0] >> COUNT);
(* Repeat shift operation for 2nd and 3rd words *)
DEST[63:48] := ZeroExtend(DEST[63:48] >> COUNT);
FI;
LOGICAL_RIGHT_SHIFT_QWORDS1(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
DEST[63:0] := 0
ELSE
DEST[63:0] := ZeroExtend(SRC[63:0] >> COUNT);
FI;
LOGICAL_RIGHT_SHIFT_WORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 15)
THEN
DEST[255:0] := 0
ELSE
DEST[15:0] := ZeroExtend(SRC[15:0] >> COUNT);
(* Repeat shift operation for 2nd through 15th words *)
DEST[255:240] := ZeroExtend(SRC[255:240] >> COUNT);
FI;
LOGICAL_RIGHT_SHIFT_WORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 15)
THEN
DEST[127:0] := 00000000000000000000000000000000H
ELSE
DEST[15:0] := ZeroExtend(SRC[15:0] >> COUNT);
(* Repeat shift operation for 2nd through 7th words *)
DEST[127:112] := ZeroExtend(SRC[127:112] >> COUNT);
FI;
LOGICAL_RIGHT_SHIFT_DWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
DEST[127:0] := 00000000000000000000000000000000H
ELSE
DEST[31:0] := ZeroExtend(SRC[31:0] >> COUNT);
(* Repeat shift operation for 2nd through 3rd doublewords *)
DEST[127:96] := ZeroExtend(SRC[127:96] >> COUNT);
FI;
LOGICAL_RIGHT_SHIFT_QWORDS_256b(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
DEST[255:0] := 0
ELSE
DEST[63:0] := ZeroExtend(SRC[63:0] >> COUNT);
DEST[127:64] := ZeroExtend(SRC[127:64] >> COUNT);
DEST[191:128] := ZeroExtend(SRC[191:128] >> COUNT);
DEST[255:192] := ZeroExtend(SRC[255:192] >> COUNT);
FI;
LOGICAL_RIGHT_SHIFT_QWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
DEST[127:0] := 00000000000000000000000000000000H
ELSE
DEST[63:0] := ZeroExtend(SRC[63:0] >> COUNT);
DEST[127:64] := ZeroExtend(SRC[127:64] >> COUNT);
FI;
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
• VEX-encoded instructions:
— Syntax with RM/RVM operand encoding (A/C in the operand encoding table), see Table 2-21, “Type 4 Class
Exception Conditions.”
— Syntax with MI/VMI operand encoding (B/D in the operand encoding table), see Table 2-24, “Type 7 Class
Exception Conditions.”
• EVEX-encoded VPSRLW (E in the operand encoding table), see Exceptions Type E4NF.nb in Table 2-52, “Type
E4NF Class Exception Conditions.”
• EVEX-encoded VPSRLD/Q:
— Syntax with Mem128 tuple type (G in the operand encoding table), see Exceptions Type E4NF.nb in
Table 2-52, “Type E4NF Class Exception Conditions.”
— Syntax with Full tuple type (F in the operand encoding table), see Table 2-51, “Type E4 Class Exception
Conditions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD subtract of the packed integers of the source operand (second operand) from the packed integers
of the destination operand (first operand), and stores the packed integer results in the destination operand. See
Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a
SIMD operation. Overflow is handled with wraparound, as described in the following paragraphs.
The (V)PSUBB instruction subtracts packed byte integers. When an individual result is too large or too small to be
represented in a byte, the result is wrapped around and the low 8 bits are written to the destination element.
The (V)PSUBW instruction subtracts packed word integers. When an individual result is too large or too small to be
represented in a word, the result is wrapped around and the low 16 bits are written to the destination element.
The (V)PSUBD instruction subtracts packed doubleword integers. When an individual result is too large or too small
to be represented in a doubleword, the result is wrapped around and the low 32 bits are written to the destination
element.
Note that the (V)PSUBB, (V)PSUBW, and (V)PSUBD instructions can operate on either unsigned or signed (two's
complement notation) packed integers; however, they do not set bits in the EFLAGS register to indicate overflow
and/or a carry. To prevent undetected overflow conditions, software must control the ranges of values upon which
it operates.
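As a non-normative illustration of the wraparound behavior, the C fragment below (assuming <immintrin.h>; the function name is illustrative) subtracts 1 from packed zero bytes; each result byte wraps to FFH and no flag records the borrow.
#include <immintrin.h>
/* 00H - 01H wraps to FFH in every byte lane; software must track ranges. */
__m128i psubb_wraparound_example(void)
{
    __m128i a = _mm_setzero_si128();
    __m128i b = _mm_set1_epi8(1);
    return _mm_sub_epi8(a, b);
}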
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The destination operand must be an MMX technology register and the source
operand can be either an MMX technology register or a 64-bit memory location.
Operation
PSUBB (With 64-bit Operands)
DEST[7:0] := DEST[7:0] − SRC[7:0];
(* Repeat subtract operation for 2nd through 7th bytes *)
DEST[63:56] := DEST[63:56] − SRC[63:56];
Flags Affected
None.
Numeric Exceptions
None.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Subtracts the second operand (source operand) from the first operand (destination operand) and stores the result
in the destination operand. When packed quadword operands are used, a SIMD subtract is performed. When a
quadword result is too large to be represented in 64 bits (overflow), the result is wrapped around and the low 64
bits are written to the destination element (that is, the carry is ignored).
Note that the (V)PSUBQ instruction can operate on either unsigned or signed (two’s complement notation) inte-
gers; however, it does not set bits in the EFLAGS register to indicate overflow and/or a carry. To prevent undetected
overflow conditions, software must control the ranges of the values upon which it operates.
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The source operand can be a quadword integer stored in an MMX technology
register or a 64-bit memory location.
Operation
PSUBQ (With 64-Bit Operands)
DEST[63:0] := DEST[63:0] − SRC[63:0];
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPSUBQ, see Table 2-51, “Type E4 Class Exception Conditions.”
Description
Performs a SIMD subtract of the packed signed integers of the source operand (second operand) from the packed
signed integers of the destination operand (first operand), and stores the packed integer results in the destination
operand. See Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an
illustration of a SIMD operation. Overflow is handled with signed saturation, as described in the following para-
graphs.
The (V)PSUBSB instruction subtracts packed signed byte integers. When an individual byte result is beyond the
range of a signed byte integer (that is, greater than 7FH or less than 80H), the saturated value of 7FH or 80H,
respectively, is written to the destination operand.
The (V)PSUBSW instruction subtracts packed signed word integers. When an individual word result is beyond the
range of a signed word integer (that is, greater than 7FFFH or less than 8000H), the saturated value of 7FFFH or
8000H, respectively, is written to the destination operand.
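As a non-normative illustration of signed saturation, the C fragment below (assuming <immintrin.h>; the function name is illustrative) subtracts 1 from packed bytes already holding 80H; the result stays at the saturated value 80H instead of wrapping to 7FH.
#include <immintrin.h>
/* (-128) - 1 saturates to -128 (80H) in every byte lane. */
__m128i psubsb_saturation_example(void)
{
    __m128i a = _mm_set1_epi8(-128);
    __m128i b = _mm_set1_epi8(1);
    return _mm_subs_epi8(a, b);
}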
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The destination operand must be an MMX technology register and the source
operand can be either an MMX technology register or a 64-bit memory location.
128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM desti-
nation register remain unchanged.
VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the destination YMM register
are zeroed.
VEX.256 encoded versions: The second source operand is a YMM register or a 256-bit memory location. The first
source operand and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding ZMM
register are zeroed.
EVEX encoded version: The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory
location. The first source operand and destination operands are ZMM/YMM/XMM registers. The destination is condi-
tionally updated with writemask k1.
Operation
PSUBSB (With 64-bit Operands)
DEST[7:0] := SaturateToSignedByte (DEST[7:0] − SRC[7:0]);
(* Repeat subtract operation for 2nd through 7th bytes *)
DEST[63:56] := SaturateToSignedByte (DEST[63:56] − SRC[63:56]);
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Description
Performs a SIMD subtract of the packed unsigned integers of the source operand (second operand) from the
packed unsigned integers of the destination operand (first operand), and stores the packed unsigned integer
results in the destination operand. See Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 1, for an illustration of a SIMD operation. Overflow is handled with unsigned saturation, as
described in the following paragraphs.
These instructions can operate on either 64-bit or 128-bit operands.
The (V)PSUBUSB instruction subtracts packed unsigned byte integers. When an individual byte result is less than
zero, the saturated value of 00H is written to the destination operand.
The (V)PSUBUSW instruction subtracts packed unsigned word integers. When an individual word result is less than
zero, the saturated value of 0000H is written to the destination operand.
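As a non-normative illustration of unsigned saturation, the C fragment below (assuming <immintrin.h>; the function name is illustrative) subtracts a larger unsigned byte from a smaller one; each result byte saturates to 00H rather than wrapping.
#include <immintrin.h>
/* 5 - 9 saturates to 00H in every unsigned byte lane. */
__m128i psubusb_saturation_example(void)
{
    __m128i a = _mm_set1_epi8(5);
    __m128i b = _mm_set1_epi8(9);
    return _mm_subs_epu8(a, b);
}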
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE version 64-bit operand: The destination operand must be an MMX technology register and the source
operand can be either an MMX technology register or a 64-bit memory location.
128-bit Legacy SSE version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM desti-
nation register remain unchanged.
VEX.128 encoded version: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the destination YMM register
are zeroed.
VEX.256 encoded versions: The second source operand is a YMM register or a 256-bit memory location. The first
source operand and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding ZMM
register are zeroed.
EVEX encoded version: The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory
location. The first source operand and destination operands are ZMM/YMM/XMM registers. The destination is condi-
tionally updated with writemask k1.
Operation
PSUBUSB (With 64-bit Operands)
DEST[7:0] := SaturateToUnsignedByte (DEST[7:0] − SRC[7:0]);
(* Repeat subtract operation for 2nd through 7th bytes *)
DEST[63:56] := SaturateToUnsignedByte (DEST[63:56] − SRC[63:56]);
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
VEX.128.66.0F.WIG 68/r B V/V AVX Interleave high-order bytes from xmm2 and
VPUNPCKHBW xmm1,xmm2, xmm3/m128 xmm3/m128 into xmm1.
VEX.128.66.0F.WIG 69/r B V/V AVX Interleave high-order words from xmm2 and
VPUNPCKHWD xmm1,xmm2, xmm3/m128 xmm3/m128 into xmm1.
EVEX.512.66.0F.WIG 68/r C V/V AVX512BW Interleave high-order bytes from zmm2 and
VPUNPCKHBW zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm3/m512 into zmm1 register.
zmm3/m512
EVEX.512.66.0F.WIG 69/r C V/V AVX512BW Interleave high-order words from zmm2 and
VPUNPCKHWD zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm3/m512 into zmm1 register.
zmm3/m512
EVEX.512.66.0F.W0 6A /r D V/V AVX512F Interleave high-order doublewords from
VPUNPCKHDQ zmm1 {k1}{z}, zmm2, OR AVX10.12 zmm2 and zmm3/m512/m32bcst into zmm1
zmm3/m512/m32bcst register using k1 write mask.
EVEX.512.66.0F.W1 6D /r D V/V AVX512F Interleave high-order quadword from zmm2
VPUNPCKHQDQ zmm1 {k1}{z}, zmm2, OR AVX10.12 and zmm3/m512/m64bcst into zmm1 register
zmm3/m512/m64bcst using k1 write mask.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX Reg-
isters,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Unpacks and interleaves the high-order data elements (bytes, words, doublewords, or quadwords) of the destina-
tion operand (first operand) and source operand (second operand) into the destination operand. Figure 4-20 shows
the unpack operation for bytes in 64-bit operands. The low-order data elements are ignored.
When the source data comes from a 64-bit memory operand, the full 64-bit operand is accessed from memory, but
the instruction uses only the high-order 32 bits. When the source data comes from a 128-bit memory operand, an
implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal
segment checking will still be enforced.
The (V)PUNPCKHBW instruction interleaves the high-order bytes of the source and destination operands, the
(V)PUNPCKHWD instruction interleaves the high-order words of the source and destination operands, the (V)PUNP-
CKHDQ instruction interleaves the high-order doubleword (or doublewords) of the source and destination oper-
ands, and the (V)PUNPCKHQDQ instruction interleaves the high-order quadwords of the source and destination
operands.
These instructions can be used to convert bytes to words, words to doublewords, doublewords to quadwords, and
quadwords to double quadwords, respectively, by placing all 0s in the source operand. Here, if the source operand
contains all 0s, the result (stored in the destination operand) contains zero extensions of the high-order data
elements from the original value in the destination operand. For example, with the (V)PUNPCKHBW instruction the
high-order bytes are zero extended (that is, unpacked into unsigned word integers), and with the (V)PUNPCKHWD
instruction, the high-order words are zero extended (unpacked into unsigned doubleword integers).
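As a non-normative illustration of this zero-extension idiom, the C fragment below (assuming <immintrin.h>; the function name is illustrative) interleaves the high-order bytes of a vector with an all-zero second operand, widening them into unsigned words.
#include <immintrin.h>
/* Bytes 8..15 of v become the low bytes of the eight result words; the high
   bytes of each word are zero. */
__m128i widen_high_bytes_to_words(__m128i v)
{
    return _mm_unpackhi_epi8(v, _mm_setzero_si128());
}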
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE versions 64-bit operand: The source operand can be an MMX technology register or a 64-bit memory
location. The destination operand is an MMX technology register.
128-bit Legacy SSE versions: The second source operand is an XMM register or a 128-bit memory location. The
first source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM
destination register remain unchanged.
VEX.128 encoded versions: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the destination YMM register
are zeroed.
VEX.256 encoded version: The second source operand is a YMM register or a 256-bit memory location. The first
source operand and destination operands are YMM registers.
Operation
PUNPCKHBW Instruction With 64-bit Operands:
DEST[7:0] := DEST[39:32];
DEST[15:8] := SRC[39:32];
DEST[23:16] := DEST[47:40];
DEST[31:24] := SRC[47:40];
DEST[39:32] := DEST[55:48];
DEST[47:40] := SRC[55:48];
DEST[55:48] := DEST[63:56];
DEST[63:56] := SRC[63:56];
INTERLEAVE_HIGH_WORDS_256b(SRC1, SRC2)
DEST[15:0] := SRC1[79:64]
DEST[31:16] := SRC2[79:64]
DEST[47:32] := SRC1[95:80]
DEST[63:48] := SRC2[95:80]
DEST[79:64] := SRC1[111:96]
DEST[95:80] := SRC2[111:96]
DEST[111:96] := SRC1[127:112]
DEST[127:112] := SRC2[127:112]
DEST[143:128] := SRC1[207:192]
DEST[159:144] := SRC2[207:192]
DEST[175:160] := SRC1[223:208]
DEST[191:176] := SRC2[223:208]
DEST[207:192] := SRC1[239:224]
DEST[223:208] := SRC2[239:224]
DEST[239:224] := SRC1[255:240]
DEST[255:240] := SRC2[255:240]
INTERLEAVE_HIGH_DWORDS_256b(SRC1, SRC2)
DEST[31:0] := SRC1[95:64]
DEST[63:32] := SRC2[95:64]
DEST[95:64] := SRC1[127:96]
DEST[127:96] := SRC2[127:96]
DEST[159:128] := SRC1[223:192]
DEST[191:160] := SRC2[223:192]
DEST[223:192] := SRC1[255:224]
DEST[255:224] := SRC2[255:224]
INTERLEAVE_HIGH_DWORDS(SRC1, SRC2)
DEST[31:0] := SRC1[95:64]
DEST[63:32] := SRC2[95:64]
DEST[95:64] := SRC1[127:96]
DEST[127:96] := SRC2[127:96]
INTERLEAVE_HIGH_QWORDS_256b(SRC1, SRC2)
DEST[63:0] := SRC1[127:64]
DEST[127:64] := SRC2[127:64]
DEST[191:128] := SRC1[255:192]
DEST[255:192] := SRC2[255:192]
INTERLEAVE_HIGH_QWORDS(SRC1, SRC2)
DEST[63:0] := SRC1[127:64]
DEST[127:64] := SRC2[127:64]
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TMP_DEST[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPUNPCKHDQ/QDQ, see Table 2-52, “Type E4NF Class Exception Conditions.”
EVEX-encoded VPUNPCKHBW/WD, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Condi-
tions.”
VEX.128.66.0F.WIG 60/r B V/V AVX Interleave low-order bytes from xmm2 and
VPUNPCKLBW xmm1,xmm2, xmm3/m128 xmm3/m128 into xmm1.
VEX.128.66.0F.WIG 61/r B V/V AVX Interleave low-order words from xmm2 and
VPUNPCKLWD xmm1,xmm2, xmm3/m128 xmm3/m128 into xmm1.
EVEX.128.66.0F.WIG 60 /r C V/V (AVX512VL AND Interleave low-order bytes from xmm2 and
VPUNPCKLBW xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 into xmm1 register subject to
xmm3/m128 AVX10.12 write mask k1.
EVEX.128.66.0F.WIG 61 /r C V/V (AVX512VL AND Interleave low-order words from xmm2 and
VPUNPCKLWD xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 into xmm1 register subject to
xmm3/m128 AVX10.12 write mask k1.
EVEX.128.66.0F.W0 62 /r D V/V (AVX512VL AND Interleave low-order doublewords from xmm2
VPUNPCKLDQ xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128/m32bcst into xmm1
xmm3/m128/m32bcst AVX10.12 register subject to write mask k1.
EVEX.128.66.0F.W1 6C /r D V/V (AVX512VL AND Interleave low-order quadword from xmm2
VPUNPCKLQDQ xmm1 {k1}{z}, xmm2, AVX512F) OR and xmm3/m128/m64bcst into xmm1
xmm3/m128/m64bcst AVX10.12 register subject to write mask k1.
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Unpacks and interleaves the low-order data elements (bytes, words, doublewords, and quadwords) of the destina-
tion operand (first operand) and source operand (second operand) into the destination operand. (Figure 4-22
shows the unpack operation for bytes in 64-bit operands.). The high-order data elements are ignored.
When the source data comes from a 128-bit memory operand, an implementation may fetch only the appropriate
64 bits; however, alignment to a 16-byte boundary and normal segment checking will still be enforced.
The (V)PUNPCKLBW instruction interleaves the low-order bytes of the source and destination operands, the
(V)PUNPCKLWD instruction interleaves the low-order words of the source and destination operands, the (V)PUNP-
CKLDQ instruction interleaves the low-order doubleword (or doublewords) of the source and destination operands,
and the (V)PUNPCKLQDQ instruction interleaves the low-order quadwords of the source and destination operands.
These instructions can be used to convert bytes to words, words to doublewords, doublewords to quadwords, and
quadwords to double quadwords, respectively, by placing all 0s in the source operand. Here, if the source operand
contains all 0s, the result (stored in the destination operand) contains zero extensions of the low-order data
elements from the original value in the destination operand. For example, with the (V)PUNPCKLBW instruction the
low-order bytes are zero extended (that is, unpacked into unsigned word integers), and with the (V)PUNPCKLWD
instruction, the low-order words are zero extended (unpacked into unsigned doubleword integers).
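As a non-normative illustration, the C fragment below (assuming <immintrin.h>; the function name is illustrative) applies the same idiom to the low-order bytes, zero extending bytes 0..7 of a vector into unsigned words.
#include <immintrin.h>
/* Bytes 0..7 of v become the low bytes of the eight result words; the high
   bytes of each word are zero. */
__m128i widen_low_bytes_to_words(__m128i v)
{
    return _mm_unpacklo_epi8(v, _mm_setzero_si128());
}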
In 64-bit mode and not encoded with VEX/EVEX, using a REX prefix in the form of REX.R permits this instruction to
access additional registers (XMM8-XMM15).
Legacy SSE versions 64-bit operand: The source operand can be an MMX technology register or a 32-bit memory
location. The destination operand is an MMX technology register.
128-bit Legacy SSE versions: The second source operand is an XMM register or a 128-bit memory location. The
first source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corresponding YMM
destination register remain unchanged.
VEX.128 encoded versions: The second source operand is an XMM register or a 128-bit memory location. The first
source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the destination YMM register
are zeroed.
VEX.256 encoded version: The second source operand is a YMM register or a 256-bit memory location. The first
source operand and destination operands are YMM registers. Bits (MAXVL-1:256) of the corresponding ZMM
register are zeroed.
EVEX encoded VPUNPCKLDQ/QDQ: The second source operand is a ZMM/YMM/XMM register, a 512/256/128-bit
memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. The first source
operand and destination operands are ZMM/YMM/XMM registers, updated according to the writemask.
Operation
PUNPCKLBW Instruction With 64-bit Operands:
DEST[63:56] := SRC[31:24];
DEST[55:48] := DEST[31:24];
DEST[47:40] := SRC[23:16];
DEST[39:32] := DEST[23:16];
DEST[31:24] := SRC[15:8];
DEST[23:16] := DEST[15:8];
DEST[15:8] := SRC[7:0];
DEST[7:0] := DEST[7:0];
INTERLEAVE_WORDS_256b(SRC1, SRC2)
DEST[15:0] := SRC1[15:0]
DEST[31:16] := SRC2[15:0]
DEST[47:32] := SRC1[31:16]
DEST[63:48] := SRC2[31:16]
DEST[79:64] := SRC1[47:32]
DEST[95:80] := SRC2[47:32]
DEST[111:96] := SRC1[63:48]
DEST[127:112] := SRC2[63:48]
DEST[143:128] := SRC1[143:128]
DEST[159:144] := SRC2[143:128]
DEST[175:160] := SRC1[159:144]
DEST[191:176] := SRC2[159:144]
DEST[207:192] := SRC1[175:160]
DEST[223:208] := SRC2[175:160]
DEST[239:224] := SRC1[191:176]
DEST[255:240] := SRC2[191:176]
INTERLEAVE_DWORDS_256b(SRC1, SRC2)
DEST[31:0] := SRC1[31:0]
DEST[63:32] := SRC2[31:0]
DEST[95:64] := SRC1[63:32]
DEST[127:96] := SRC2[63:32]
DEST[159:128] := SRC1[159:128]
DEST[191:160] := SRC2[159:128]
DEST[223:192] := SRC1[191:160]
DEST[255:224] := SRC2[191:160]
INTERLEAVE_DWORDS(SRC1, SRC2)
DEST[31:0] := SRC1[31:0]
DEST[63:32] := SRC2[31:0]
DEST[95:64] := SRC1[63:32]
DEST[127:96] := SRC2[63:32]
INTERLEAVE_QWORDS_512b (SRC1, SRC2)
TMP_DEST[255:0] := INTERLEAVE_QWORDS_256b(SRC1[255:0], SRC2[255:0])
TMP_DEST[511:256] := INTERLEAVE_QWORDS_256b(SRC1[511:256], SRC2[511:256])
INTERLEAVE_QWORDS_256b(SRC1, SRC2)
DEST[63:0] := SRC1[63:0]
DEST[127:64] := SRC2[63:0]
DEST[191:128] := SRC1[191:128]
DEST[255:192] := SRC2[191:128]
INTERLEAVE_QWORDS(SRC1, SRC2)
DEST[63:0] := SRC1[63:0]
DEST[127:64] := SRC2[63:0]
PUNPCKLBW
DEST[127:0] := INTERLEAVE_BYTES(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TMP_DEST[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
DEST[511:0] := INTERLEAVE_BYTES_512b(SRC1, SRC2)
PUNPCKLWD
DEST[127:0] := INTERLEAVE_WORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
PUNPCKLDQ
DEST[127:0] := INTERLEAVE_DWORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
PUNPCKLQDQ
DEST[127:0] := INTERLEAVE_QWORDS(DEST, SRC)
DEST[MAXVL-1:128] (Unmodified)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPUNPCKLDQ/QDQ, see Table 2-52, “Type E4NF Class Exception Conditions.”
EVEX-encoded VPUNPCKLBW/WD, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Condi-
tions.”
NOTES:
1. See note in Section 2.5, “Intel® AVX and Intel® SSE Instruction Exception Classification,” in the Intel® 64 and IA-32 Architectures Soft-
ware Developer’s Manual, Volume 2A, and Section 24.25.3, “Exception Conditions of Legacy SIMD Instructions Operating on MMX
Registers,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a bitwise logical exclusive-OR (XOR) operation on the source operand (second operand) and the destina-
tion operand (first operand) and stores the result in the destination operand. Each bit of the result is 1 if the corre-
sponding bits of the two operands are different; each bit is 0 if the corresponding bits of the operands are the
same.
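As a non-normative illustration, the C fragment below (assuming <immintrin.h>; the function name is illustrative) uses the intrinsic form of PXOR to toggle a chosen bit in every byte of a vector.
#include <immintrin.h>
/* XOR with a repeated 01H mask flips bit 0 of every byte; bits where the mask
   is 0 are unchanged. */
__m128i toggle_low_bit_of_each_byte(__m128i v)
{
    return _mm_xor_si128(v, _mm_set1_epi8(1));
}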
Operation
PXOR (64-bit Operand)
DEST := DEST XOR SRC
Flags Affected
None.
Numeric Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
Description
This instruction reads a software-provided list of up to 64 MSRs and stores their values in memory.
RDMSRLIST takes three implied input operands:
• RSI: Linear address of a table of MSR addresses (8 bytes per address)1.
• RDI: Linear address of a table into which MSR data is stored (8 bytes per MSR).
• RCX: 64-bit bitmask of valid bits for the MSRs. Bit 0 is the valid bit for entry 0 in each table, etc.
For each RCX bit [n] from 0 to 63, if RCX[n] is 1, RDMSRLIST will read the MSR specified at entry [n] in the RSI-
based table and write it out to memory at the entry [n] in the RDI-based table.
This implies a maximum of 64 MSRs that can be processed by this instruction. The processor will clear RCX[n] after
it finishes handling that MSR. Similar to repeated string operations, RDMSRLIST supports partial completion for
interrupts, exceptions, and traps. In these situations, the RIP register saved will point to the RDMSRLIST instruc-
tion while the RCX register will have cleared bits corresponding to all completed iterations.
This instruction must be executed at privilege level 0; otherwise, a general protection exception #GP(0) is gener-
ated. This instruction performs MSR-specific checks in the same manner as RDMSR.
Although RDMSRLIST accesses the entries in the two tables in order, the actual reads of the MSRs may be
performed out of order: for table entries m < n, the processor may read the MSR for entry n before reading the
MSR for entry m. (This may be true also for a sequence of executions of RDMSR.) Ordering is guaranteed if the
address of the IA32_BARRIER MSR (2FH) appears in the table of MSR addresses. Specifically, if IA32_BARRIER
appears at entry m, then the MSR read for any entry n with n > m will not occur until (1) all instructions prior to
RDMSRLIST have completed locally; and (2) MSRs have been read for all table entries before entry m.
The processor is allowed (but not required) to “load ahead” in the list. For example, it may cause a page fault for
an access to a table entry after the nth, despite the processor having read only n MSRs.2
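As a non-normative sketch of how the implied operands might be set up (assuming a GCC/Clang-style compiler whose assembler accepts the RDMSRLIST mnemonic, execution at CPL 0, and illustrative MSR addresses; the function name is also illustrative), one execution reading two MSRs could look like this:
#include <stdint.h>
void read_two_msrs(uint64_t out[2])
{
    /* Table of MSR addresses, 8 bytes per entry; bits 63:32 must be zero. */
    static const uint64_t msr_addrs[2] = { 0x2F /* IA32_BARRIER */, 0xC0000103 /* IA32_TSC_AUX */ };
    uint64_t valid = 0x3;                  /* RCX: entries 0 and 1 are valid */
    __asm__ volatile("rdmsrlist"
                     : "+c"(valid)         /* bits clear as each MSR completes */
                     : "S"(msr_addrs), "D"(out)
                     : "memory");
    /* On normal completion valid == 0 and out[0], out[1] hold the MSR data. */
}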
Operation
DO WHILE RCX != 0
MSR_index := position of least significant bit set in RCX;
Load MSR_address_table_entry from 8 bytes at the linear address RSI + (MSR_index * 8);
IF MSR_address_table_entry[63:32] != 0 THEN #GP(0); FI;
MSR_address := MSR_address_table_entry[31:0];
IF RDMSR of the MSR with address MSR_address would #GP THEN #GP(0); FI;
Store the value of the MSR with address MSR_address into 8 bytes at the linear address RDI + (MSR_index * 8);
RCX[MSR_index] := 0;
Allow delivery of any pending interrupts or traps;
OD;
1. Since MSR addresses are only 32 bits wide, bits 63:32 of each MSR address table entry are reserved.
2. For example, the processor may take a page fault due to a linear address for the 10th entry in the MSR address table despite only
having completed the MSR reads up to entry 5.
Description
Reads the contents of the performance monitoring counter (PMC) specified in ECX register into registers EDX:EAX.
(On processors that support the Intel 64 architecture, the high-order 32 bits of RCX are ignored.) The EDX register
is loaded with the high-order 32 bits of the PMC and the EAX register is loaded with the low-order 32 bits. (On
processors that support the Intel 64 architecture, the high-order 32 bits of each of RAX and RDX are cleared.) If
fewer than 64 bits are implemented in the PMC being read, unimplemented bits returned to EDX:EAX will have
value zero.
The width of PMCs on processors supporting architectural performance monitoring (CPUID.0AH:EAX[7:0] ≠ 0) is
reported by CPUID.0AH:EAX[23:16]. On processors that do not support architectural performance monitoring
(CPUID.0AH:EAX[7:0]=0), the width of general-purpose performance PMCs is 40 bits, while the widths of special-
purpose PMCs are implementation specific.
Use of ECX to specify a PMC depends on whether the processor supports architectural performance monitoring:
• If the processor does not support architectural performance monitoring (CPUID.0AH:EAX[7:0]=0), ECX[30:0]
specifies the index of the PMC to be read. Setting ECX[31] selects “fast” read mode if supported. In this mode,
RDPMC returns bits 31:0 of the PMC in EAX while clearing EDX to zero.
• If the processor does support architectural performance monitoring (CPUID.0AH:EAX[7:0] ≠ 0), ECX[31:16]
specifies type of PMC while ECX[15:0] specifies the index of the PMC to be read within that type. The following
PMC types are currently defined:
— General-purpose counters use type 0. To read IA32_PMCx, one of the following must hold for the index x:
• It is less than the value enumerated by CPUID.0AH.EAX[15:8]; or
• It is at most 31 and the value enumerated by CPUID.(EAX=23H,ECX=1):EAX[bit x] is 1.
— Fixed-function counters use type 4000H. To read IA32_FIXED_CTRx, one of the following must hold for the
index x:
• It is less than the value enumerated by CPUID.0AH:EDX[4:0];
• It is at most 31 and the value enumerated by CPUID.0AH:ECX[bit x] is 1; or
• It is at most 31 and the value enumerated by CPUID.(EAX=23H,ECX=1):EBX[bit x] is 1.
— Performance metrics use type 2000H. This type can be used only if IA32_PERF_CAPABILITIES.PERF_MET-
RICS_AVAILABLE[bit 15]=1. For this type, the index in ECX[15:0] is implementation specific.
Specifying an unsupported PMC encoding will cause a general protection exception #GP(0). For PMC details see
Chapter 21, “Performance Monitoring,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3B.
When in protected or virtual 8086 mode, the Performance-monitoring Counters Enabled (PCE) flag in register
CR4 restricts the use of the RDPMC instruction. When the PCE flag is set, the RDPMC instruction can be executed
at any privilege level; when the flag is clear, the instruction can only be executed at privilege level 0. (When in real-
address mode, the RDPMC instruction is always enabled.) The PMCs can also be read with the RDMSR instruction,
when executing at privilege level 0.
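As a non-normative illustration (assuming a GCC/Clang-style compiler, that the selected counter is supported and enabled, and that CR4.PCE = 1 or execution is at privilege level 0; the wrapper name is illustrative), a 64-bit counter value can be assembled from EDX:EAX as follows:
#include <stdint.h>
/* ecx_selector examples: 0 selects IA32_PMC0 (general-purpose, type 0);
   0x40000000 selects IA32_FIXED_CTR0 (fixed-function, type 4000H in ECX[31:16]). */
static inline uint64_t rdpmc64(uint32_t ecx_selector)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(ecx_selector));
    return ((uint64_t)hi << 32) | lo;
}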
Processors that support performance metrics may also support clearing them on read if the
IA32_PERF_CAPABILITIES.RDPMC_METRICS_CLEAR[bit 19] is set. Since the IA32_PERF_CAPABILITIES MSR
Operation
MSCB = Most Significant Counter Bit (* Model-specific *)
IF (((CR4.PCE = 1) or (CPL = 0) or (CR0.PE = 0)) and (ECX indicates a supported counter))
THEN
EAX := counter[31:0];
EDX := ZeroExtend(counter[MSCB:32]);
ELSE (* ECX is not valid or CR4.PCE is 0 and CPL is 1, 2, or 3 and CR0.PE is 1 *)
#GP(0);
FI;
Flags Affected
None.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Selects a double precision floating-point value from each input pair using a bit control and moves it to a designated
element of the destination operand. In low-to-high order, the double precision elements of the destination operand
are interleaved between the first source operand and the second source operand at the granularity of a 128-bit
input pair. Each bit in the imm8 byte, starting from bit 0, is the select control for the corresponding destination
element that receives the shuffled result of an input pair.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
64-bit memory location The destination operand is a ZMM/YMM/XMM register updated according to the writemask.
The select controls are the lower 8/4/2 bits of the imm8 byte.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register. The select controls are bits 3:0
of the imm8 byte; imm8[7:4] are ignored.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM
register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are zeroed. The select controls are bits 1:0 of the imm8 byte;
imm8[7:2] are ignored.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The
destination operand and the first source operand are the same XMM register. The upper bits (MAXVL-1:128) of
the corresponding ZMM register destination are unmodified. The select controls are bits 1:0 of the imm8 byte;
imm8[7:2] are ignored.
Figure 4-25. 256-bit VSHUFPD Operation of Four Pairs of Double Precision Floating-Point Values
Operation
VSHUFPD (EVEX Encoded Versions When SRC2 is a Vector Register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF IMM0[0] = 0
THEN TMP_DEST[63:0] := SRC1[63:0]
ELSE TMP_DEST[63:0] := SRC1[127:64] FI;
IF IMM0[1] = 0
THEN TMP_DEST[127:64] := SRC2[63:0]
ELSE TMP_DEST[127:64] := SRC2[127:64] FI;
IF VL >= 256
IF IMM0[2] = 0
THEN TMP_DEST[191:128] := SRC1[191:128]
ELSE TMP_DEST[191:128] := SRC1[255:192] FI;
IF IMM0[3] = 0
THEN TMP_DEST[255:192] := SRC2[191:128]
ELSE TMP_DEST[255:192] := SRC2[255:192] FI;
FI;
IF VL >= 512
IF IMM0[4] = 0
THEN TMP_DEST[319:256] := SRC1[319:256]
ELSE TMP_DEST[319:256] := SRC1[383:320] FI;
IF IMM0[5] = 0
THEN TMP_DEST[383:320] := SRC2[319:256]
ELSE TMP_DEST[383:320] := SRC2[383:320] FI;
IF IMM0[6] = 0
THEN TMP_DEST[447:384] := SRC1[447:384]
ELSE TMP_DEST[447:384] := SRC1[511:448] FI;
IF IMM0[7] = 0
THEN TMP_DEST[511:448] := SRC2[447:384]
ELSE TMP_DEST[511:448] := SRC2[511:448] FI;
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1)
THEN TMP_SRC2[i+63:i] := SRC2[63:0]
ELSE TMP_SRC2[i+63:i] := SRC2[i+63:i]
FI;
ENDFOR;
IF IMM0[0] = 0
THEN TMP_DEST[63:0] := SRC1[63:0]
ELSE TMP_DEST[63:0] := SRC1[127:64] FI;
IF IMM0[1] = 0
THEN TMP_DEST[127:64] := TMP_SRC2[63:0]
ELSE TMP_DEST[127:64] := TMP_SRC2[127:64] FI;
IF VL >= 256
IF IMM0[2] = 0
THEN TMP_DEST[191:128] := SRC1[191:128]
ELSE TMP_DEST[191:128] := SRC1[255:192] FI;
IF IMM0[3] = 0
THEN TMP_DEST[255:192] := TMP_SRC2[191:128]
ELSE TMP_DEST[255:192] := TMP_SRC2[255:192] FI;
FI;
IF VL >= 512
IF IMM0[4] = 0
THEN TMP_DEST[319:256] := SRC1[319:256]
ELSE TMP_DEST[319:256] := SRC1[383:320] FI;
IF IMM0[5] = 0
THEN TMP_DEST[383:320] := TMP_SRC2[319:256]
ELSE TMP_DEST[383:320] := TMP_SRC2[383:320] FI;
IF IMM0[6] = 0
THEN TMP_DEST[447:384] := SRC1[447:384]
ELSE TMP_DEST[447:384] := SRC1[511:448] FI;
IF IMM0[7] = 0
THEN TMP_DEST[511:448] := TMP_SRC2[447:384]
ELSE TMP_DEST[511:448] := TMP_SRC2[511:448] FI;
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VSHUFPD __m512d _mm512_shuffle_pd(__m512d a, __m512d b, int imm);
VSHUFPD __m512d _mm512_mask_shuffle_pd(__m512d s, __mmask8 k, __m512d a, __m512d b, int imm);
VSHUFPD __m512d _mm512_maskz_shuffle_pd( __mmask8 k, __m512d a, __m512d b, int imm);
VSHUFPD __m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int select);
VSHUFPD __m256d _mm256_mask_shuffle_pd(__m256d s, __mmask8 k, __m256d a, __m256d b, int imm);
VSHUFPD __m256d _mm256_maskz_shuffle_pd( __mmask8 k, __m256d a, __m256d b, int imm);
SHUFPD __m128d _mm_shuffle_pd (__m128d a, __m128d b, const int select);
VSHUFPD __m128d _mm_mask_shuffle_pd(__m128d s, __mmask8 k, __m128d a, __m128d b, int imm);
VSHUFPD __m128d _mm_maskz_shuffle_pd( __mmask8 k, __m128d a, __m128d b, int imm);
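The short C sketch below is a minimal illustration of the two-bit select controls, using the 128-bit _mm_shuffle_pd intrinsic listed above; it assumes an SSE2-capable compiler providing <immintrin.h>, and the expected values in the comments follow the selection rules described earlier.
#include <immintrin.h>
#include <stdio.h>
int main(void) {
    /* a = {a1, a0} = {20.0, 10.0}, b = {b1, b0} = {40.0, 30.0}; element 0 is the low element. */
    __m128d a = _mm_set_pd(20.0, 10.0);
    __m128d b = _mm_set_pd(40.0, 30.0);
    /* imm8 = _MM_SHUFFLE2(1, 0) = 2: bit 0 = 0 selects a[0] for the low result element,
       bit 1 = 1 selects b[1] for the high result element. */
    __m128d r = _mm_shuffle_pd(a, b, _MM_SHUFFLE2(1, 0));
    double out[2];
    _mm_storeu_pd(out, r);
    printf("%f %f\n", out[0], out[1]);   /* expected: 10.000000 40.000000 */
    return 0;
}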
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”
SHUFPS—Packed Interleave Shuffle of Quadruplets of Single Precision Floating-Point Values
NP 0F C6 /r ib
SHUFPS xmm1, xmm2/m128, imm8
Op/En: A; 64/32-bit Mode Support: V/V; CPUID Feature Flag: SSE
Select from quadruplet of single precision floating-point values in xmm1 and xmm2/m128 using imm8; interleaved result pairs are stored in xmm1.

VEX.128.0F.WIG C6 /r ib
VSHUFPS xmm1, xmm2, xmm3/m128, imm8
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX
Select from quadruplet of single precision floating-point values in xmm1 and xmm2/m128 using imm8; interleaved result pairs are stored in xmm1.

VEX.256.0F.WIG C6 /r ib
VSHUFPS ymm1, ymm2, ymm3/m256, imm8
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX
Select from quadruplet of single precision floating-point values in ymm2 and ymm3/m256 using imm8; interleaved result pairs are stored in ymm1.

EVEX.128.0F.W0 C6 /r ib
VSHUFPS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst, imm8
Op/En: C; 64/32-bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Select from quadruplet of single precision floating-point values in xmm1 and xmm2/m128 using imm8; interleaved result pairs are stored in xmm1, subject to writemask k1.

EVEX.256.0F.W0 C6 /r ib
VSHUFPS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst, imm8
Op/En: C; 64/32-bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Select from quadruplet of single precision floating-point values in ymm2 and ymm3/m256 using imm8; interleaved result pairs are stored in ymm1, subject to writemask k1.

EVEX.512.0F.W0 C6 /r ib
VSHUFPS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst, imm8
Op/En: C; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Select from quadruplet of single precision floating-point values in zmm2 and zmm3/m512 using imm8; interleaved result pairs are stored in zmm1, subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Selects a single precision floating-point value from an input quadruplet using a two-bit control and moves it to a designated element of the destination operand. Each 64-bit element-pair of a 128-bit lane of the destination operand is interleaved between the corresponding lanes of the first source operand and the second source operand at a granularity of 128 bits. Each two bits in the imm8 byte, starting from bit 0, are the select control of the corresponding element of a 128-bit lane of the destination to receive the shuffled result of an input quadruplet. The two lower elements of a 128-bit lane in the destination receive shuffle results from the quadruplet of the first source operand. The next two elements of the destination receive shuffle results from the quadruplet of the second source operand.
EVEX encoded versions: The first source operand is a ZMM/YMM/XMM register. The second source operand can be
a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32-bit memory location. The destination operand is a ZMM/YMM/XMM register updated according to the writemask.
imm8[7:0] provides 4 select controls for each applicable 128-bit lane of the destination.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register. Imm8[7:0] provides 4 select controls for the high and low 128-bit lanes of the destination.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed. Imm8[7:0] provides 4 select controls for each element of the destination.
128-bit Legacy SSE version: The source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified. Imm8[7:0] provides 4 select controls for each element of the destination.
SRC1: X7 X6 X5 X4 X3 X2 X1 X0
SRC2: Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0
Figure 4-26. 256-bit VSHUFPS Operation of Selection from Input Quadruplet and Pair-wise Interleaved Result
Operation
Select4(SRC, control) {
CASE (control[1:0]) OF
0: TMP := SRC[31:0];
1: TMP := SRC[63:32];
2: TMP := SRC[95:64];
3: TMP := SRC[127:96];
ESAC;
RETURN TMP
}
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
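As a minimal illustration of the quadruplet selection, the C sketch below uses the legacy _mm_shuffle_ps intrinsic and the _MM_SHUFFLE helper macro; it assumes an SSE-capable compiler providing <immintrin.h>.
#include <immintrin.h>
#include <stdio.h>
int main(void) {
    /* a = {a0..a3} = {1,2,3,4}, b = {b0..b3} = {5,6,7,8}; element 0 is the low element. */
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_setr_ps(5.0f, 6.0f, 7.0f, 8.0f);
    /* _MM_SHUFFLE(d, c, b, a) builds imm8 = d<<6 | c<<4 | b<<2 | a.
       The two low results come from the first source (controls 1 and 2 -> a1, a2);
       the two high results come from the second source (controls 0 and 3 -> b0, b3). */
    __m128 r = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 0, 2, 1));
    float out[4];
    _mm_storeu_ps(out, r);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]); /* expected: 2 3 5 8 */
    return 0;
}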
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”
SQRTPD—Square Root of Double Precision Floating-Point Values
66 0F 51 /r
SQRTPD xmm1, xmm2/m128
Op/En: A; 64/32-bit Mode Support: V/V; CPUID Feature Flag: SSE2
Computes Square Roots of the packed double precision floating-point values in xmm2/m128 and stores the result in xmm1.

VEX.128.66.0F.WIG 51 /r
VSQRTPD xmm1, xmm2/m128
Op/En: A; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX
Computes Square Roots of the packed double precision floating-point values in xmm2/m128 and stores the result in xmm1.

VEX.256.66.0F.WIG 51 /r
VSQRTPD ymm1, ymm2/m256
Op/En: A; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX
Computes Square Roots of the packed double precision floating-point values in ymm2/m256 and stores the result in ymm1.

EVEX.128.66.0F.W1 51 /r
VSQRTPD xmm1 {k1}{z}, xmm2/m128/m64bcst
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Computes Square Roots of the packed double precision floating-point values in xmm2/m128/m64bcst and stores the result in xmm1 subject to writemask k1.

EVEX.256.66.0F.W1 51 /r
VSQRTPD ymm1 {k1}{z}, ymm2/m256/m64bcst
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Computes Square Roots of the packed double precision floating-point values in ymm2/m256/m64bcst and stores the result in ymm1 subject to writemask k1.

EVEX.512.66.0F.W1 51 /r
VSQRTPD zmm1 {k1}{z}, zmm2/m512/m64bcst{er}
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Computes Square Roots of the packed double precision floating-point values in zmm2/m512/m64bcst and stores the result in zmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD computation of the square roots of the two, four or eight packed double precision floating-point values in the source operand (the second operand) and stores the packed double precision floating-point results in the destination operand (the first operand).
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or
a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is a
ZMM/YMM/XMM register updated according to the writemask.
VEX.256 encoded version: The source operand is a YMM register or a 256-bit memory location. The destination
operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are
zeroed.
VEX.128 encoded version: The source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or 128-bit memory location. The destina-
tion is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM
register destination are unmodified.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
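The short C sketch below is a minimal usage illustration; _mm_sqrt_pd is the intrinsic name assumed here for the legacy SQRTPD form, and an SSE2-capable compiler providing <immintrin.h> is assumed.
#include <immintrin.h>
#include <stdio.h>
int main(void) {
    __m128d v = _mm_set_pd(9.0, 4.0);      /* {9.0, 4.0}; the low element is 4.0 */
    __m128d r = _mm_sqrt_pd(v);            /* SQRTPD: element-wise square root */
    double out[2];
    _mm_storeu_pd(out, r);
    printf("%f %f\n", out[0], out[1]);     /* expected: 2.000000 3.000000 */
    return 0;
}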
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions,” additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD computation of the square roots of the four, eight or sixteen packed single precision floating-point values in the source operand (second operand) and stores the packed single precision floating-point results in the destination operand.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register updated according to the writemask.
VEX.256 encoded version: The source operand is a YMM register or a 256-bit memory location. The destination
operand is a YMM register. The upper bits (MAXVL-1:256) of the corresponding ZMM register destination are
zeroed.
VEX.128 encoded version: The source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or 128-bit memory location. The destina-
tion is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM
register destination are unmodified.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
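As a minimal illustration of the packed single precision form, the C sketch below uses the 256-bit _mm256_sqrt_ps intrinsic (an assumed VSQRTPS mapping); it requires a compiler and processor with AVX support.
#include <immintrin.h>
#include <stdio.h>
int main(void) {
    __m256 v = _mm256_setr_ps(1.0f, 4.0f, 9.0f, 16.0f, 25.0f, 36.0f, 49.0f, 64.0f);
    __m256 r = _mm256_sqrt_ps(v);          /* VSQRTPS: element-wise square root */
    float out[8];
    _mm256_storeu_ps(out, r);
    for (int i = 0; i < 8; i++)
        printf("%g ", out[i]);             /* expected: 1 2 3 4 5 6 7 8 */
    printf("\n");
    return 0;
}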
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-19, “Type 2 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions,” additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Computes the square root of the low double precision floating-point value in the second source operand and stores
the double precision floating-point result in the destination operand. The second source operand can be an XMM
register or a 64-bit memory location. The first source and destination operands are XMM registers.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. The quadword at
bits 127:64 of the destination operand remains unchanged. Bits (MAXVL-1:64) of the corresponding destination
register remain unchanged.
VEX.128 and EVEX encoded versions: Bits 127:64 of the destination operand are copied from the corresponding
bits of the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination operand is updated according to the write-
mask.
Software should ensure VSQRTSD is encoded with VEX.L=0. Encoding VSQRTSD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
Operation
VSQRTSD (EVEX Encoded Version)
IF (EVEX.b = 1) AND (SRC2 *is register*)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := SQRT(SRC2[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0
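A minimal C sketch of the scalar behavior (low element computed, upper element carried from the first source) using the _mm_sqrt_sd intrinsic; an SSE2-capable compiler providing <immintrin.h> is assumed.
#include <immintrin.h>
#include <stdio.h>
int main(void) {
    __m128d a = _mm_set_pd(7.0, 1.0);      /* the upper element 7.0 is carried through */
    __m128d b = _mm_set_pd(99.0, 25.0);    /* only the low element 25.0 is used */
    /* Low result = sqrt of b's low element; upper result copied from a's upper element. */
    __m128d r = _mm_sqrt_sd(a, b);
    double out[2];
    _mm_storeu_pd(out, r);
    printf("%f %f\n", out[0], out[1]);     /* expected: 5.000000 7.000000 */
    return 0;
}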
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”
SQRTSS—Compute Square Root of Scalar Single Precision Value
F3 0F 51 /r
SQRTSS xmm1, xmm2/m32
Op/En: A; 64/32-bit Mode Support: V/V; CPUID Feature Flag: SSE
Computes square root of the low single precision floating-point value in xmm2/m32 and stores the results in xmm1.

VEX.LIG.F3.0F.WIG 51 /r
VSQRTSS xmm1, xmm2, xmm3/m32
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX
Computes square root of the low single precision floating-point value in xmm3/m32 and stores the results in xmm1. Also, upper single precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].

EVEX.LLIG.F3.0F.W0 51 /r
VSQRTSS xmm1 {k1}{z}, xmm2, xmm3/m32{er}
Op/En: C; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Computes square root of the low single precision floating-point value in xmm3/m32 and stores the results in xmm1 under writemask k1. Also, upper single precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Computes the square root of the low single precision floating-point value in the second source operand and stores
the single precision floating-point result in the destination operand. The second source operand can be an XMM
register or a 32-bit memory location. The first source and destination operands are XMM registers.
128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (MAXVL-
1:32) of the corresponding YMM destination register remain unchanged.
VEX.128 and EVEX encoded versions: Bits 127:32 of the destination operand are copied from the corresponding
bits of the first source operand. Bits (MAXVL-1:128) of the destination ZMM register are zeroed.
EVEX encoded version: The low doubleword element of the destination operand is updated according to the write-
mask.
Software should ensure VSQRTSS is encoded with VEX.L=0. Encoding VSQRTSS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
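A minimal C sketch of the scalar single precision form using _mm_sqrt_ss; an SSE-capable compiler is assumed.
#include <immintrin.h>
#include <stdio.h>
int main(void) {
    __m128 a = _mm_setr_ps(16.0f, 2.0f, 3.0f, 4.0f);
    /* SQRTSS: square root of the low element only; bits 127:32 are carried from the source. */
    __m128 r = _mm_sqrt_ss(a);
    float out[4];
    _mm_storeu_ps(out, r);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); /* expected: 4 2 3 4 */
    return 0;
}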
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD subtract of the two, four or eight packed double precision floating-point values of the second
Source operand from the first Source operand, and stores the packed double precision floating-point results in the
destination operand.
VEX.128 and EVEX.128 encoded versions: The second source operand is an XMM register or an 128-bit memory
location. The first source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corre-
sponding destination register are zeroed.
VEX.256 and EVEX.256 encoded versions: The second source operand is an YMM register or an 256-bit memory
location. The first source operand and destination operands are YMM registers. Bits (MAXVL-1:256) of the corre-
sponding destination register are zeroed.
EVEX.512 encoded version: The second source operand is a ZMM register, a 512-bit memory location or a 512-bit
vector broadcasted from a 64-bit memory location. The first source operand and destination operands are ZMM
registers. The destination operand is conditionally updated according to the writemask.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper Bits (MAXVL-1:128) of the corresponding
register destination are unmodified.
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC1[i+63:i] - SRC2[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1)
THEN DEST[i+63:i] := SRC1[i+63:i] - SRC2[63:0];
ELSE DEST[i+63:i] := SRC1[i+63:i] - SRC2[i+63:i];
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
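A minimal C sketch of the merging-masking behavior shown in the pseudocode above; the _mm512_mask_sub_pd intrinsic is assumed, and both compiler and processor must support AVX-512F.
#include <immintrin.h>
#include <stdio.h>
int main(void) {
    __m512d src1 = _mm512_set1_pd(10.0);
    __m512d src2 = _mm512_set1_pd(3.0);
    __m512d old  = _mm512_set1_pd(-1.0);   /* pre-existing destination contents */
    __mmask8 k   = 0x0F;                   /* only elements 0..3 are written */
    /* Merging-masking form of VSUBPD: unselected elements keep the value from 'old'. */
    __m512d r = _mm512_mask_sub_pd(old, k, src1, src2);
    double out[8];
    _mm512_storeu_pd(out, r);
    for (int i = 0; i < 8; i++)
        printf("%g ", out[i]);             /* expected: 7 7 7 7 -1 -1 -1 -1 */
    printf("\n");
    return 0;
}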
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD subtract of the packed single precision floating-point values in the second Source operand from
the First Source operand, and stores the packed single precision floating-point results in the destination operand.
VEX.128 and EVEX.128 encoded versions: The second source operand is an XMM register or an 128-bit memory
location. The first source operand and destination operands are XMM registers. Bits (MAXVL-1:128) of the corre-
sponding destination register are zeroed.
VEX.256 and EVEX.256 encoded versions: The second source operand is an YMM register or an 256-bit memory
location. The first source operand and destination operands are YMM registers. Bits (MAXVL-1:256) of the corre-
sponding destination register are zeroed.
EVEX.512 encoded version: The second source operand is a ZMM register, a 512-bit memory location or a 512-bit
vector broadcasted from a 32-bit memory location. The first source operand and destination operands are ZMM
registers. The destination operand is conditionally updated according to the writemask.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper Bits (MAXVL-1:128) of the corresponding
register destination are unmodified.
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := SRC1[i+31:i] - SRC2[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
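For contrast with the merging form, the C sketch below illustrates zeroing-masking; the _mm512_maskz_sub_ps intrinsic is assumed, and AVX-512F support is required.
#include <immintrin.h>
#include <stdio.h>
int main(void) {
    __m512 a = _mm512_set1_ps(10.0f);
    __m512 b = _mm512_set1_ps(4.0f);
    __mmask16 k = 0x00FF;                  /* only elements 0..7 are computed */
    /* Zeroing-masking form of VSUBPS: unselected elements are set to 0. */
    __m512 r = _mm512_maskz_sub_ps(k, a, b);
    float out[16];
    _mm512_storeu_ps(out, r);
    for (int i = 0; i < 16; i++)
        printf("%g ", out[i]);             /* expected: eight 6s followed by eight 0s */
    printf("\n");
    return 0;
}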
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Subtract the low double precision floating-point value in the second source operand from the first source operand and store the double precision floating-point result in the low quadword of the destination operand.
The second source operand can be an XMM register or a 64-bit memory location. The first source and destination
operands are XMM registers.
128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:64) of the
corresponding destination register remain unchanged.
VEX.128 and EVEX encoded versions: Bits (127:64) of the XMM register destination are copied from corresponding
bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination operand is updated according to the write-
mask.
Software should ensure VSUBSD is encoded with VEX.L=0. Encoding VSUBSD with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.
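A minimal C sketch of the scalar subtract (low element computed, upper element copied from the first source) using _mm_sub_sd; an SSE2-capable compiler is assumed.
#include <immintrin.h>
#include <stdio.h>
int main(void) {
    __m128d a = _mm_set_pd(100.0, 7.0);    /* the upper element 100.0 is carried through */
    __m128d b = _mm_set_pd(999.0, 2.0);    /* only the low element 2.0 participates */
    /* Low result = a[0] - b[0]; upper result copied from the first operand. */
    __m128d r = _mm_sub_sd(a, b);
    double out[2];
    _mm_storeu_pd(out, r);
    printf("%f %f\n", out[0], out[1]);     /* expected: 5.000000 100.000000 */
    return 0;
}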
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Subtract the low single precision floating-point value in the second source operand from the first source operand and store the single precision floating-point result in the low doubleword of the destination operand.
The second source operand can be an XMM register or a 32-bit memory location. The first source and destination
operands are XMM registers.
128-bit Legacy SSE version: The destination and first source operand are the same. Bits (MAXVL-1:32) of the
corresponding destination register remain unchanged.
VEX.128 and EVEX encoded versions: Bits (127:32) of the XMM register destination are copied from corresponding
bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination operand is updated according to the write-
mask.
Software should ensure VSUBSS is encoded with VEX.L=0. Encoding VSUBSS with VEX.L=1 may encounter unpre-
dictable behavior across different processor generations.
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
Description
This instruction performs a set of SIMD dot-products of two FP16 elements and accumulates the results into a
packed single precision tile. Each dword element in input tiles tmm2 and tmm3 is interpreted as an FP16 pair. For
each possible combination of (row of tmm2, column of tmm3), the instruction performs a set of SIMD dot-products
on all corresponding FP16 pairs (one pair from tmm2 and one pair from tmm3), adds the results of those dot-prod-
ucts, and then accumulates the result into the corresponding row and column of tmm1.
“Round to nearest even” rounding mode is used when doing each accumulation of the Fused Multiply-Add (FMA).
Output FP32 denormals are always flushed to zero. Input FP16 denormals are always handled and not treated as
zero.
MXCSR is not consulted nor updated.
Any attempt to execute the TDPFP16PS instruction inside an Intel TSX transaction will result in a transaction abort.
Operation
TDPFP16PS tsrcdest, tsrc1, tsrc2
// C = m x n (tsrcdest), A = m x k (tsrc1), B = k x n (tsrc2)
// No exceptions raised or denoted.
tmpf32 := temp1.fp32[2*n] + temp1.fp32[2*n+1]
srcdest.row[m].fp32[n] := srcdest.row[m].fp32[n] + tmpf32
write_row_and_zero(tsrcdest, m, tmp, tsrcdest.colsb)
zero_upper_rows(tsrcdest, tsrcdest.rows)
zero_tileconfig_start()
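The scalar C model below is an illustrative sketch of the accumulation pattern described above, not the hardware algorithm and not the AMX intrinsic interface; the FP16 tile elements are modeled as floats that have already been converted, the rounding and denormal details are ignored, and the array shapes and the helper name tdpfp16ps_ref are placeholders chosen for the example.
#include <stdio.h>
#define M 2
#define K 2
#define N 2
/* C[m][n] accumulates, for every k, the dot product of the FP16 pair A[m][k] with B[k][n]. */
static void tdpfp16ps_ref(float c[M][N],
                          const float a[M][K][2],   /* tile A: M x K pairs */
                          const float b[K][N][2])   /* tile B: K x N pairs */
{
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++)
            for (int k = 0; k < K; k++)
                c[m][n] += a[m][k][0] * b[k][n][0] + a[m][k][1] * b[k][n][1];
}
int main(void) {
    float a[M][K][2] = {{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}};
    float b[K][N][2] = {{{1, 1}, {0, 1}}, {{2, 0}, {1, 0}}};
    float c[M][N] = {{0}};
    tdpfp16ps_ref(c, a, b);
    printf("%g %g / %g %g\n", c[0][0], c[0][1], c[1][0], c[1][1]); /* expected: 9 5 / 25 13 */
    return 0;
}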
Flags Affected
None.
Exceptions
AMX-E4; see Section 3.6, “Exception Classes” for details.
UCOMISD—Unordered Compare Scalar Double Precision Floating-Point Values and Set EFLAGS
66 0F 2E /r
UCOMISD xmm1, xmm2/m64
Op/En: A; 64/32-bit Mode Support: V/V; CPUID Feature Flag: SSE2
Compare low double precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.

VEX.LIG.66.0F.WIG 2E /r
VUCOMISD xmm1, xmm2/m64
Op/En: A; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX
Compare low double precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.

EVEX.LLIG.66.0F.W1 2E /r
VUCOMISD xmm1, xmm2/m64{sae}
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Compare low double precision floating-point values in xmm1 and xmm2/m64 and set the EFLAGS flags accordingly.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs an unordered compare of the double precision floating-point values in the low quadwords of operand 1
(first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according
to the result (unordered, greater than, less than, or equal). The OF, SF, and AF flags in the EFLAGS register are set
to 0. The unordered result is returned if either source operand is a NaN (QNaN or SNaN).
Operand 1 is an XMM register; operand 2 can be an XMM register or a 64-bit memory location.
The UCOMISD instruction differs from the COMISD instruction in that it signals a SIMD floating-point invalid oper-
ation exception (#I) only when a source operand is an SNaN. The COMISD instruction signals an invalid operation
exception only if a source operand is either an SNaN or a QNaN.
The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VUCOMISD is encoded with VEX.L=0. Encoding VUCOMISD with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
Operation
(V)UCOMISD (All Versions)
RESULT := UnorderedCompare(DEST[63:0] <> SRC[63:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
UNORDERED: ZF,PF,CF := 111;
GREATER_THAN: ZF,PF,CF := 000;
LESS_THAN: ZF,PF,CF := 001;
EQUAL: ZF,PF,CF := 100;
ESAC;
OF, AF, SF := 0; }
Intel C/C++ Compiler Intrinsic Equivalent
VUCOMISD int _mm_comi_round_sd(__m128d a, __m128d b, int imm, int sae);
UCOMISD int _mm_ucomieq_sd(__m128d a, __m128d b)
UCOMISD int _mm_ucomilt_sd(__m128d a, __m128d b)
UCOMISD int _mm_ucomile_sd(__m128d a, __m128d b)
UCOMISD int _mm_ucomigt_sd(__m128d a, __m128d b)
UCOMISD int _mm_ucomige_sd(__m128d a, __m128d b)
UCOMISD int _mm_ucomineq_sd(__m128d a, __m128d b)
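A minimal C sketch using the UCOMISD intrinsics listed above; each intrinsic returns 0 or 1 derived from the EFLAGS setting shown in the Operation section, and an SSE2-capable compiler providing <immintrin.h> is assumed.
#include <immintrin.h>
#include <stdio.h>
int main(void) {
    __m128d a = _mm_set_sd(1.0);   /* low element 1.0 */
    __m128d b = _mm_set_sd(2.0);   /* low element 2.0 */
    printf("a <  b : %d\n", _mm_ucomilt_sd(a, b));   /* expected: 1 */
    printf("a == b : %d\n", _mm_ucomieq_sd(a, b));   /* expected: 0 */
    printf("a >= b : %d\n", _mm_ucomige_sd(a, b));   /* expected: 0 */
    return 0;
}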
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
UCOMISS—Unordered Compare Scalar Single Precision Floating-Point Values and Set EFLAGS
NP 0F 2E /r
UCOMISS xmm1, xmm2/m32
Op/En: A; 64/32-bit Mode Support: V/V; CPUID Feature Flag: SSE
Compare low single precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.

VEX.LIG.0F.WIG 2E /r
VUCOMISS xmm1, xmm2/m32
Op/En: A; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX
Compare low single precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.

EVEX.LLIG.0F.W0 2E /r
VUCOMISS xmm1, xmm2/m32{sae}
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Compare low single precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compares the single precision floating-point values in the low doublewords of operand 1 (first operand) and
operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unor-
dered, greater than, less than, or equal). The OF, SF, and AF flags in the EFLAGS register are set to 0. The unor-
dered result is returned if either source operand is a NaN (QNaN or SNaN).
Operand 1 is an XMM register; operand 2 can be an XMM register or a 32-bit memory location.
The UCOMISS instruction differs from the COMISS instruction in that it signals a SIMD floating-point invalid opera-
tion exception (#I) only if a source operand is an SNaN. The COMISS instruction signals an invalid operation excep-
tion when a source operand is either a QNaN or SNaN.
The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b, otherwise instructions will #UD.
Software should ensure VUCOMISS is encoded with VEX.L=0. Encoding VUCOMISS with VEX.L=1 may encounter
unpredictable behavior across different processor generations.
Operation
(V)UCOMISS (All Versions)
RESULT := UnorderedCompare(DEST[31:0] <> SRC[31:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
UNORDERED: ZF,PF,CF := 111;
GREATER_THAN: ZF,PF,CF := 000;
LESS_THAN: ZF,PF,CF := 001;
EQUAL: ZF,PF,CF := 100;
ESAC;
OF, AF, SF := 0; }
Intel C/C++ Compiler Intrinsic Equivalent
VUCOMISS int _mm_comi_round_ss(__m128 a, __m128 b, int imm, int sae);
UCOMISS int _mm_ucomieq_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomilt_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomile_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomigt_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomige_ss(__m128 a, __m128 b);
UCOMISS int _mm_ucomineq_ss(__m128 a, __m128 b);
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions,” additionally:
#UD If VEX.vvvv != 1111B.
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
UIRET—User-Interrupt Return
F3 0F 01 EC
UIRET
Op/En: ZO; 64/32-bit Mode Support: V/I; CPUID Feature Flag: UINTR
Return from handling a user interrupt.
Description
UIRET returns from the handling of a user interrupt. It can be executed regardless of CPL.
Execution of UIRET inside a transactional region causes a transactional abort; the abort loads EAX as it would
have, had it been due to an execution of IRET.
UIRET can be tracked by Architectural Last Branch Records (LBRs), Intel Processor Trace (Intel PT), and Perfor-
mance Monitoring. For both Intel PT and LBRs, UIRET is recorded in precisely the same manner as IRET. Hence for
LBRs, UIRETs fall into the OTHER_BRANCH category, which implies that IA32_LBR_CTL.OTHER_BRANCH[bit 22]
must be set to record user-interrupt delivery, and that the IA32_LBR_x_INFO.BR_TYPE field will indicate
OTHER_BRANCH for any recorded user interrupt. For Intel PT, control flow tracing must be enabled by setting
IA32_RTIT_CTL.BranchEn[bit 13].
UIRET will also increment performance counters for which counting BR_INST_RETIRED.FAR_BRANCH is enabled.
Operation
Pop tempRIP;
Pop tempRFLAGS; // see below for how this is used to load RFLAGS
Pop tempRSP;
IF tempRIP is not canonical in current paging mode
THEN #GP(0);
FI;
IF ShadowStackEnabled(CPL)
THEN
PopShadowStack SSRIP;
IF SSRIP ≠ tempRIP
THEN #CP (FAR-RET/IRET);
FI;
FI;
RIP := tempRIP;
// update in RFLAGS only CF, PF, AF, ZF, SF, TF, DF, OF, NT, RF, AC, and ID
RFLAGS := (RFLAGS & ~254DD5H) | (tempRFLAGS & 254DD5H);
RSP := tempRSP;
IF CPUID.(EAX=07H, ECX=01H):EDX.UIRET_UIF[bit 17] = 1
THEN UIF := tempRFLAGS[1];
ELSE UIF := 1;
FI;
Clear any cache-line monitoring established by MONITOR or UMONITOR;
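As a worked check of the masking expression above, the short C sketch below rebuilds the constant 254DD5H from the listed RFLAGS bit positions (CF, PF, AF, ZF, SF, TF, DF, OF, NT, RF, AC, ID); the bit numbers are the standard RFLAGS positions.
#include <stdio.h>
int main(void) {
    /* RFLAGS bit positions of the flags UIRET loads from tempRFLAGS:
       CF=0, PF=2, AF=4, ZF=6, SF=7, TF=8, DF=10, OF=11, NT=14, RF=16, AC=18, ID=21. */
    const int bits[] = {0, 2, 4, 6, 7, 8, 10, 11, 14, 16, 18, 21};
    unsigned long mask = 0;
    for (unsigned i = 0; i < sizeof(bits) / sizeof(bits[0]); i++)
        mask |= 1UL << bits[i];
    printf("0x%lX\n", mask);   /* expected: 0x254DD5, matching the constant in the pseudocode */
    return 0;
}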
Flags Affected
See the Operation section.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs an interleaved unpack of the high double precision floating-point values from the first source operand and
the second source operand. See Figure 4-15 in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 2B.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified. When unpacking from a memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to 16-byte boundary and normal segment checking will still be enforced.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The destina-
tion operand is a ZMM register, conditionally updated using writemask k1.
EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 64-bit memory location. The destina-
tion operand is a YMM register, conditionally updated using writemask k1.
EVEX.128 encoded version: The first source operand is a XMM register. The second source operand is a XMM
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 64-bit memory location. The destina-
tion operand is a XMM register, conditionally updated using writemask k1.
Operation
VUNPCKHPD (EVEX Encoded Versions When SRC2 is a Register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF VL >= 128
TMP_DEST[63:0] := SRC1[127:64]
TMP_DEST[127:64] := SRC2[127:64]
FI;
IF VL >= 256
TMP_DEST[191:128] := SRC1[255:192]
TMP_DEST[255:192] := SRC2[255:192]
FI;
IF VL >= 512
TMP_DEST[319:256] := SRC1[383:320]
TMP_DEST[383:320] := SRC2[383:320]
TMP_DEST[447:384] := SRC1[511:448]
TMP_DEST[511:448] := SRC2[511:448]
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VUNPCKHPD (EVEX Encoded Version When SRC2 is Memory)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1)
THEN TMP_SRC2[i+63:i] := SRC2[63:0]
ELSE TMP_SRC2[i+63:i] := SRC2[i+63:i]
FI;
ENDFOR;
IF VL >= 128
TMP_DEST[63:0] := SRC1[127:64]
TMP_DEST[127:64] := TMP_SRC2[127:64]
FI;
IF VL >= 256
TMP_DEST[191:128] := SRC1[255:192]
TMP_DEST[255:192] := TMP_SRC2[255:192]
FI;
IF VL >= 512
TMP_DEST[319:256] := SRC1[383:320]
TMP_DEST[383:320] := TMP_SRC2[383:320]
TMP_DEST[447:384] := SRC1[511:448]
TMP_DEST[511:448] := TMP_SRC2[511:448]
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VUNPCKHPD __m512d _mm512_unpackhi_pd( __m512d a, __m512d b);
VUNPCKHPD __m512d _mm512_mask_unpackhi_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VUNPCKHPD __m512d _mm512_maskz_unpackhi_pd(__mmask8 k, __m512d a, __m512d b);
VUNPCKHPD __m256d _mm256_unpackhi_pd(__m256d a, __m256d b)
VUNPCKHPD __m256d _mm256_mask_unpackhi_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);
VUNPCKHPD __m256d _mm256_maskz_unpackhi_pd(__mmask8 k, __m256d a, __m256d b);
UNPCKHPD __m128d _mm_unpackhi_pd(__m128d a, __m128d b)
VUNPCKHPD __m128d _mm_mask_unpackhi_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VUNPCKHPD __m128d _mm_maskz_unpackhi_pd(__mmask8 k, __m128d a, __m128d b);
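A minimal C sketch of the high-quadword interleave using the _mm_unpackhi_pd intrinsic listed above; an SSE2-capable compiler providing <immintrin.h> is assumed.
#include <immintrin.h>
#include <stdio.h>
int main(void) {
    __m128d a = _mm_set_pd(2.0, 1.0);      /* {a1, a0} = {2.0, 1.0} */
    __m128d b = _mm_set_pd(4.0, 3.0);      /* {b1, b0} = {4.0, 3.0} */
    /* UNPCKHPD interleaves the high quadwords: result = {a1, b1} (low, high). */
    __m128d r = _mm_unpackhi_pd(a, b);
    double out[2];
    _mm_storeu_pd(out, r);
    printf("%f %f\n", out[0], out[1]);     /* expected: 2.000000 4.000000 */
    return 0;
}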
Other Exceptions
Non-EVEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-52, “Type E4NF Class Exception Conditions.”
UNPCKHPS—Unpack and Interleave High Packed Single Precision Floating-Point Values
NP 0F 15 /r
UNPCKHPS xmm1, xmm2/m128
Op/En: A; 64/32-bit Mode Support: V/V; CPUID Feature Flag: SSE
Unpacks and Interleaves single precision floating-point values from high quadwords of xmm1 and xmm2/m128.

VEX.128.0F.WIG 15 /r
VUNPCKHPS xmm1, xmm2, xmm3/m128
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX
Unpacks and Interleaves single precision floating-point values from high quadwords of xmm2 and xmm3/m128.

VEX.256.0F.WIG 15 /r
VUNPCKHPS ymm1, ymm2, ymm3/m256
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX
Unpacks and Interleaves single precision floating-point values from high quadwords of ymm2 and ymm3/m256.

EVEX.128.0F.W0 15 /r
VUNPCKHPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
Op/En: C; 64/32-bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Unpacks and Interleaves single precision floating-point values from high quadwords of xmm2 and xmm3/m128/m32bcst and write result to xmm1 subject to writemask k1.

EVEX.256.0F.W0 15 /r
VUNPCKHPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
Op/En: C; 64/32-bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Unpacks and Interleaves single precision floating-point values from high quadwords of ymm2 and ymm3/m256/m32bcst and write result to ymm1 subject to writemask k1.

EVEX.512.0F.W0 15 /r
VUNPCKHPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst
Op/En: C; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Unpacks and Interleaves single precision floating-point values from high quadwords of zmm2 and zmm3/m512/m32bcst and write result to zmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs an interleaved unpack of the high single precision floating-point values from the first source operand and
the second source operand.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified. When unpacking from a memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to 16-byte boundary and normal segment checking will still be enforced.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
VEX.256 encoded version: The second source operand is an YMM register or an 256-bit memory location. The first
source operand and destination operands are YMM registers.
SRC1: X7 X6 X5 X4 X3 X2 X1 X0
SRC2: Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0
DEST: Y7 X7 Y6 X6 Y3 X3 Y2 X2
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a ZMM register, conditionally updated using writemask k1.
EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a YMM register, conditionally updated using writemask k1.
EVEX.128 encoded version: The first source operand is a XMM register. The second source operand is a XMM
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a XMM register, conditionally updated using writemask k1.
Operation
VUNPCKHPS (EVEX Encoded Version When SRC2 is a Register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL >= 128
TMP_DEST[31:0] := SRC1[95:64]
TMP_DEST[63:32] := SRC2[95:64]
TMP_DEST[95:64] := SRC1[127:96]
TMP_DEST[127:96] := SRC2[127:96]
FI;
IF VL >= 256
TMP_DEST[159:128] := SRC1[223:192]
TMP_DEST[191:160] := SRC2[223:192]
TMP_DEST[223:192] := SRC1[255:224]
TMP_DEST[255:224] := SRC2[255:224]
FI;
IF VL >= 512
TMP_DEST[287:256] := SRC1[351:320]
TMP_DEST[319:288] := SRC2[351:320]
TMP_DEST[351:320] := SRC1[383:352]
TMP_DEST[383:352] := SRC2[383:352]
TMP_DEST[415:384] := SRC1[479:448]
TMP_DEST[447:416] := SRC2[479:448]
TMP_DEST[479:448] := SRC1[511:480]
TMP_DEST[511:480] := SRC2[511:480]
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-52, “Type E4NF Class Exception Conditions.”
UNPCKLPD—Unpack and Interleave Low Packed Double Precision Floating-Point Values
66 0F 14 /r
UNPCKLPD xmm1, xmm2/m128
Op/En: A; 64/32-bit Mode Support: V/V; CPUID Feature Flag: SSE2
Unpacks and Interleaves double precision floating-point values from low quadwords of xmm1 and xmm2/m128.

VEX.128.66.0F.WIG 14 /r
VUNPCKLPD xmm1, xmm2, xmm3/m128
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX
Unpacks and Interleaves double precision floating-point values from low quadwords of xmm2 and xmm3/m128.

VEX.256.66.0F.WIG 14 /r
VUNPCKLPD ymm1, ymm2, ymm3/m256
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX
Unpacks and Interleaves double precision floating-point values from low quadwords of ymm2 and ymm3/m256.

EVEX.128.66.0F.W1 14 /r
VUNPCKLPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
Op/En: C; 64/32-bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Unpacks and Interleaves double precision floating-point values from low quadwords of xmm2 and xmm3/m128/m64bcst subject to write mask k1.

EVEX.256.66.0F.W1 14 /r
VUNPCKLPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
Op/En: C; 64/32-bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Unpacks and Interleaves double precision floating-point values from low quadwords of ymm2 and ymm3/m256/m64bcst subject to write mask k1.

EVEX.512.66.0F.W1 14 /r
VUNPCKLPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst
Op/En: C; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Unpacks and Interleaves double precision floating-point values from low quadwords of zmm2 and zmm3/m512/m64bcst subject to write mask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs an interleaved unpack of the low double precision floating-point values from the first source operand and
the second source operand.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified. When unpacking from a memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to 16-byte boundary and normal segment checking will still be enforced.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The destina-
tion operand is a ZMM register, conditionally updated using writemask k1.
EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 64-bit memory location. The destina-
tion operand is a YMM register, conditionally updated using writemask k1.
EVEX.128 encoded version: The first source operand is an XMM register. The second source operand is a XMM
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 64-bit memory location. The destina-
tion operand is a XMM register, conditionally updated using writemask k1.
Operation
VUNPCKLPD (EVEX Encoded Versions When SRC2 is a Register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF VL >= 128
TMP_DEST[63:0] := SRC1[63:0]
TMP_DEST[127:64] := SRC2[63:0]
FI;
IF VL >= 256
TMP_DEST[191:128] := SRC1[191:128]
TMP_DEST[255:192] := SRC2[191:128]
FI;
IF VL >= 512
TMP_DEST[319:256] := SRC1[319:256]
TMP_DEST[383:320] := SRC2[319:256]
TMP_DEST[447:384] := SRC1[447:384]
TMP_DEST[511:448] := SRC2[447:384]
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VUNPCKLPD (EVEX Encoded Version When SRC2 is Memory)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1)
THEN TMP_SRC2[i+63:i] := SRC2[63:0]
ELSE TMP_SRC2[i+63:i] := SRC2[i+63:i]
FI;
ENDFOR;
IF VL >= 128
TMP_DEST[63:0] := SRC1[63:0]
TMP_DEST[127:64] := TMP_SRC2[63:0]
FI;
IF VL >= 256
TMP_DEST[191:128] := SRC1[191:128]
TMP_DEST[255:192] := TMP_SRC2[191:128]
FI;
IF VL >= 512
TMP_DEST[319:256] := SRC1[319:256]
TMP_DEST[383:320] := TMP_SRC2[319:256]
TMP_DEST[447:384] := SRC1[447:384]
TMP_DEST[511:448] := TMP_SRC2[447:384]
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VUNPCKLPD __m512d _mm512_unpacklo_pd( __m512d a, __m512d b);
VUNPCKLPD __m512d _mm512_mask_unpacklo_pd(__m512d s, __mmask8 k, __m512d a, __m512d b);
VUNPCKLPD __m512d _mm512_maskz_unpacklo_pd(__mmask8 k, __m512d a, __m512d b);
VUNPCKLPD __m256d _mm256_unpacklo_pd(__m256d a, __m256d b)
VUNPCKLPD __m256d _mm256_mask_unpacklo_pd(__m256d s, __mmask8 k, __m256d a, __m256d b);
VUNPCKLPD __m256d _mm256_maskz_unpacklo_pd(__mmask8 k, __m256d a, __m256d b);
UNPCKLPD __m128d _mm_unpacklo_pd(__m128d a, __m128d b)
VUNPCKLPD __m128d _mm_mask_unpacklo_pd(__m128d s, __mmask8 k, __m128d a, __m128d b);
VUNPCKLPD __m128d _mm_maskz_unpacklo_pd(__mmask8 k, __m128d a, __m128d b);
Other Exceptions
Non-EVEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-52, “Type E4NF Class Exception Conditions.”
UNPCKLPS—Unpack and Interleave Low Packed Single Precision Floating-Point Values
NP 0F 14 /r
UNPCKLPS xmm1, xmm2/m128
Op/En: A; 64/32-bit Mode Support: V/V; CPUID Feature Flag: SSE
Unpacks and Interleaves single precision floating-point values from low quadwords of xmm1 and xmm2/m128.

VEX.128.0F.WIG 14 /r
VUNPCKLPS xmm1, xmm2, xmm3/m128
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX
Unpacks and Interleaves single precision floating-point values from low quadwords of xmm2 and xmm3/m128.

VEX.256.0F.WIG 14 /r
VUNPCKLPS ymm1, ymm2, ymm3/m256
Op/En: B; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX
Unpacks and Interleaves single precision floating-point values from low quadwords of ymm2 and ymm3/m256.

EVEX.128.0F.W0 14 /r
VUNPCKLPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
Op/En: C; 64/32-bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Unpacks and Interleaves single precision floating-point values from low quadwords of xmm2 and xmm3/mem and write result to xmm1 subject to write mask k1.

EVEX.256.0F.W0 14 /r
VUNPCKLPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
Op/En: C; 64/32-bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Unpacks and Interleaves single precision floating-point values from low quadwords of ymm2 and ymm3/mem and write result to ymm1 subject to write mask k1.

EVEX.512.0F.W0 14 /r
VUNPCKLPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst
Op/En: C; 64/32-bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Unpacks and Interleaves single precision floating-point values from low quadwords of zmm2 and zmm3/m512/m32bcst and write result to zmm1 subject to write mask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs an interleaved unpack of the low single precision floating-point values from the first source operand and
the second source operand.
128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding ZMM register destination are unmodified. When unpacking from a memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to 16-byte boundary and normal segment checking will still be enforced.
VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (MAXVL-1:128) of the corresponding ZMM register destination are zeroed.
VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM
register or a 256-bit memory location. The destination operand is a YMM register.
SRC1: X7 X6 X5 X4 X3 X2 X1 X0
SRC2: Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0
DEST: Y5 X5 Y4 X4 Y1 X1 Y0 X0
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a ZMM register, conditionally updated using writemask k1.
EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a YMM register, conditionally updated using writemask k1.
EVEX.128 encoded version: The first source operand is an XMM register. The second source operand is a XMM
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a XMM register, conditionally updated using writemask k1.
Operation
VUNPCKLPS (EVEX Encoded Version When SRC2 is a ZMM Register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF VL >= 128
TMP_DEST[31:0] := SRC1[31:0]
TMP_DEST[63:32] := SRC2[31:0]
TMP_DEST[95:64] := SRC1[63:32]
TMP_DEST[127:96] := SRC2[63:32]
FI;
IF VL >= 256
TMP_DEST[159:128] := SRC1[159:128]
TMP_DEST[191:160] := SRC2[159:128]
TMP_DEST[223:192] := SRC1[191:160]
TMP_DEST[255:224] := SRC2[191:160]
FI;
IF VL >= 512
TMP_DEST[287:256] := SRC1[287:256]
TMP_DEST[319:288] := SRC2[287:256]
TMP_DEST[351:320] := SRC1[319:288]
TMP_DEST[383:352] := SRC2[319:288]
TMP_DEST[415:384] := SRC1[415:384]
TMP_DEST[447:416] := SRC2[415:384]
TMP_DEST[479:448] := SRC1[447:416]
TMP_DEST[511:480] := SRC2[447:416]
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-52, “Type E4NF Class Exception Conditions.”
8. Updates to Chapter 5, Volume 2C
Change bars and violet text show changes to Chapter 5 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2C: Instruction Set Reference, V.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Removed erroneous instruction listing.
• Updated the following instructions to add VEX-encoded forms: VCVTNEPS2BF16, VPMADD52HUQ, and
VPMADD52LUQ.
• Added the following instructions:
— VBCSTNEBF162PS
— VBCSTNESH2PS
— VCVTNEEBF162PS
— VCVTNEEPH2PS
— VCVTNEOBF162PS
— VCVTNEOPH2PS
— VSHA512MSG1
— VSHA512MSG2
— VSHA512RNDS2
— VSM3MSG1
— VSM3MSG2
— VSM3RNDS2
— VSM4KEY4
— VSM4RNDS4
— VPDPB[SU,UU,SS]D[,S]
— VPDPW[SU,US,UU]D[,S]
• Updated the VSCATTERDPS/VSCATTERDPD/VSCATTERQPS/VSCATTERQPD instructions with corrections.
• Added Intel® AVX10.1 information to the following instructions:
— VADDPH
— VADDSH
— VALIGND/VALIGNQ
— VBLENDMPD/VBLENDMPS
— VBROADCAST
— VCMPPH
— VCMPSH
— VCOMISH
— VCOMPRESSPD
— VCOMPRESSPS
— VCVTDQ2PH
— VCVTNE2PS2BF16
— VCVTNEPS2BF16
— VCVTPD2PH
— VCVTPD2QQ
— VCVTPD2UDQ
— VCVTPD2UQQ
— VCVTPH2DQ
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction adds packed FP16 values from source operands and stores the packed FP16 result in the destina-
tion operand. The destination elements are updated according to the writemask.
Operation
VADDPH (EVEX Encoded Versions) When SRC2 Operand is a Register
VL = 128, 256 or 512
KL := VL/16
IF (VL = 512) AND (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.fp16[j] := SRC1.fp16[j] + SRC2.fp16[j]
ELSEIF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
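The writemask behavior described above can be modeled in plain C; the following is a scalar reference sketch (FP16 elements are approximated with float purely for readability), not the processor implementation.

/* Reference model of merge- vs. zeroing-masking for a packed add: for each
 * element j, the sum is written only when k1 bit j is set; otherwise the
 * destination element is either kept (merge-masking) or cleared
 * (zeroing-masking), matching the Operation section above. */
static void addph_model(float *dest, const float *src1, const float *src2,
                        unsigned kl, unsigned k1, int zeroing)
{
    for (unsigned j = 0; j < kl; j++) {
        if (k1 & (1u << j))
            dest[j] = src1[j] + src2[j];
        else if (zeroing)
            dest[j] = 0.0f;
        /* else: merge-masking, dest[j] unchanged */
    }
}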
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction adds the low FP16 value from the source operands and stores the FP16 result in the destination
operand.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
Operation
VADDSH (EVEX Encoded Versions)
IF EVEX.b = 1 and SRC2 is a register:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)
IF k1[0] OR *no writemask*:
DEST.fp16[0] := SRC1.fp16[0] + SRC2.fp16[0]
ELSEIF *zeroing*:
DEST.fp16[0] := 0
// else dest.fp16[0] remains unchanged
DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Concatenates and shifts right doubleword/quadword elements of the first source operand (the second operand)
and the second source operand (the third operand) into a 1024/512/256-bit intermediate vector. The low
512/256/128-bit of the intermediate vector is written to the destination operand (the first operand) using the
writemask k1. The destination and first source operands are ZMM/YMM/XMM registers. The second source operand
can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted
from a 32/64-bit memory location.
This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1
are computed and stored into zmm1. Elements in zmm1 with the corresponding bit clear in k1 retain their previous
values (merging-masking) or are set to 0 (zeroing-masking).
Exceptions
See Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs an element-by-element blending between float64/float32 elements in the first source operand (the
second operand) with the elements in the second source operand (the third operand) using an opmask register as
select control. The blended result is written to the destination register.
The destination and first source operands are ZMM/YMM/XMM registers. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-
bit memory location.
The opmask register is not used as a writemask for this instruction. Instead, the mask is used as an element
selector: every element of the destination is conditionally selected between first source or second source using the
value of the related mask bit (0 for first source operand, 1 for second source operand).
If EVEX.z is set, the elements with corresponding mask bit value of 0 in the destination operand are zeroed.
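As a sketch of the selector semantics described above (not SDM text), the following C model picks each element from the first or second source according to the mask bit, with optional zeroing of unselected elements when EVEX.z is set.

/* Reference model of VBLENDMPD-style element selection: mask bit 0 selects
 * the first source, mask bit 1 selects the second source; when evex_z is
 * nonzero, elements whose mask bit is 0 are zeroed instead. */
static void blendm_model(double *dest, const double *src1, const double *src2,
                         unsigned kl, unsigned k, int evex_z)
{
    for (unsigned j = 0; j < kl; j++) {
        if (k & (1u << j))
            dest[j] = src2[j];
        else
            dest[j] = evex_z ? 0.0 : src1[j];
    }
}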
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VBROADCASTSD/VBROADCASTSS/VBROADCASTF128 load floating-point values as one tuple from the source
operand (second operand) in memory and broadcast to all elements of the destination operand (first operand).
VEX256-encoded versions: The destination operand is a YMM register. The source operand is either a 32-bit, 64-
bit, or 128-bit memory location. Register source encodings are reserved and will #UD. Bits (MAXVL-1:256) of the
destination register are zeroed.
EVEX-encoded versions: The destination operand is a ZMM/YMM/XMM register and updated according to the write-
mask k1. The source operand is either a 32-bit, 64-bit memory location or the low doubleword/quadword element
of an XMM register.
VBROADCASTF32X2/VBROADCASTF32X4/VBROADCASTF64X2/VBROADCASTF32X8/VBROADCASTF64X4 load
floating-point values as tuples from the source operand (the second operand) in memory or register and broadcast
to all elements of the destination operand (the first operand). The destination operand is a YMM/ZMM register
updated according to the writemask k1. The source operand is either a register or 64-bit/128-bit/256-bit memory
location.
VBROADCASTSD and VBROADCASTF128, F32x4, and F64x2 are only supported as 256-bit and 512-bit wide
versions. VBROADCASTSS is supported in 128-bit, 256-bit, and 512-bit wide versions. F32x8 and F64x4 are
only supported as 512-bit wide versions.
VBROADCASTF32X2/VBROADCASTF32X4/VBROADCASTF32X8 have 32-bit granularity. VBROADCASTF64X2 and
VBROADCASTF64X4 have 64-bit granularity.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
An attempt to execute VBROADCASTSD or VBROADCASTF128 encoded with VEX.L = 0 will cause an #UD exception.
(Broadcast operation figures omitted: a 32-, 64-, or 128-bit memory element X0 is replicated across the destination elements.)
Figure 5-5. VBROADCASTF64X4 Operation (512-bit version with writemask all 1s)
Operation
VBROADCASTSS (128-bit Version VEX and Legacy)
temp := SRC[31:0]
DEST[31:0] := temp
DEST[63:32] := temp
DEST[95:64] := temp
DEST[127:96] := temp
DEST[MAXVL-1:128] := 0
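For illustration (not SDM text), the VEX.256 form of the scalar broadcast is reachable from C through the AVX intrinsic _mm256_broadcast_ss, which loads one float from memory and replicates it across all eight destination elements.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float x = 3.5f;
    __m256 v = _mm256_broadcast_ss(&x);   /* all eight elements become 3.5 */

    float out[8];
    _mm256_storeu_ps(out, v);
    for (int i = 0; i < 8; i++)
        printf("%g ", out[i]);
    printf("\n");
    return 0;
}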
Exceptions
VEX-encoded instructions, see Table 2-23, “Type 6 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If VEX.L = 0 for VBROADCASTSD or VBROADCASTF128.
If EVEX.L’L = 0 for VBROADCASTSD/VBROADCASTF32X2/VBROADCASTF32X4/VBROADCASTF64X2.
If EVEX.L’L < 10b for VBROADCASTF32X8/VBROADCASTF64X4.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction compares packed FP16 values from source operands and stores the result in the destination mask
operand. The comparison predicate operand (immediate byte bits 4:0) specifies the type of comparison performed
on each of the pairs of packed values. The destination elements are updated according to the writemask.
Operation
CASE (imm8 & 0x1F) OF
0: CMP_OPERATOR := EQ_OQ;
1: CMP_OPERATOR := LT_OS;
2: CMP_OPERATOR := LE_OS;
3: CMP_OPERATOR := UNORD_Q;
4: CMP_OPERATOR := NEQ_UQ;
5: CMP_OPERATOR := NLT_US;
6: CMP_OPERATOR := NLE_US;
7: CMP_OPERATOR := ORD_Q;
8: CMP_OPERATOR := EQ_UQ;
9: CMP_OPERATOR := NGE_US;
10: CMP_OPERATOR := NGT_US;
11: CMP_OPERATOR := FALSE_OQ;
12: CMP_OPERATOR := NEQ_OQ;
13: CMP_OPERATOR := GE_OS;
14: CMP_OPERATOR := GT_OS;
15: CMP_OPERATOR := TRUE_UQ;
16: CMP_OPERATOR := EQ_OS;
FOR j := 0 TO KL-1:
IF k2[j] OR *no writemask*:
IF EVEX.b = 1:
tsrc2 := SRC2.fp16[0]
ELSE:
tsrc2 := SRC2.fp16[j]
DEST.bit[j] := SRC1.fp16[j] CMP_OPERATOR tsrc2
ELSE
DEST.bit[j] := 0
DEST[MAXKL-1:KL] := 0
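For illustration (not SDM text), compilers expose this packed FP16 comparison through an AVX512-FP16 intrinsic; the intrinsic name _mm512_cmp_ph_mask and the _CMP_LT_OS predicate macro used below are assumptions based on common compiler support and should be checked against your compiler's headers.

#include <immintrin.h>

/* Sketch: compare 32 packed FP16 values for "less than (ordered, signaling)"
 * and receive the per-element results as a 32-bit opmask. Requires a compiler
 * and CPU with AVX512-FP16 support. */
__mmask32 less_than_mask(__m512h a, __m512h b)
{
    return _mm512_cmp_ph_mask(a, b, _CMP_LT_OS);
}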
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction compares the FP16 values from the lowest element of the source operands and stores the result in
the destination mask operand. The comparison predicate operand (immediate byte bits 4:0) specifies the type of
comparison performed on the pair of packed FP16 values. The low destination bit is updated according to the write-
mask. Bits MAXKL-1:1 of the destination operand are zeroed.
Operation
CASE (imm8 & 0x1F) OF
0: CMP_OPERATOR := EQ_OQ;
1: CMP_OPERATOR := LT_OS;
2: CMP_OPERATOR := LE_OS;
3: CMP_OPERATOR := UNORD_Q;
4: CMP_OPERATOR := NEQ_UQ;
5: CMP_OPERATOR := NLT_US;
6: CMP_OPERATOR := NLE_US;
7: CMP_OPERATOR := ORD_Q;
8: CMP_OPERATOR := EQ_UQ;
9: CMP_OPERATOR := NGE_US;
10: CMP_OPERATOR := NGT_US;
11: CMP_OPERATOR := FALSE_OQ;
12: CMP_OPERATOR := NEQ_OQ;
13: CMP_OPERATOR := GE_OS;
14: CMP_OPERATOR := GT_OS;
15: CMP_OPERATOR := TRUE_UQ;
16: CMP_OPERATOR := EQ_OS;
17: CMP_OPERATOR := LT_OQ;
18: CMP_OPERATOR := LE_OQ;
19: CMP_OPERATOR := UNORD_S;
20: CMP_OPERATOR := NEQ_US;
21: CMP_OPERATOR := NLT_UQ;
22: CMP_OPERATOR := NLE_UQ;
23: CMP_OPERATOR := ORD_S;
24: CMP_OPERATOR := EQ_US;
25: CMP_OPERATOR := NGE_UQ;
DEST[MAXKL-1:1] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction compares the FP16 values in the low word of operand 1 (first operand) and operand 2 (second
operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than,
less than, or equal). The OF, SF and AF flags in the EFLAGS register are set to 0. The unordered result is returned
if either source operand is a NaN (QNaN or SNaN).
Operand 1 is an XMM register; operand 2 can be an XMM register or a 16-bit memory location.
The VCOMISH instruction differs from the VUCOMISH instruction in that it signals a SIMD floating-point invalid oper-
ation exception (#I) when a source operand is either a QNaN or SNaN. The VUCOMISH instruction signals an invalid
numeric exception only if a source operand is an SNaN.
The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated. EVEX.vvvv is
reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCOMISH SRC1, SRC2
RESULT := OrderedCompare(SRC1.fp16[0],SRC2.fp16[0])
IF RESULT is UNORDERED:
ZF, PF, CF := 1, 1, 1
ELSE IF RESULT is GREATER_THAN:
ZF, PF, CF := 0, 0, 0
ELSE IF RESULT is LESS_THAN:
ZF, PF, CF := 0, 0, 1
ELSE: // RESULT is EQUALS
ZF, PF, CF := 1, 0, 0
OF, AF, SF := 0, 0, 0
SIMD Floating-Point Exceptions
Invalid, Denormal.
Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
VCOMPRESSPD—Store Sparse Packed Double Precision Floating-Point Values Into Dense
Memory
EVEX.128.66.0F38.W1 8A /r VCOMPRESSPD xmm1/m128 {k1}{z}, xmm2
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 [1]
  Compress packed double precision floating-point values from xmm2 to xmm1/m128 using writemask k1.
EVEX.256.66.0F38.W1 8A /r VCOMPRESSPD ymm1/m256 {k1}{z}, ymm2
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 [1]
  Compress packed double precision floating-point values from ymm2 to ymm1/m256 using writemask k1.
EVEX.512.66.0F38.W1 8A /r VCOMPRESSPD zmm1/m512 {k1}{z}, zmm2
  Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512F OR AVX10.1 [1]
  Compress packed double precision floating-point values from zmm2 using control mask k1 to zmm1/m512.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compress (store) up to 8 double precision floating-point values from the source operand (the second operand) as
a contiguous vector to the destination operand (the first operand). The source operand is a ZMM/YMM/XMM register,
the destination operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The opmask register k1 selects the active elements (partial vector or possibly non-contiguous if less than 8 active
elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the
destination starting from the low element of the destination operand.
Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z
must be zero.
Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the
source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper
bits are zeroed.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
Operation
VCOMPRESSPD (EVEX Encoded Versions) Store Form
(KL, VL) = (2, 128), (4, 256), (8, 512)
SIZE := 64
k := 0
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
DEST[k+SIZE-1:k] := SRC[i+63:i]
k := k + SIZE
FI;
ENDFOR
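For illustration (not SDM text), the memory-destination form described above corresponds to the AVX-512 intrinsic _mm512_mask_compressstoreu_pd, which stores only the selected elements contiguously.

#include <immintrin.h>

/* Sketch: store only the elements of v whose mask bit is set, packed
 * contiguously starting at dst. With mask 0b10100101, four doubles
 * (elements 0, 2, 5, and 7) would be written. Requires AVX-512F support. */
void compress_selected(double *dst, __m512d v, __mmask8 mask)
{
    _mm512_mask_compressstoreu_pd(dst, mask, v);
}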
Other Exceptions
EVEX-encoded instructions, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCOMPRESSPS—Store Sparse Packed Single Precision Floating-Point Values Into Dense Memory
EVEX.128.66.0F38.W0 8A /r VCOMPRESSPS xmm1/m128 {k1}{z}, xmm2
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 [1]
  Compress packed single precision floating-point values from xmm2 to xmm1/m128 using writemask k1.
EVEX.256.66.0F38.W0 8A /r VCOMPRESSPS ymm1/m256 {k1}{z}, ymm2
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 [1]
  Compress packed single precision floating-point values from ymm2 to ymm1/m256 using writemask k1.
EVEX.512.66.0F38.W0 8A /r VCOMPRESSPS zmm1/m512 {k1}{z}, zmm2
  Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512F OR AVX10.1 [1]
  Compress packed single precision floating-point values from zmm2 using control mask k1 to zmm1/m512.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compress (stores) up to 16 single precision floating-point values from the source operand (the second operand) to
the destination operand (the first operand). The source operand is a ZMM/YMM/XMM register, the destination
operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The opmask register k1 selects the active elements (a partial vector or possibly non-contiguous if less than 16
active elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to
the destination starting from the low element of the destination operand.
Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z
must be zero.
Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the
source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper
bits are zeroed.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
Operation
VCOMPRESSPS (EVEX Encoded Versions) Store Form
(KL, VL) = (4, 128), (8, 256), (16, 512)
SIZE := 32
k := 0
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
DEST[k+SIZE-1:k] := SRC[i+31:i]
k := k + SIZE
FI;
ENDFOR;
Other Exceptions
EVEX-encoded instructions, see Exceptions Type E4.nb. in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTDQ2PH—Convert Packed Signed Doubleword Integers to Packed FP16 Values
EVEX.128.NP.MAP5.W0 5B /r VCVTDQ2PH xmm1{k1}{z}, xmm2/m128/m32bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 [1]
  Convert four packed signed doubleword integers from xmm2/m128/m32bcst to four packed FP16 values, and store the result in xmm1 subject to writemask k1.
EVEX.256.NP.MAP5.W0 5B /r VCVTDQ2PH xmm1{k1}{z}, ymm2/m256/m32bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 [1]
  Convert eight packed signed doubleword integers from ymm2/m256/m32bcst to eight packed FP16 values, and store the result in xmm1 subject to writemask k1.
EVEX.512.NP.MAP5.W0 5B /r VCVTDQ2PH ymm1{k1}{z}, zmm2/m512/m32bcst {er}
  Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512-FP16 OR AVX10.1 [1]
  Convert sixteen packed signed doubleword integers from zmm2/m512/m32bcst to sixteen packed FP16 values, and store the result in ymm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts four, eight, or sixteen packed signed doubleword integers in the source operand to four,
eight, or sixteen packed FP16 values in the destination operand.
EVEX encoded versions: The source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcast from a 32-bit memory location. The destination operand is a YMM/XMM
register conditionally updated with writemask k1.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
If the result of the conversion overflows and MXCSR.OM=0, then a SIMD exception will be raised with OE=1, PE=1.
Operation
VCVTDQ2PH DEST, SRC
VL = 128, 256 or 512
KL := VL / 32
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.dword[0]
ELSE
tsrc := SRC.dword[j]
DEST.fp16[j] := Convert_integer32_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL/2] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VCVTNE2PS2BF16—Convert Two Packed Single Data to One Packed BF16 Data
EVEX.128.F2.0F38.W0 72 /r VCVTNE2PS2BF16 xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512_BF16 AND AVX512VL) OR AVX10.1 [1]
  Convert packed single data from xmm2 and xmm3/m128/m32bcst to packed BF16 data in xmm1 with writemask k1.
EVEX.256.F2.0F38.W0 72 /r VCVTNE2PS2BF16 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512_BF16 AND AVX512VL) OR AVX10.1 [1]
  Convert packed single data from ymm2 and ymm3/m256/m32bcst to packed BF16 data in ymm1 with writemask k1.
EVEX.512.F2.0F38.W0 72 /r VCVTNE2PS2BF16 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512_BF16 AND AVX512F) OR AVX10.1 [1]
  Convert packed single data from zmm2 and zmm3/m512/m32bcst to packed BF16 data in zmm1 with writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts two SIMD registers of packed single data into a single register of packed BF16 data.
This instruction does not support memory fault suppression.
This instruction uses “Round to nearest (even)” rounding mode. Output denormals are always flushed to zero and
input denormals are always treated as zero. MXCSR is not consulted nor updated. No floating-point exceptions are
generated.
Operation
VCVTNE2PS2BF16 dest, src1, src2
VL = (128, 256, 512)
KL = VL/16
origdest := dest
FOR i := 0 to KL-1:
IF k1[ i ] or *no writemask*:
IF i < KL/2:
IF src2 is memory and evex.b == 1:
t := src2.fp32[0]
ELSE:
t := src2.fp32[ i ]
ELSE:
t := src1.fp32[ i-KL/2]
dest.word[i] := convert_fp32_to_bfloat16(t)
ELSE IF *zeroing*:
dest.word[ i ] := 0
ELSE: // Merge masking, dest element unchanged
dest.word[ i ] := origdest.word[ i ]
DEST[MAXVL-1:VL] := 0
Other Exceptions
See Table 2-52, “Type E4NF Class Exception Conditions.”
VCVTNEPS2BF16—Convert Packed Single Data to Packed BF16 Data
VEX.128.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1, xmm2/m128
  Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX-NE-CONVERT
  Convert packed single precision floating-point values from xmm2/m128 to packed BF16 values and store in xmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction loads packed FP32 elements from a SIMD register or memory, converts the elements to BF16, and
writes the result to the destination SIMD register.
The upper bits of the destination register beyond the down-converted BF16 elements are zeroed.
This instruction uses “Round to nearest (even)” rounding mode. Output denormals are always flushed to zero and
input denormals are always treated as zero. MXCSR is not consulted nor updated.
As the instruction operand encoding table shows, the EVEX.vvvv field is not used for encoding an operand.
EVEX.vvvv is reserved and must be 0b1111 otherwise instructions will #UD.
Operation
Define convert_fp32_to_bfloat16(x):
IF x is zero or denormal:
dest[15] := x[31] // sign preserving zero (denormal go to zero)
dest[14:0] := 0
ELSE IF x is infinity:
dest[15:0] := x[31:16]
ELSE IF x is NAN:
dest[15:0] := x[31:16] // truncate and set MSB of the mantissa to force QNAN
dest[6] := 1
ELSE // normal number
LSB := x[16]
rounding_bias := 0x7FFF + LSB
temp[31:0] := x[31:0] + rounding_bias // integer add
dest[15:0] := temp[31:16]
RETURN dest

VCVTNEPS2BF16 dest, src (VEX Encoded Versions)
VL = 128, 256
KL = VL/16
FOR i := 0 to KL/2-1:
t := src.fp32[i]
dest.word[i] := convert_fp32_to_bfloat16(t)
DEST[MAXVL-1:VL/2] := 0
VCVTNEPS2BF16 dest, src (EVEX Encoded Versions)
VL = 128, 256, 512
KL = VL/16
origdest := dest
FOR i := 0 to KL/2-1:
IF k1[ i ] or *no writemask*:
IF src is memory and evex.b == 1:
t := src.fp32[0]
ELSE:
t := src.fp32[ i ]
dest.word[i] := convert_fp32_to_bfloat16(t)
ELSE IF *zeroing*:
dest.word[ i ] := 0
ELSE: // Merge masking, dest element unchanged
dest.word[ i ] := origdest.word[ i ]
DEST[MAXVL-1:VL/2] := 0
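The conversion behavior described above (round to nearest even, denormal inputs treated as zero and denormal outputs flushed to zero, NaNs quieted) can be modeled in C; the helper below is an illustrative sketch mirroring the pseudocode, not a compiler intrinsic.

#include <stdint.h>
#include <string.h>

/* Reference sketch of the FP32 -> BF16 conversion: zero/denormal inputs
 * become a sign-preserving zero, infinities keep the upper 16 bits, NaNs
 * are quieted, and normal numbers are rounded to nearest even by adding a
 * bias before truncating to the upper 16 bits. */
static uint16_t fp32_to_bf16(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);

    uint32_t exp = (x >> 23) & 0xFFu;
    if (exp == 0)                           /* zero or denormal */
        return (uint16_t)((x >> 16) & 0x8000u);
    if (exp == 0xFFu) {                     /* infinity or NaN */
        uint16_t hi = (uint16_t)(x >> 16);
        if (x & 0x007FFFFFu)
            hi |= 0x0040u;                  /* force the quiet-NaN bit */
        return hi;
    }
    uint32_t lsb = (x >> 16) & 1u;          /* round to nearest even */
    return (uint16_t)((x + 0x7FFFu + lsb) >> 16);
}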
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts two, four, or eight packed double precision floating-point values in the source operand
(second operand) to two, four, or eight packed FP16 values in the destination operand (first operand). When a
conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or
the embedded rounding control bits.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or
a 512/256/128-bit vector broadcast from a 64-bit memory location. The destination operand is a XMM register
conditionally updated with writemask k1. The upper bits (MAXVL-1:128/64/32) of the corresponding destination
are zeroed.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
This instruction uses MXCSR.DAZ for handling FP64 inputs. FP16 outputs can be normal or denormal, and are not
conditionally flushed to zero.
Operation
VCVTPD2PH DEST, SRC
VL = 128, 256 or 512
KL := VL / 64
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.double[0]
ELSE
tsrc := SRC.double[j]
DEST.fp16[j] := Convert_fp64_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL/4] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VCVTPD2QQ—Convert Packed Double Precision Floating-Point Values to Packed Quadword
Integers
EVEX.128.66.0F.W1 7B /r VCVTPD2QQ xmm1 {k1}{z}, xmm2/m128/m64bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 [1]
  Convert two packed double precision floating-point values from xmm2/m128/m64bcst to two packed quadword integers in xmm1 with writemask k1.
EVEX.256.66.0F.W1 7B /r VCVTPD2QQ ymm1 {k1}{z}, ymm2/m256/m64bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 [1]
  Convert four packed double precision floating-point values from ymm2/m256/m64bcst to four packed quadword integers in ymm1 with writemask k1.
EVEX.512.66.0F.W1 7B /r VCVTPD2QQ zmm1 {k1}{z}, zmm2/m512/m64bcst {er}
  Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512DQ OR AVX10.1 [1]
  Convert eight packed double precision floating-point values from zmm2/m512/m64bcst to eight packed quadword integers in zmm1 with writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts packed double precision floating-point values in the source operand (second operand) to packed quad-
word integers in the destination operand (first operand).
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value
(2^(w-1), where w represents the number of bits in the destination format) is returned.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation
VCVTPD2QQ (EVEX Encoded Version) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_QuadInteger(SRC[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTPD2QQ (EVEX Encoded Version) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] := Convert_Double_Precision_Floating_Point_To_QuadInteger(SRC[63:0])
ELSE
DEST[i+63:i] := Convert_Double_Precision_Floating_Point_To_QuadInteger(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTPD2UDQ—Convert Packed Double Precision Floating-Point Values to Packed Unsigned
Doubleword Integers
EVEX.128.0F.W1 79 /r VCVTPD2UDQ xmm1 {k1}{z}, xmm2/m128/m64bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 [1]
  Convert two packed double precision floating-point values in xmm2/m128/m64bcst to two unsigned doubleword integers in xmm1 subject to writemask k1.
EVEX.256.0F.W1 79 /r VCVTPD2UDQ xmm1 {k1}{z}, ymm2/m256/m64bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512F) OR AVX10.1 [1]
  Convert four packed double precision floating-point values in ymm2/m256/m64bcst to four unsigned doubleword integers in xmm1 subject to writemask k1.
EVEX.512.0F.W1 79 /r VCVTPD2UDQ ymm1 {k1}{z}, zmm2/m512/m64bcst {er}
  Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512F OR AVX10.1 [1]
  Convert eight packed double precision floating-point values in zmm2/m512/m64bcst to eight unsigned doubleword integers in ymm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts packed double precision floating-point values in the source operand (the second operand) to packed
unsigned doubleword integers in the destination operand (the first operand).
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is
returned, where w represents the number of bits in the destination format.
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1. The upper bits (MAXVL-1:256) of the corresponding destination are zeroed.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation
VCVTPD2UDQ (EVEX Encoded Versions) When SRC2 Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_UInteger(SRC[k+63:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
VCVTPD2UDQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_UInteger(SRC[63:0])
ELSE
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_UInteger(SRC[k+63:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTPD2UQQ—Convert Packed Double Precision Floating-Point Values to Packed Unsigned
Quadword Integers
EVEX.128.66.0F.W1 79 /r VCVTPD2UQQ xmm1 {k1}{z}, xmm2/m128/m64bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 [1]
  Convert two packed double precision floating-point values from xmm2/mem to two packed unsigned quadword integers in xmm1 with writemask k1.
EVEX.256.66.0F.W1 79 /r VCVTPD2UQQ ymm1 {k1}{z}, ymm2/m256/m64bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512VL AND AVX512DQ) OR AVX10.1 [1]
  Convert four packed double precision floating-point values from ymm2/mem to four packed unsigned quadword integers in ymm1 with writemask k1.
EVEX.512.66.0F.W1 79 /r VCVTPD2UQQ zmm1 {k1}{z}, zmm2/m512/m64bcst {er}
  Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512DQ OR AVX10.1 [1]
  Convert eight packed double precision floating-point values from zmm2/mem to eight packed unsigned quadword integers in zmm1 with writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts packed double precision floating-point values in the source operand (second operand) to packed unsigned
quadword integers in the destination operand (first operand).
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w – 1 is
returned, where w represents the number of bits in the destination format.
The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand
is a ZMM/YMM/XMM register conditionally updated with writemask k1.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation
VCVTPD2UQQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_UQuadInteger(SRC[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTPD2UQQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_UQuadInteger(SRC[63:0])
ELSE
DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_UQuadInteger(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTPH2DQ—Convert Packed FP16 Values to Signed Doubleword Integers
EVEX.128.66.MAP5.W0 5B /r VCVTPH2DQ xmm1{k1}{z}, xmm2/m64/m16bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 [1]
  Convert four packed FP16 values in xmm2/m64/m16bcst to four signed doubleword integers, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP5.W0 5B /r VCVTPH2DQ ymm1{k1}{z}, xmm2/m128/m16bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 [1]
  Convert eight packed FP16 values in xmm2/m128/m16bcst to eight signed doubleword integers, and store the result in ymm1 subject to writemask k1.
EVEX.512.66.MAP5.W0 5B /r VCVTPH2DQ zmm1{k1}{z}, ymm2/m256/m16bcst {er}
  Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512-FP16 OR AVX10.1 [1]
  Convert sixteen packed FP16 values in ymm2/m256/m16bcst to sixteen signed doubleword integers, and store the result in zmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed FP16 values in the source operand to signed doubleword integers in the destination
operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value is
returned.
The destination elements are updated according to the writemask.
Operation
VCVTPH2DQ DEST, SRC
VL = 128, 256 or 512
KL := VL / 32
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.dword[j] := Convert_fp16_to_integer32(tsrc)
ELSE IF *zeroing*:
DEST.dword[j] := 0
// else dest.dword[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed FP16 values to FP64 values in the destination register. The destination elements
are updated according to the writemask.
This instruction handles both normal and denormal FP16 inputs.
Operation
VCVTPH2PD DEST, SRC
VL = 128, 256, or 512
KL := VL/64
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.fp64[j] := Convert_fp16_to_fp64(tsrc)
ELSE IF *zeroing*:
DEST.fp64[j] := 0
// else dest.fp64[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed half precision (16-bit) floating-point values in the low-order bits of the source
operand (the second operand) to packed single precision floating-point values and writes the converted values into
the destination operand (the first operand).
In case of a denormal operand, the correct normal result is returned. MXCSR.DAZ is ignored and is treated as if it
were 0. No denormal exception is reported on MXCSR.
VEX.128 version: The source operand is a XMM register or 64-bit memory location. The destination operand is a
XMM register. The upper bits (MAXVL-1:128) of the corresponding destination register are zeroed.
VEX.256 version: The source operand is a XMM register or 128-bit memory location. The destination operand is a
YMM register. Bits (MAXVL-1:256) of the corresponding destination register are zeroed.
EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64-bits) register or a 256/128/64-bit
memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
The diagram below illustrates how data is converted from four packed half precision (in 64 bits) to four single preci-
sion (in 128 bits) floating-point values.
Note: VEX.vvvv and EVEX.vvvv are reserved (must be 1111b).
Figure: VCVTPH2PS (128-bit version) — four packed FP16 values are each converted to single precision and written as VS3 VS2 VS1 VS0 to xmm1 (bits 127:0).
The VCVTPH2PSX instruction is a new form of the PH to PS conversion instruction, encoded in map 6. The previous
version of the instruction, VCVTPH2PS, which is present in AVX512F (encoded in map 2, 0F38), does not support
embedded broadcasting. The VCVTPH2PSX instruction has the embedded broadcasting option available.
The instructions associated with AVX512_FP16 always handle FP16 denormal number inputs; denormal inputs are
not treated as zero.
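For illustration (not SDM text), the 128-bit VEX form of this conversion is reachable from C through the F16C intrinsic _mm_cvtph_ps, which expands the four FP16 values held in the low 64 bits of an XMM register to four single precision values.

#include <immintrin.h>

/* Sketch: convert the four FP16 values in the low 64 bits of h to four
 * single precision values. Requires F16C support (e.g., compile with -mf16c). */
__m128 convert_four_fp16(__m128i h)
{
    return _mm_cvtph_ps(h);
}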
Operation
vCvt_h2s(SRC1[15:0])
{
RETURN Cvt_Half_Precision_To_Single_Precision(SRC1[15:0]);
}
VCVTPH2PS (VEX.256 Encoded Version)
DEST[31:0] := vCvt_h2s(SRC1[15:0]);
DEST[63:32] := vCvt_h2s(SRC1[31:16]);
DEST[95:64] := vCvt_h2s(SRC1[47:32]);
DEST[127:96] := vCvt_h2s(SRC1[63:48]);
DEST[159:128] := vCvt_h2s(SRC1[79:64]);
DEST[191:160] := vCvt_h2s(SRC1[95:80]);
DEST[223:192] := vCvt_h2s(SRC1[111:96]);
DEST[255:224] := vCvt_h2s(SRC1[127:112]);
DEST[MAXVL-1:256] := 0
VCVTPH2PSX DEST, SRC (EVEX Encoded Versions)
VL = 128, 256 or 512
KL := VL / 32
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.fp32[j] := Convert_fp16_to_fp32(tsrc)
ELSE IF *zeroing*:
DEST.fp32[j] := 0
// else dest.fp32[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Flags Affected
None.
Intel C/C++ Compiler Intrinsic Equivalent
VCVTPH2PSX __m512 _mm512_cvtx_roundph_ps (__m256h a, int sae);
VCVTPH2PSX __m512 _mm512_mask_cvtx_roundph_ps (__m512 src, __mmask16 k, __m256h a, int sae);
VCVTPH2PSX __m512 _mm512_maskz_cvtx_roundph_ps (__mmask16 k, __m256h a, int sae);
VCVTPH2PSX __m128 _mm_cvtxph_ps (__m128h a);
VCVTPH2PSX __m128 _mm_mask_cvtxph_ps (__m128 src, __mmask8 k, __m128h a);
VCVTPH2PSX __m128 _mm_maskz_cvtxph_ps (__mmask8 k, __m128h a);
VCVTPH2PSX __m256 _mm256_cvtxph_ps (__m128h a);
VCVTPH2PSX __m256 _mm256_mask_cvtxph_ps (__m256 src, __mmask8 k, __m128h a);
VCVTPH2PSX __m256 _mm256_maskz_cvtxph_ps (__mmask8 k, __m128h a);
VCVTPH2PSX __m512 _mm512_cvtxph_ps (__m256h a);
VCVTPH2PSX __m512 _mm512_mask_cvtxph_ps (__m512 src, __mmask16 k, __m256h a);
VCVTPH2PSX __m512 _mm512_maskz_cvtxph_ps (__mmask16 k, __m256h a);
Other Exceptions
VEX-encoded instructions, see Table 2-26, “Type 11 Class Exception Conditions” (do not report #AC).
EVEX-encoded instructions, see Table 2-62, “Type E11 Class Exception Conditions.”
EVEX-encoded instructions with broadcast (VCVTPH2PSX), see Table 2-46, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.W=1.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
VCVTPH2QQ—Convert Packed FP16 Values to Signed Quadword Integer Values
EVEX.128.66.MAP5.W0 7B /r VCVTPH2QQ xmm1{k1}{z}, xmm2/m32/m16bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 [1]
  Convert two packed FP16 values in xmm2/m32/m16bcst to two signed quadword integers, and store the result in xmm1 subject to writemask k1.
EVEX.256.66.MAP5.W0 7B /r VCVTPH2QQ ymm1{k1}{z}, xmm2/m64/m16bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 [1]
  Convert four packed FP16 values in xmm2/m64/m16bcst to four signed quadword integers, and store the result in ymm1 subject to writemask k1.
EVEX.512.66.MAP5.W0 7B /r VCVTPH2QQ zmm1{k1}{z}, xmm2/m128/m16bcst {er}
  Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512-FP16 OR AVX10.1 [1]
  Convert eight packed FP16 values in xmm2/m128/m16bcst to eight signed quadword integers, and store the result in zmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed FP16 values in the source operand to signed quadword integers in the destination
operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value is
returned.
The destination elements are updated according to the writemask.
Operation
VCVTPH2QQ DEST, SRC
VL = 128, 256 or 512
KL := VL / 64
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.qword[j] := Convert_fp16_to_integer64(tsrc)
ELSE IF *zeroing*:
DEST.qword[j] := 0
// else dest.qword[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VCVTPH2UDQ—Convert Packed FP16 Values to Unsigned Doubleword Integers
EVEX.128.NP.MAP5.W0 79 /r VCVTPH2UDQ xmm1{k1}{z}, xmm2/m64/m16bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 [1]
  Convert four packed FP16 values in xmm2/m64/m16bcst to four unsigned doubleword integers, and store the result in xmm1 subject to writemask k1.
EVEX.256.NP.MAP5.W0 79 /r VCVTPH2UDQ ymm1{k1}{z}, xmm2/m128/m16bcst
  Op/En: A; 64/32-bit Mode: V/V; CPUID: (AVX512-FP16 AND AVX512VL) OR AVX10.1 [1]
  Convert eight packed FP16 values in xmm2/m128/m16bcst to eight unsigned doubleword integers, and store the result in ymm1 subject to writemask k1.
EVEX.512.NP.MAP5.W0 79 /r VCVTPH2UDQ zmm1{k1}{z}, ymm2/m256/m16bcst {er}
  Op/En: A; 64/32-bit Mode: V/V; CPUID: AVX512-FP16 OR AVX10.1 [1]
  Convert sixteen packed FP16 values in ymm2/m256/m16bcst to sixteen unsigned doubleword integers, and store the result in zmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed FP16 values in the source operand to unsigned doubleword integers in the destination
operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value is
returned.
The destination elements are updated according to the writemask.
Operation
VCVTPH2UDQ DEST, SRC
VL = 128, 256 or 512
KL := VL / 32
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.dword[j] := Convert_fp16_to_unsigned_integer32(tsrc)
ELSE IF *zeroing*:
DEST.dword[j] := 0
// else dest.dword[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed FP16 values in the source operand to unsigned quadword integers in the destination
operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value is
returned.
The destination elements are updated according to the writemask.
Operation
VCVTPH2UQQ DEST, SRC
VL = 128, 256 or 512
KL := VL / 64
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.qword[j] := Convert_fp16_to_unsigned_integer64(tsrc)
ELSE IF *zeroing*:
DEST.qword[j] := 0
// else dest.qword[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed FP16 values in the source operand to unsigned word integers in the destination
operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value is
returned.
The destination elements are updated according to the writemask.
Operation
VCVTPH2UW DEST, SRC
VL = 128, 256 or 512
KL := VL / 16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.word[j] := Convert_fp16_to_unsigned_integer16(tsrc)
ELSE IF *zeroing*:
DEST.word[j] := 0
// else dest.word[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed FP16 values in the source operand to signed word integers in the destination
operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value is
returned.
The destination elements are updated according to the writemask.
Operation
VCVTPH2W DEST, SRC
VL = 128, 256 or 512
KL := VL / 16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.word[j] := Convert_fp16_to_integer16(tsrc)
ELSE IF *zeroing*:
DEST.word[j] := 0
// else dest.word[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Convert packed single precision floating-point values in the source operand to half-precision (16-bit) floating-point values and store the results to the destination operand. The rounding mode is specified using the immediate field (imm8).
Underflow results (i.e., tiny results) are converted to denormals. MXCSR.FTZ is ignored. If a source element is denormal relative to the input format with DM masked and at least one of PM or UM unmasked, a SIMD exception will be raised with DE, UE, and PE set.
Figure: the 128-bit form of VCVTPS2PH packs the four converted FP16 results (VH3, VH2, VH1, VH0) into bits 63:0 of the destination xmm1/mem64.
The immediate byte defines several bit fields that control the rounding operation. The effect and encoding of the RC field are listed in Table 5-3.
Table 5-3. Immediate Byte Encoding for 16-bit Floating-Point Conversion Instructions
Bits | Field Name/Value | Description | Comment
Imm[1:0] | RC=00B | Round to nearest even | If Imm[2] = 0
Imm[1:0] | RC=01B | Round down | If Imm[2] = 0
Imm[1:0] | RC=10B | Round up | If Imm[2] = 0
Imm[1:0] | RC=11B | Truncate | If Imm[2] = 0
Imm[2] | MS1=0 | Use imm[1:0] for rounding | Ignore MXCSR.RC
Imm[2] | MS1=1 | Use MXCSR.RC for rounding |
Imm[7:3] | Ignored | Ignored by processor |
VEX.128 version: The source operand is an XMM register. The destination operand is an XMM register or a 64-bit memory location. If the destination operand is a register, the upper bits (MAXVL-1:64) of the corresponding register are zeroed.
VEX.256 version: The source operand is a YMM register. The destination operand is an XMM register or a 128-bit memory location. If the destination operand is a register, the upper bits (MAXVL-1:128) of the corresponding destination register are zeroed.
Note: VEX.vvvv and EVEX.vvvv are reserved (must be 1111b).
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register. The destination operand is a
YMM/XMM/XMM (low 64-bits) register or a 256/128/64-bit memory location, conditionally updated with writemask
k1. Bits (MAXVL-1:256/128/64) of the corresponding destination register are zeroed.
Operation
vCvt_s2h(SRC1[31:0])
{
IF Imm[2] = 0
THEN ; using Imm[1:0] for rounding control, see Table 5-3
RETURN Cvt_Single_Precision_To_Half_Precision_FP_Imm(SRC1[31:0]);
ELSE ; using MXCSR.RC for rounding control
RETURN Cvt_Single_Precision_To_Half_Precision_FP_Mxcsr(SRC1[31:0]);
FI;
}
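For reference only (not part of the SDM text), the widely available F16C intrinsic _mm_cvtps_ph exposes the imm8 rounding control of Table 5-3; the sketch below selects round-to-nearest-even through the immediate rather than MXCSR.RC. The compile flag named in the comment is an assumption about the toolchain.

/* Sketch: VCVTPS2PH imm8 rounding control via F16C intrinsics (e.g., compile with -mf16c). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 src = _mm_set_ps(3.14159f, -0.5f, 65504.0f, 1.0f);
    /* Imm[2] = 0 selects Imm[1:0]; 00B = round to nearest even (see Table 5-3). */
    __m128i half = _mm_cvtps_ph(src, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    __m128 back = _mm_cvtph_ps(half);   /* widen the four FP16 results back to FP32 */
    float out[4];
    _mm_storeu_ps(out, back);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}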
Flags Affected
None.
Other Exceptions
VEX-encoded instructions, see Table 2-26, “Type 11 Class Exception Conditions” (do not report #AC);
EVEX-encoded instructions, see Table 2-62, “Type E11 Class Exception Conditions.”
Additionally:
#UD If VEX.W=1.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed single precision floating-point values in the source operand to FP16 values and stores them to the destination operand.
The VCVTPS2PHX instruction supports broadcasting.
This instruction uses MXCSR.DAZ for handling FP32 inputs. FP16 outputs can be normal or denormal numbers, and
are not conditionally flushed based on MXCSR settings.
Operation
VCVTPS2PHX DEST, SRC (AVX512_FP16 Load Version With Broadcast Support)
VL = 128, 256, or 512
KL := VL / 32
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp32[0]
ELSE
tsrc := SRC.fp32[j]
DEST.fp16[j] := Convert_fp32_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL/2] := 0
Flags Affected
None.
Other Exceptions
EVEX-encoded instructions, see Table 2-46, “Type E2 Class Exception Conditions.”
Additionally:
#UD If VEX.W=1.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
VCVTPS2QQ—Convert Packed Single Precision Floating-Point Values to Packed Signed
Quadword Integer Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W0 7B /r VCVTPS2QQ xmm1 {k1}{z}, xmm2/m64/m32bcst | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1) | Convert two packed single precision floating-point values from xmm2/m64/m32bcst to two packed signed quadword values in xmm1 subject to writemask k1.
EVEX.256.66.0F.W0 7B /r VCVTPS2QQ ymm1 {k1}{z}, xmm2/m128/m32bcst | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1) | Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed signed quadword values in ymm1 subject to writemask k1.
EVEX.512.66.0F.W0 7B /r VCVTPS2QQ zmm1 {k1}{z}, ymm2/m256/m32bcst {er} | A | V/V | AVX512DQ OR AVX10.1 (Note 1) | Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed signed quadword values in zmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts eight packed single precision floating-point values in the source operand to eight signed quadword inte-
gers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.
The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTPS2QQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Single_Precision_To_QuadInteger(SRC[k+31:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTPS2QQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_Single_Precision_To_QuadInteger(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_Single_Precision_To_QuadInteger(SRC[k+31:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTPS2UDQ—Convert Packed Single Precision Floating-Point Values to Packed Unsigned
Doubleword Integer Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.0F.W0 79 /r VCVTPS2UDQ xmm1 {k1}{z}, xmm2/m128/m32bcst | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed unsigned doubleword values in xmm1 subject to writemask k1.
EVEX.256.0F.W0 79 /r VCVTPS2UDQ ymm1 {k1}{z}, ymm2/m256/m32bcst | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed unsigned doubleword values in ymm1 subject to writemask k1.
EVEX.512.0F.W0 79 /r VCVTPS2UDQ zmm1 {k1}{z}, zmm2/m512/m32bcst {er} | A | V/V | AVX512F OR AVX10.1 (Note 1) | Convert sixteen packed single precision floating-point values from zmm2/m512/m32bcst to sixteen packed unsigned doubleword values in zmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts sixteen packed single precision floating-point values in the source operand to sixteen unsigned double-
word integers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w - 1 is returned, where w represents the number of bits in the destination format.
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation
VCVTPS2UDQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_UInteger(SRC[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTPS2UDQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_UInteger(SRC[31:0])
ELSE
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_UInteger(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VCVTPS2UDQ __m512i _mm512_cvtps_epu32( __m512 a);
VCVTPS2UDQ __m512i _mm512_mask_cvtps_epu32( __m512i s, __mmask16 k, __m512 a);
VCVTPS2UDQ __m512i _mm512_maskz_cvtps_epu32( __mmask16 k, __m512 a);
VCVTPS2UDQ __m512i _mm512_cvt_roundps_epu32( __m512 a, int r);
VCVTPS2UDQ __m512i _mm512_mask_cvt_roundps_epu32( __m512i s, __mmask16 k, __m512 a, int r);
VCVTPS2UDQ __m512i _mm512_maskz_cvt_roundps_epu32( __mmask16 k, __m512 a, int r);
VCVTPS2UDQ __m256i _mm256_cvtps_epu32( __m256 a);
VCVTPS2UDQ __m256i _mm256_mask_cvtps_epu32( __m256i s, __mmask8 k, __m256 a);
VCVTPS2UDQ __m256i _mm256_maskz_cvtps_epu32( __mmask8 k, __m256 a);
VCVTPS2UDQ __m128i _mm_cvtps_epu32( __m128 a);
VCVTPS2UDQ __m128i _mm_mask_cvtps_epu32( __m128i s, __mmask8 k, __m128 a);
VCVTPS2UDQ __m128i _mm_maskz_cvtps_epu32( __mmask8 k, __m128 a);
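The short sketch below (illustrative only, values chosen arbitrarily) contrasts the merging and zeroing masked forms listed above; lanes cleared in the mask keep the old destination value in the merging form and become zero in the zeroing form.

/* Sketch: merging vs. zeroing writemask behavior of VCVTPS2UDQ (e.g., compile with -mavx512f). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512 a = _mm512_set1_ps(42.75f);                    /* sixteen FP32 inputs         */
    __m512i old = _mm512_set1_epi32(-1);                  /* prior destination contents  */
    __mmask16 k = 0x00FF;                                 /* convert the low eight lanes */
    __m512i merged = _mm512_mask_cvtps_epu32(old, k, a);  /* masked-off lanes keep old   */
    __m512i zeroed = _mm512_maskz_cvtps_epu32(k, a);      /* masked-off lanes become 0   */
    unsigned int m[16], z[16];
    _mm512_storeu_si512((__m512i *)m, merged);
    _mm512_storeu_si512((__m512i *)z, zeroed);
    printf("lane 0: %u %u  lane 15: %u %u\n", m[0], z[0], m[15], z[15]);
    return 0;
}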
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTPS2UQQ—Convert Packed Single Precision Floating-Point Values to Packed Unsigned
Quadword Integer Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W0 79 /r VCVTPS2UQQ xmm1 {k1}{z}, xmm2/m64/m32bcst | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1) | Convert two packed single precision floating-point values from xmm2/m64/m32bcst to two packed unsigned quadword values in xmm1 subject to writemask k1.
EVEX.256.66.0F.W0 79 /r VCVTPS2UQQ ymm1 {k1}{z}, xmm2/m128/m32bcst | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1) | Convert four packed single precision floating-point values from xmm2/m128/m32bcst to four packed unsigned quadword values in ymm1 subject to writemask k1.
EVEX.512.66.0F.W0 79 /r VCVTPS2UQQ zmm1 {k1}{z}, ymm2/m256/m32bcst {er} | A | V/V | AVX512DQ OR AVX10.1 (Note 1) | Convert eight packed single precision floating-point values from ymm2/m256/m32bcst to eight packed unsigned quadword values in zmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts up to eight packed single precision floating-point values in the source operand to unsigned quadword
integers in the destination operand.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w - 1 is returned, where w represents the number of bits in the destination format.
The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation
VCVTPS2UQQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Single_Precision_To_UQuadInteger(SRC[k+31:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTPS2UQQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_Single_Precision_To_UQuadInteger(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_Single_Precision_To_UQuadInteger(SRC[k+31:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTQQ2PD—Convert Packed Quadword Integers to Packed Double Precision Floating-Point
Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F3.0F.W1 E6 /r VCVTQQ2PD xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1) | Convert two packed quadword integers from xmm2/m128/m64bcst to packed double precision floating-point values in xmm1 with writemask k1.
EVEX.256.F3.0F.W1 E6 /r VCVTQQ2PD ymm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1) | Convert four packed quadword integers from ymm2/m256/m64bcst to packed double precision floating-point values in ymm1 with writemask k1.
EVEX.512.F3.0F.W1 E6 /r VCVTQQ2PD zmm1 {k1}{z}, zmm2/m512/m64bcst {er} | A | V/V | AVX512DQ OR AVX10.1 (Note 1) | Convert eight packed quadword integers from zmm2/m512/m64bcst to eight packed double precision floating-point values in zmm1 with writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts packed quadword integers in the source operand (second operand) to packed double precision floating-
point values in the destination operand (first operand).
The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation
VCVTQQ2PD (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_QuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
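As a usage illustration (not from the SDM), the 512-bit form of this conversion is commonly exposed by compilers as _mm512_cvtepi64_pd; the intrinsic name and compile flag below are assumptions about the toolchain.

/* Sketch: eight signed quadwords converted to eight FP64 values (e.g., compile with -mavx512dq). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512i q = _mm512_set1_epi64(1234567890123LL);   /* eight signed quadword integers */
    __m512d d = _mm512_cvtepi64_pd(q);                 /* rounded per MXCSR.RC           */
    double out[8];
    _mm512_storeu_pd(out, d);
    printf("%f\n", out[0]);
    return 0;
}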
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTQQ2PH—Convert Packed Signed Quadword Integers to Packed FP16 Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.NP.MAP5.W1 5B /r VCVTQQ2PH xmm1{k1}{z}, xmm2/m128/m64bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1) | Convert two packed signed quadword integers in xmm2/m128/m64bcst to packed FP16 values, and store the result in xmm1 subject to writemask k1.
EVEX.256.NP.MAP5.W1 5B /r VCVTQQ2PH xmm1{k1}{z}, ymm2/m256/m64bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1) | Convert four packed signed quadword integers in ymm2/m256/m64bcst to packed FP16 values, and store the result in xmm1 subject to writemask k1.
EVEX.512.NP.MAP5.W1 5B /r VCVTQQ2PH xmm1{k1}{z}, zmm2/m512/m64bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1 (Note 1) | Convert eight packed signed quadword integers in zmm2/m512/m64bcst to packed FP16 values, and store the result in xmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed signed quadword integers in the source operand to packed FP16 values in the destination operand. The destination elements are updated according to the writemask.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
If the conversion result overflows and MXCSR.OM=0, a SIMD exception will be raised with OE=1 and PE=1.
Operation
VCVTQQ2PH DEST, SRC
VL = 128, 256 or 512
KL := VL / 64
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.qword[0]
ELSE
tsrc := SRC.qword[j]
DEST.fp16[j] := Convert_integer64_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL/4] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VCVTQQ2PS—Convert Packed Quadword Integers to Packed Single Precision Floating-Point
Values
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.0F.W1 5B /r VCVTQQ2PS xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1) | Convert two packed quadword integers from xmm2/mem to packed single precision floating-point values in xmm1 with writemask k1.
EVEX.256.0F.W1 5B /r VCVTQQ2PS xmm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1) | Convert four packed quadword integers from ymm2/mem to packed single precision floating-point values in xmm1 with writemask k1.
EVEX.512.0F.W1 5B /r VCVTQQ2PS ymm1 {k1}{z}, zmm2/m512/m64bcst {er} | A | V/V | AVX512DQ OR AVX10.1 (Note 1) | Convert eight packed quadword integers from zmm2/mem to eight packed single precision floating-point values in ymm1 with writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts packed quadword integers in the source operand (second operand) to packed single precision floating-
point values in the destination operand (first operand).
The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a YMM/XMM/XMM (lower 64 bits) register conditionally updated with writemask k1.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation
VCVTQQ2PS (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[k+31:k] :=
Convert_QuadInteger_To_Single_Precision_Floating_Point(SRC[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[k+31:k] remains unchanged*
ELSE ; zeroing-masking
DEST[k+31:k] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
VCVTQQ2PS (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[k+31:k] :=
Convert_QuadInteger_To_Single_Precision_Floating_Point(SRC[63:0])
ELSE
DEST[k+31:k] :=
Convert_QuadInteger_To_Single_Precision_Floating_Point(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[k+31:k] remains unchanged*
ELSE ; zeroing-masking
DEST[k+31:k] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
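As a usage illustration (not from the SDM), the 512-bit form narrows eight quadwords into a 256-bit single precision result; it is commonly exposed as _mm512_cvtepi64_ps, an assumption about the compiler's intrinsic naming.

/* Sketch: eight signed quadwords to eight FP32 values in a YMM-sized result (e.g., -mavx512dq). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512i q = _mm512_set1_epi64(-7);
    __m256 s = _mm512_cvtepi64_ps(q);    /* result occupies the lower 256 bits */
    float out[8];
    _mm256_storeu_ps(out, s);
    printf("%f\n", out[0]);
    return 0;
}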
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTSD2SH—Convert Low FP64 Value to an FP16 Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.LLIG.F2.MAP5.W1 5A /r VCVTSD2SH xmm1{k1}{z}, xmm2, xmm3/m64 {er} | A | V/V | AVX512-FP16 OR AVX10.1 (Note 1) | Convert the low FP64 value in xmm3/m64 to an FP16 value and store the result in the low element of xmm1 subject to writemask k1. Bits 127:16 of xmm2 are copied to xmm1[127:16].
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts the low FP64 value in the second source operand to an FP16 value, and stores the result in
the low element of the destination operand.
When the conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
Operation
VCVTSD2SH dest, src1, src2
IF *SRC2 is a register* and (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)
IF k1[0] OR *no writemask*:
DEST.fp16[0] := Convert_fp64_to_fp16(SRC2.fp64[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else dest.fp16[0] remains unchanged
DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
2. EVEX.W1 in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
Description
Converts a double precision floating-point value in the source operand (the second operand) to an unsigned
doubleword integer in the destination operand (the first operand). The source operand can be an XMM register or
a 64-bit memory location. The destination operand is a general-purpose register. When the source operand is an
XMM register, the double precision floating-point value is contained in the low quadword of the register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w - 1 is returned, where w represents the number of bits in the destination format.
Operation
VCVTSD2USI (EVEX Encoded Version)
IF (SRC *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-Bit Mode and OperandSize = 64
THEN DEST[63:0] := Convert_Double_Precision_Floating_Point_To_UInteger(SRC[63:0]);
ELSE DEST[31:0] := Convert_Double_Precision_Floating_Point_To_UInteger(SRC[63:0]);
FI
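As a usage illustration (not from the SDM), this scalar conversion is commonly exposed as _mm_cvtsd_u32 and, in 64-bit mode, _mm_cvtsd_u64; the intrinsic names and compile flag are assumptions about the toolchain.

/* Sketch: scalar FP64 to unsigned integer conversion (e.g., compile with -mavx512f on a 64-bit target). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d x = _mm_set_sd(3000000000.5);        /* value held in the low FP64 element */
    unsigned int u32 = _mm_cvtsd_u32(x);         /* rounded according to MXCSR.RC      */
    unsigned long long u64 = _mm_cvtsd_u64(x);   /* 64-bit operand-size form           */
    printf("%u %llu\n", u32, u64);
    return 0;
}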
SIMD Floating-Point Exceptions
Invalid, Precision.
Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
VCVTSH2SD—Convert Low FP16 Value to an FP64 Value
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.LLIG.F3.MAP5.W0 5A /r VCVTSH2SD xmm1{k1}{z}, xmm2, xmm3/m16 {sae} | A | V/V | AVX512-FP16 OR AVX10.1 (Note 1) | Convert the low FP16 value in xmm3/m16 to an FP64 value and store the result in the low element of xmm1 subject to writemask k1. Bits 127:64 of xmm2 are copied to xmm1[127:64].
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts the low FP16 element in the second source operand to an FP64 value and stores the result in the low element of the destination operand.
Bits 127:64 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP64 element of the destination is updated according
to the writemask.
Operation
VCVTSH2SD dest, src1, src2
IF k1[0] OR *no writemask*:
DEST.fp64[0] := Convert_fp16_to_fp64(SRC2.fp16[0])
ELSE IF *zeroing*:
DEST.fp64[0] := 0
// else dest.fp64[0] remains unchanged
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. Outside of 64-bit mode, the EVEX.W field is ignored. The instruction behaves as if W=0 was used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts the low FP16 element in the source operand to a signed integer in the destination general
purpose register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the integer indefinite value is
returned.
Operation
VCVTSH2SI dest, src
IF *SRC is a register* and (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)
Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts the low FP16 element in the second source operand to the low FP32 element of the desti-
nation operand.
Bits 127:32 of the destination operand are copied from the corresponding bits of the first source operand. Bits MAXVL-1:128 of the destination operand are zeroed. The low FP32 element of the destination is updated according to the writemask.
Operation
VCVTSH2SS dest, src1, src2
IF k1[0] OR *no writemask*:
DEST.fp32[0] := Convert_fp16_to_fp32(SRC2.fp16[0])
ELSE IF *zeroing*:
DEST.fp32[0] := 0
// else dest.fp32[0] remains unchanged
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. Outside of 64-bit mode, the EVEX.W field is ignored. The instruction behaves as if W=0 was used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts the low FP16 element in the source operand to an unsigned integer in the destination
general purpose register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register or the embedded rounding control bits. If a converted result cannot be represented in the destination
format, the floating-point invalid exception is raised, and if this exception is masked, the integer indefinite value is
returned.
Operation
VCVTSH2USI dest, src
// SET_RM() sets the rounding mode used for this instruction.
IF *SRC is a register* and (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)
Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
NOTES:
1. Outside of 64-bit mode, the EVEX.W field is ignored. The instruction behaves as if W=0 was used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts a signed doubleword integer (or signed quadword integer if operand size is 64 bits) in the
second source operand to an FP16 value in the destination operand. The result is stored in the low word of the desti-
nation operand. When conversion is inexact, the value returned is rounded according to the rounding control bits
in the MXCSR register or embedded rounding controls.
The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and
destination operands are XMM registers. Bits 127:16 of the XMM register destination are copied from corre-
sponding bits in the first source operand. Bits MAXVL-1:128 of the destination register are zeroed.
If the conversion result overflows and MXCSR.OM=0, a SIMD exception will be raised with OE=1 and PE=1.
Operation
VCVTSI2SH dest, src1, src2
IF *SRC2 is a register* and (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)
DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts the low FP32 value in the second source operand to an FP16 value in the low element of the destination operand.
When the conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR
register.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
Operation
VCVTSS2SH dest, src1, src2
IF *SRC2 is a register* and (EVEX.b = 1):
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)
IF k1[0] OR *no writemask*:
DEST.fp16[0] := Convert_fp32_to_fp16(SRC2.fp32[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else dest.fp16[0] remains unchanged
DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
2. EVEX.W1 in non-64-bit mode is ignored; the instruction behaves as if the W0 version is used.
Description
Converts a single precision floating-point value in the source operand (the second operand) to an unsigned double-
word integer (or unsigned quadword integer if operand size is 64 bits) in the destination operand (the first
operand). The source operand can be an XMM register or a memory location. The destination operand is a general-
purpose register. When the source operand is an XMM register, the single precision floating-point value is contained
in the low doubleword of the register.
When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w - 1 is returned, where w represents the number of bits in the destination format.
VEX.W1 and EVEX.W1 versions: promotes the instruction to produce 64-bit data in 64-bit mode.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTSS2USI (EVEX Encoded Version)
IF (SRC *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-bit Mode and OperandSize = 64
THEN
DEST[63:0] := Convert_Single_Precision_Floating_Point_To_UInteger(SRC[31:0]);
ELSE
DEST[31:0] := Convert_Single_Precision_Floating_Point_To_UInteger(SRC[31:0]);
FI;
Intel C/C++ Compiler Intrinsic Equivalent
VCVTSS2USI unsigned _mm_cvtss_u32( __m128 a);
VCVTSS2USI unsigned _mm_cvt_roundss_u32( __m128 a, int r);
VCVTSS2USI unsigned __int64 _mm_cvtss_u64( __m128 a);
VCVTSS2USI unsigned __int64 _mm_cvt_roundss_u64( __m128 a, int r);
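A short usage sketch of the intrinsics listed above (illustrative only); the second call overrides MXCSR.RC with an embedded round-up control.

/* Sketch: scalar FP32 to unsigned doubleword with default and explicit rounding (e.g., -mavx512f). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 x = _mm_set_ss(2.5f);
    unsigned nearest = _mm_cvtss_u32(x);   /* uses MXCSR.RC, round to nearest even by default */
    unsigned up = _mm_cvt_roundss_u32(x, _MM_FROUND_TO_POS_INF | _MM_FROUND_NO_EXC);
    printf("%u %u\n", nearest, up);        /* expected 2 and 3 under the default MXCSR state  */
    return 0;
}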
Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
VCVTTPD2QQ—Convert With Truncation Packed Double Precision Floating-Point Values to
Packed Quadword Integers
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W1 7A /r VCVTTPD2QQ xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1) | Convert two packed double precision floating-point values from xmm2/m128/m64bcst to two packed quadword integers in xmm1 using truncation with writemask k1.
EVEX.256.66.0F.W1 7A /r VCVTTPD2QQ ymm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1) | Convert four packed double precision floating-point values from ymm2/m256/m64bcst to four packed quadword integers in ymm1 using truncation with writemask k1.
EVEX.512.66.0F.W1 7A /r VCVTTPD2QQ zmm1 {k1}{z}, zmm2/m512/m64bcst {sae} | A | V/V | AVX512DQ OR AVX10.1 (Note 1) | Convert eight packed double precision floating-point values from zmm2/m512/m64bcst to eight packed quadword integers in zmm1 using truncation with writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts with truncation packed double precision floating-point values in the source operand (second operand) to
packed quadword integers in the destination operand (first operand).
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTTPD2QQ (EVEX Encoded Version) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_QuadInteger_Truncate(SRC[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTTPD2QQ (EVEX Encoded Version) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] := Convert_Double_Precision_Floating_Point_To_QuadInteger_Truncate(SRC[63:0])
ELSE
DEST[i+63:i] := Convert_Double_Precision_Floating_Point_To_QuadInteger_Truncate(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
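As a usage illustration (not from the SDM), the truncating conversion is commonly exposed as _mm512_cvttpd_epi64; the intrinsic name and compile flag are assumptions about the toolchain.

/* Sketch: truncating FP64 to signed quadword conversion (e.g., compile with -mavx512dq). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512d d = _mm512_set1_pd(-2.9);
    __m512i q = _mm512_cvttpd_epi64(d);     /* truncation: -2.9 becomes -2 */
    long long out[8];
    _mm512_storeu_si512((__m512i *)out, q);
    printf("%lld\n", out[0]);
    return 0;
}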
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTTPD2UDQ—Convert With Truncation Packed Double Precision Floating-Point Values to
Packed Unsigned Doubleword Integers
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.0F.W1 78 /r VCVTTPD2UDQ xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Convert two packed double precision floating-point values in xmm2/m128/m64bcst to two unsigned doubleword integers in xmm1 using truncation subject to writemask k1.
EVEX.256.0F.W1 78 /r VCVTTPD2UDQ xmm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Convert four packed double precision floating-point values in ymm2/m256/m64bcst to four unsigned doubleword integers in xmm1 using truncation subject to writemask k1.
EVEX.512.0F.W1 78 /r VCVTTPD2UDQ ymm1 {k1}{z}, zmm2/m512/m64bcst {sae} | A | V/V | AVX512F OR AVX10.1 (Note 1) | Convert eight packed double precision floating-point values in zmm2/m512/m64bcst to eight unsigned doubleword integers in ymm1 using truncation subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts with truncation packed double precision floating-point values in the source operand (the second operand)
to packed unsigned doubleword integers in the destination operand (the first operand).
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w - 1 is returned, where w represents the number of bits in the destination format.
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a YMM/XMM/XMM (low 64 bits) register
conditionally updated with writemask k1. The upper bits (MAXVL-1:256) of the corresponding destination are
zeroed.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTTPD2UDQ (EVEX Encoded Versions) When SRC2 Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[k+63:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
VCVTTPD2UDQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[63:0])
ELSE
DEST[i+31:i] :=
Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[k+63:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTTPD2UQQ—Convert With Truncation Packed Double Precision Floating-Point Values to
Packed Unsigned Quadword Integers
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F.W1 78 /r VCVTTPD2UQQ xmm1 {k1}{z}, xmm2/m128/m64bcst | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1) | Convert two packed double precision floating-point values from xmm2/m128/m64bcst to two packed unsigned quadword integers in xmm1 using truncation with writemask k1.
EVEX.256.66.0F.W1 78 /r VCVTTPD2UQQ ymm1 {k1}{z}, ymm2/m256/m64bcst | A | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1) | Convert four packed double precision floating-point values from ymm2/m256/m64bcst to four packed unsigned quadword integers in ymm1 using truncation with writemask k1.
EVEX.512.66.0F.W1 78 /r VCVTTPD2UQQ zmm1 {k1}{z}, zmm2/m512/m64bcst {sae} | A | V/V | AVX512DQ OR AVX10.1 (Note 1) | Convert eight packed double precision floating-point values from zmm2/mem to eight packed unsigned quadword integers in zmm1 using truncation with writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts with truncation packed double precision floating-point values in the source operand (second operand) to
packed unsigned quadword integers in the destination operand (first operand).
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w - 1 is returned, where w represents the number of bits in the destination format.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTTPD2UQQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_UQuadInteger_Truncate(SRC[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTTPD2UQQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_UQuadInteger_Truncate(SRC[63:0])
ELSE
DEST[i+63:i] :=
Convert_Double_Precision_Floating_Point_To_UQuadInteger_Truncate(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTTPH2DQ—Convert with Truncation Packed FP16 Values to Signed Doubleword Integers
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.F3.MAP5.W0 5B /r VCVTTPH2DQ xmm1{k1}{z}, xmm2/m64/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1) | Convert four packed FP16 values in xmm2/m64/m16bcst to four signed doubleword integers, and store the result in xmm1 using truncation subject to writemask k1.
EVEX.256.F3.MAP5.W0 5B /r VCVTTPH2DQ ymm1{k1}{z}, xmm2/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1) | Convert eight packed FP16 values in xmm2/m128/m16bcst to eight signed doubleword integers, and store the result in ymm1 using truncation subject to writemask k1.
EVEX.512.F3.MAP5.W0 5B /r VCVTTPH2DQ zmm1{k1}{z}, ymm2/m256/m16bcst {sae} | A | V/V | AVX512-FP16 OR AVX10.1 (Note 1) | Convert sixteen packed FP16 values in ymm2/m256/m16bcst to sixteen signed doubleword integers, and store the result in zmm1 using truncation subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed FP16 values in the source operand to signed doubleword integers in the destination operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result is larger than
the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is
masked, the indefinite integer value is returned.
The destination elements are updated according to the writemask.
Operation
VCVTTPH2DQ dest, src
VL = 128, 256 or 512
KL := VL / 32
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.dword[j] := Convert_fp16_to_integer32_truncate(tsrc)
ELSE IF *zeroing*:
DEST.dword[j] := 0
// else dest.dword[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPH2DQ __m512i _mm512_cvtt_roundph_epi32 (__m256h a, int sae);
VCVTTPH2DQ __m512i _mm512_mask_cvtt_roundph_epi32 (__m512i src, __mmask16 k, __m256h a, int sae);
VCVTTPH2DQ __m512i _mm512_maskz_cvtt_roundph_epi32 (__mmask16 k, __m256h a, int sae);
VCVTTPH2DQ __m128i _mm_cvttph_epi32 (__m128h a);
VCVTTPH2DQ __m128i _mm_mask_cvttph_epi32 (__m128i src, __mmask8 k, __m128h a);
VCVTTPH2DQ __m128i _mm_maskz_cvttph_epi32 (__mmask8 k, __m128h a);
VCVTTPH2DQ __m256i _mm256_cvttph_epi32 (__m128h a);
VCVTTPH2DQ __m256i _mm256_mask_cvttph_epi32 (__m256i src, __mmask8 k, __m128h a);
VCVTTPH2DQ __m256i _mm256_maskz_cvttph_epi32 (__mmask8 k, __m128h a);
VCVTTPH2DQ __m512i _mm512_cvttph_epi32 (__m256h a);
VCVTTPH2DQ __m512i _mm512_mask_cvttph_epi32 (__m512i src, __mmask16 k, __m256h a);
VCVTTPH2DQ __m512i _mm512_maskz_cvttph_epi32 (__mmask16 k, __m256h a);
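A short usage sketch of the 128-bit form listed above (illustrative only); the _Float16 type, the _mm_set1_ph helper, and the compile flags are assumptions about the toolchain.

/* Sketch: truncating FP16 to signed doubleword conversion (e.g., -mavx512fp16 -mavx512vl). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128h h = _mm_set1_ph((_Float16)(-3.75f));   /* the low four FP16 lanes are converted */
    __m128i d = _mm_cvttph_epi32(h);               /* truncation: -3.75 becomes -3          */
    int out[4];
    _mm_storeu_si128((__m128i *)out, d);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;
}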
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VCVTTPH2QQ—Convert with Truncation Packed FP16 Values to Signed Quadword Integers
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.MAP5.W0 7A /r VCVTTPH2QQ xmm1{k1}{z}, xmm2/m32/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1) | Convert two packed FP16 values in xmm2/m32/m16bcst to two signed quadword integers, and store the result in xmm1 using truncation subject to writemask k1.
EVEX.256.66.MAP5.W0 7A /r VCVTTPH2QQ ymm1{k1}{z}, xmm2/m64/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1) | Convert four packed FP16 values in xmm2/m64/m16bcst to four signed quadword integers, and store the result in ymm1 using truncation subject to writemask k1.
EVEX.512.66.MAP5.W0 7A /r VCVTTPH2QQ zmm1{k1}{z}, xmm2/m128/m16bcst {sae} | A | V/V | AVX512-FP16 OR AVX10.1 (Note 1) | Convert eight packed FP16 values in xmm2/m128/m16bcst to eight signed quadword integers, and store the result in zmm1 using truncation subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed FP16 values in the source operand to signed quadword integers in the destination
operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the indefinite integer value is returned.
The destination elements are updated according to the writemask.
Operation
VCVTTPH2QQ dest, src
VL = 128, 256 or 512
KL := VL / 64
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.qword[j] := Convert_fp16_to_integer64_truncate(tsrc)
ELSE IF *zeroing*:
DEST.qword[j] := 0
// else dest.qword[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPH2QQ __m512i _mm512_cvtt_roundph_epi64 (__m128h a, int sae);
VCVTTPH2QQ __m512i _mm512_mask_cvtt_roundph_epi64 (__m512i src, __mmask8 k, __m128h a, int sae);
VCVTTPH2QQ __m512i _mm512_maskz_cvtt_roundph_epi64 (__mmask8 k, __m128h a, int sae);
VCVTTPH2QQ __m128i _mm_cvttph_epi64 (__m128h a);
VCVTTPH2QQ __m128i _mm_mask_cvttph_epi64 (__m128i src, __mmask8 k, __m128h a);
VCVTTPH2QQ __m128i _mm_maskz_cvttph_epi64 (__mmask8 k, __m128h a);
VCVTTPH2QQ __m256i _mm256_cvttph_epi64 (__m128h a);
VCVTTPH2QQ __m256i _mm256_mask_cvttph_epi64 (__m256i src, __mmask8 k, __m128h a);
VCVTTPH2QQ __m256i _mm256_maskz_cvttph_epi64 (__mmask8 k, __m128h a);
VCVTTPH2QQ __m512i _mm512_cvttph_epi64 (__m128h a);
VCVTTPH2QQ __m512i _mm512_mask_cvttph_epi64 (__m512i src, __mmask8 k, __m128h a);
VCVTTPH2QQ __m512i _mm512_maskz_cvttph_epi64 (__mmask8 k, __m128h a);
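A short usage sketch of the 512-bit form listed above (illustrative only); the _Float16 type, _mm_set1_ph, and the compile flag are assumptions about the toolchain.

/* Sketch: eight FP16 values widened to eight signed quadwords with truncation (e.g., -mavx512fp16). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128h h = _mm_set1_ph((_Float16)65504.0f);   /* largest finite FP16 value     */
    __m512i q = _mm512_cvttph_epi64(h);            /* eight signed quadword results */
    long long out[8];
    _mm512_storeu_si512((__m512i *)out, q);
    printf("%lld\n", out[0]);
    return 0;
}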
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VCVTTPH2UDQ—Convert with Truncation Packed FP16 Values to Unsigned Doubleword
Integers
Opcode/Instruction | Op/En | 64/32 bit Mode Support | CPUID Feature Flag | Description
EVEX.128.NP.MAP5.W0 78 /r VCVTTPH2UDQ xmm1{k1}{z}, xmm2/m64/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1) | Convert four packed FP16 values in xmm2/m64/m16bcst to four unsigned doubleword integers, and store the result in xmm1 using truncation subject to writemask k1.
EVEX.256.NP.MAP5.W0 78 /r VCVTTPH2UDQ ymm1{k1}{z}, xmm2/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1) | Convert eight packed FP16 values in xmm2/m128/m16bcst to eight unsigned doubleword integers, and store the result in ymm1 using truncation subject to writemask k1.
EVEX.512.NP.MAP5.W0 78 /r VCVTTPH2UDQ zmm1{k1}{z}, ymm2/m256/m16bcst {sae} | A | V/V | AVX512-FP16 OR AVX10.1 (Note 1) | Convert sixteen packed FP16 values in ymm2/m256/m16bcst to sixteen unsigned doubleword integers, and store the result in zmm1 using truncation subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed FP16 values in the source operand to unsigned doubleword integers in the destina-
tion operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer indefinite value is returned.
The destination elements are updated according to the writemask.
Operation
VCVTTPH2UDQ dest, src
VL = 128, 256 or 512
KL := VL / 32
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.dword[j] := Convert_fp16_to_unsigned_integer32_truncate(tsrc)
ELSE IF *zeroing*:
DEST.dword[j] := 0
// else dest.dword[j] remains unchanged
DEST[MAXVL-1:VL] := 0
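The Operation section above can be read as the following informal scalar model in C (not part of the SDM text); a float stands in for the FP16 element type, and the invalid/indefinite cases described in the text are intentionally not modeled.

#include <stdint.h>
#include <math.h>

/* Scalar model of the masking/broadcast structure above. Only in-range,
   non-negative inputs are modeled; all names are illustrative. */
void vcvttph2udq_model(uint32_t *dest, const float *src, uint16_t k1,
                       int kl /* = VL/32 */, int bcast /* EVEX.b with a memory source */,
                       int zeroing) {
    for (int j = 0; j < kl; j++) {
        if (k1 & (1u << j)) {
            float t = bcast ? src[0] : src[j];   /* broadcast reuses element 0 */
            dest[j] = (uint32_t)truncf(t);       /* truncation (round toward zero) */
        } else if (zeroing) {
            dest[j] = 0;                         /* zeroing-masking */
        }                                        /* otherwise merging-masking: unchanged */
    }
}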
Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPH2UDQ __m512i _mm512_cvtt_roundph_epu32 (__m256h a, int sae);
VCVTTPH2UDQ __m512i _mm512_mask_cvtt_roundph_epu32 (__m512i src, __mmask16 k, __m256h a, int sae);
VCVTTPH2UDQ __m512i _mm512_maskz_cvtt_roundph_epu32 (__mmask16 k, __m256h a, int sae);
VCVTTPH2UDQ __m128i _mm_cvttph_epu32 (__m128h a);
VCVTTPH2UDQ __m128i _mm_mask_cvttph_epu32 (__m128i src, __mmask8 k, __m128h a);
VCVTTPH2UDQ __m128i _mm_maskz_cvttph_epu32 (__mmask8 k, __m128h a);
VCVTTPH2UDQ __m256i _mm256_cvttph_epu32 (__m128h a);
VCVTTPH2UDQ __m256i _mm256_mask_cvttph_epu32 (__m256i src, __mmask8 k, __m128h a);
VCVTTPH2UDQ __m256i _mm256_maskz_cvttph_epu32 (__mmask8 k, __m128h a);
VCVTTPH2UDQ __m512i _mm512_cvttph_epu32 (__m256h a);
VCVTTPH2UDQ __m512i _mm512_mask_cvttph_epu32 (__m512i src, __mmask16 k, __m256h a);
VCVTTPH2UDQ __m512i _mm512_maskz_cvttph_epu32 (__mmask16 k, __m256h a);
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VCVTTPH2UQQ—Convert with Truncation Packed FP16 Values to Unsigned Quadword Integers
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.66.MAP5.W0 78 /r A V/V (AVX512-FP16 Convert two packed FP16 values in
VCVTTPH2UQQ xmm1{k1}{z}, AND AVX512VL) xmm2/m32/m16bcst to two unsigned quadword
xmm2/m32/m16bcst OR AVX10.11 integers, and store the result in xmm1 using
truncation subject to writemask k1.
EVEX.256.66.MAP5.W0 78 /r A V/V (AVX512-FP16 Convert four packed FP16 values in
VCVTTPH2UQQ ymm1{k1}{z}, AND AVX512VL) xmm2/m64/m16bcst to four unsigned quadword
xmm2/m64/m16bcst OR AVX10.11 integers, and store the result in ymm1 using
truncation subject to writemask k1.
EVEX.512.66.MAP5.W0 78 /r A V/V AVX512-FP16 Convert eight packed FP16 values in
VCVTTPH2UQQ zmm1{k1}{z}, OR AVX10.11 xmm2/m128/m16bcst to eight unsigned
xmm2/m128/m16bcst {sae} quadword integers, and store the result in zmm1
using truncation subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed FP16 values in the source operand to unsigned quadword integers in the destina-
tion operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer indefinite value is returned.
The destination elements are updated according to the writemask.
Operation
VCVTTPH2UQQ dest, src
VL = 128, 256 or 512
KL := VL / 64
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.qword[j] := Convert_fp16_to_unsigned_integer64_truncate(tsrc)
ELSE IF *zeroing*:
DEST.qword[j] := 0
// else dest.qword[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VCVTTPH2UQQ __m512i _mm512_cvtt_roundph_epu64 (__m128h a, int sae);
VCVTTPH2UQQ __m512i _mm512_mask_cvtt_roundph_epu64 (__m512i src, __mmask8 k, __m128h a, int sae);
VCVTTPH2UQQ __m512i _mm512_maskz_cvtt_roundph_epu64 (__mmask8 k, __m128h a, int sae);
VCVTTPH2UQQ __m128i _mm_cvttph_epu64 (__m128h a);
VCVTTPH2UQQ __m128i _mm_mask_cvttph_epu64 (__m128i src, __mmask8 k, __m128h a);
VCVTTPH2UQQ __m128i _mm_maskz_cvttph_epu64 (__mmask8 k, __m128h a);
VCVTTPH2UQQ __m256i _mm256_cvttph_epu64 (__m128h a);
VCVTTPH2UQQ __m256i _mm256_mask_cvttph_epu64 (__m256i src, __mmask8 k, __m128h a);
VCVTTPH2UQQ __m256i _mm256_maskz_cvttph_epu64 (__mmask8 k, __m128h a);
VCVTTPH2UQQ __m512i _mm512_cvttph_epu64 (__m128h a);
VCVTTPH2UQQ __m512i _mm512_mask_cvttph_epu64 (__m512i src, __mmask8 k, __m128h a);
VCVTTPH2UQQ __m512i _mm512_maskz_cvttph_epu64 (__mmask8 k, __m128h a);
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VCVTTPH2UW—Convert with Truncation Packed FP16 Values to Unsigned Word Integers
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.NP.MAP5.W0 7C /r A V/V (AVX512-FP16 Convert eight packed FP16 values in
VCVTTPH2UW xmm1{k1}{z}, AND AVX512VL) xmm2/m128/m16bcst to eight unsigned word
xmm2/m128/m16bcst OR AVX10.11 integers, and store the result in xmm1 using
truncation subject to writemask k1.
EVEX.256.NP.MAP5.W0 7C /r A V/V (AVX512-FP16 Convert sixteen packed FP16 values in
VCVTTPH2UW ymm1{k1}{z}, AND AVX512VL) ymm2/m256/m16bcst to sixteen unsigned word
ymm2/m256/m16bcst OR AVX10.11 integers, and store the result in ymm1 using
truncation subject to writemask k1.
EVEX.512.NP.MAP5.W0 7C /r A V/V AVX512-FP16 Convert thirty-two packed FP16 values in
VCVTTPH2UW zmm1{k1}{z}, OR AVX10.11 zmm2/m512/m16bcst to thirty-two unsigned
zmm2/m512/m16bcst {sae} word integers, and store the result in zmm1
using truncation subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed FP16 values in the source operand to unsigned word integers in the destination
operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer indefinite value is returned.
The destination elements are updated according to the writemask.
Operation
VCVTTPH2UW dest, src
VL = 128, 256 or 512
KL := VL / 16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.word[j] := Convert_fp16_to_unsigned_integer16_truncate(tsrc)
ELSE IF *zeroing*:
DEST.word[j] := 0
// else dest.word[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed FP16 values in the source operand to signed word integers in the destination
operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer indefinite value is returned.
The destination elements are updated according to the writemask.
Operation
VCVTTPH2W dest, src
VL = 128, 256 or 512
KL := VL / 16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.fp16[0]
ELSE
tsrc := SRC.fp16[j]
DEST.word[j] := Convert_fp16_to_integer16_truncate(tsrc)
ELSE IF *zeroing*:
DEST.word[j] := 0
// else dest.word[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts with truncation packed single precision floating-point values in the source operand to eight signed quad-
word integers in the destination operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the indefinite integer value (2^(w-1), where w represents the number of bits in the destination format) is returned.
EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit
memory location. The destination operand is a vector register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTTPS2QQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Single_Precision_To_QuadInteger_Truncate(SRC[k+31:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTTPS2QQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_Single_Precision_To_QuadInteger_Truncate(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_Single_Precision_To_QuadInteger_Truncate(SRC[k+31:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
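As an informal illustration (not part of the SDM text) of the masked-invalid behavior described above, a scalar model of the Convert_Single_Precision_To_QuadInteger_Truncate helper might look as follows; the function name is illustrative.

#include <stdint.h>
#include <math.h>

/* NaN or out-of-range inputs return the 64-bit integer indefinite value,
   2^63 (8000000000000000H); everything else truncates toward zero. */
int64_t cvtt_ps_to_qq(float x) {
    if (isnan(x) || x >= 9223372036854775808.0f /* 2^63 */ || x < -9223372036854775808.0f)
        return INT64_MIN;              /* integer indefinite */
    return (int64_t)truncf(x);         /* round toward zero */
}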
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTTPS2UDQ—Convert With Truncation Packed Single Precision Floating-Point Values to
Packed Unsigned Doubleword Integer Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.0F.W0 78 /r A V/V (AVX512VL AND Convert four packed single precision floating-
VCVTTPS2UDQ xmm1 {k1}{z}, AVX512F) OR point values from xmm2/m128/m32bcst to four
xmm2/m128/m32bcst AVX10.11 packed unsigned doubleword values in xmm1
using truncation subject to writemask k1.
EVEX.256.0F.W0 78 /r A V/V (AVX512VL AND Convert eight packed single precision floating-
VCVTTPS2UDQ ymm1 {k1}{z}, AVX512F) OR point values from ymm2/m256/m32bcst to eight
ymm2/m256/m32bcst AVX10.11 packed unsigned doubleword values in ymm1
using truncation subject to writemask k1.
EVEX.512.0F.W0 78 /r A V/V AVX512F Convert sixteen packed single precision floating-
VCVTTPS2UDQ zmm1 {k1}{z}, OR AVX10.11 point values from zmm2/m512/m32bcst to
zmm2/m512/m32bcst {sae} sixteen packed unsigned doubleword values in
zmm1 using truncation subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts with truncation packed single precision floating-point values in the source operand to sixteen unsigned
doubleword integers in the destination operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or
a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a
ZMM/YMM/XMM register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
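As an informal illustration (not part of the SDM text) of the masked-invalid behavior for an unsigned doubleword destination, a scalar model might look as follows; the function name is illustrative.

#include <stdint.h>
#include <math.h>

/* NaN or out-of-range inputs return 2^32 - 1 (FFFFFFFFH); values in (-1, 0)
   truncate to 0 and are not an invalid case. */
uint32_t cvtt_ps_to_udq(float x) {
    if (isnan(x) || x >= 4294967296.0f /* 2^32 */ || x <= -1.0f)
        return UINT32_MAX;             /* 2^w - 1 for w = 32 */
    return (uint32_t)truncf(x);        /* round toward zero */
}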
Operation
VCVTTPS2UDQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTTPS2UDQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[31:0])
ELSE
DEST[i+31:i] :=
Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTTPS2UQQ—Convert With Truncation Packed Single Precision Floating-Point Values to
Packed Unsigned Quadword Integer Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.66.0F.W0 78 /r A V/V (AVX512VL AND Convert two packed single precision floating-point
VCVTTPS2UQQ xmm1 {k1}{z}, AVX512DQ) OR values from xmm2/m64/m32bcst to two packed
xmm2/m64/m32bcst AVX10.11 unsigned quadword values in xmm1 using truncation
subject to writemask k1.
EVEX.256.66.0F.W0 78 /r A V/V (AVX512VL AND Convert four packed single precision floating-point
VCVTTPS2UQQ ymm1 {k1}{z}, AVX512DQ) OR values from xmm2/m128/m32bcst to four packed
xmm2/m128/m32bcst AVX10.11 unsigned quadword values in ymm1 using truncation
subject to writemask k1.
EVEX.512.66.0F.W0 78 /r A V/V AVX512DQ Convert eight packed single precision floating-point
VCVTTPS2UQQ zmm1 {k1}{z}, OR AVX10.11 values from ymm2/m256/m32bcst to eight packed
ymm2/m256/m32bcst {sae} unsigned quadword values in zmm1 using truncation
subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts with truncation up to eight packed single precision floating-point values in the source operand to
unsigned quadword integers in the destination operand.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX encoded versions: The source operand is a YMM/XMM/XMM (low 64 bits) register or a 256/128/64-bit
memory location. The destination operand is a vector register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTTPS2UQQ (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_Single_Precision_To_UQuadInteger_Truncate(SRC[k+31:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTTPS2UQQ (EVEX Encoded Versions) When SRC Operand is a Memory Source
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_Single_Precision_To_UQuadInteger_Truncate(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_Single_Precision_To_UQuadInteger_Truncate(SRC[k+31:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTTSD2USI—Convert With Truncation Scalar Double Precision Floating-Point Value to
Unsigned Integer
Opcode/ Op / 64/32 CPUID Description
Instruction En Bit Mode Feature Flag
Support
EVEX.LLIG.F2.0F.W0 78 /r A V/V AVX512F Convert one double precision floating-point value
VCVTTSD2USI r32, xmm1/m64{sae} OR AVX10.11 from xmm1/m64 to one unsigned doubleword
integer r32 using truncation.
EVEX.LLIG.F2.0F.W1 78 /r A V/N.E.2 AVX512F Convert one double precision floating-point value
VCVTTSD2USI r64, xmm1/m64{sae} OR AVX10.11 from xmm1/m64 to one unsigned quadword
integer zero-extended into r64 using truncation.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
2. For this specific instruction, EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.
Description
Converts with truncation a double precision floating-point value in the source operand (the second operand) to an
unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the destination operand
(the first operand). The source operand can be an XMM register or a 64-bit memory location. The destination
operand is a general-purpose register. When the source operand is an XMM register, the double precision floating-
point value is contained in the low quadword of the register.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX.W1 version: promotes the instruction to produce 64-bit data in 64-bit mode.
Operation
VCVTTSD2USI (EVEX Encoded Version)
IF 64-Bit Mode and OperandSize = 64
THEN DEST[63:0] := Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[63:0]);
ELSE DEST[31:0] := Convert_Double_Precision_Floating_Point_To_UInteger_Truncate(SRC[63:0]);
FI
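An informative usage sketch follows (not part of the SDM text); the intrinsic names _mm_cvttsd_u32 and _mm_cvttsd_u64 are not listed on this page and are given here as the usual AVX-512F names, i.e., as an assumption.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d v = _mm_set_sd(3.99);                 /* double in the low quadword */
    unsigned int       r32 = _mm_cvttsd_u32(v);   /* W0 form: 3.99 -> 3 (truncation) */
    unsigned long long r64 = _mm_cvttsd_u64(v);   /* W1 form, 64-bit mode only */
    printf("%u %llu\n", r32, r64);
    return 0;
}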
Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
VCVTTSH2SI—Convert with Truncation Low FP16 Value to a Signed Integer
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.LLIG.F3.MAP5.W0 2C /r A V/V1 AVX512-FP16 Convert FP16 value in the low element of
VCVTTSH2SI r32, xmm1/m16 {sae} OR AVX10.12 xmm1/m16 to a signed integer and store the
result in r32 using truncation.
EVEX.LLIG.F3.MAP5.W1 2C /r A V/N.E. AVX512-FP16 Convert FP16 value in the low element of
VCVTTSH2SI r64, xmm1/m16 {sae} OR AVX10.12 xmm1/m16 to a signed integer and store the
result in r64 using truncation.
NOTES:
1. Outside of 64b mode, the EVEX.W field is ignored. The instruction behaves as if W=0 was used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts the low FP16 element in the source operand to a signed integer in the destination general
purpose register.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer indefinite value is returned.
Operation
VCVTTSH2SI dest, src
IF 64-mode and OperandSize == 64:
DEST.qword := Convert_fp16_to_integer64_truncate(SRC.fp16[0])
ELSE:
DEST.dword := Convert_fp16_to_integer32_truncate(SRC.fp16[0])
Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
VCVTTSH2USI—Convert with Truncation Low FP16 Value to an Unsigned Integer
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.LLIG.F3.MAP5.W0 78 /r A V/V1 AVX512-FP16 Convert FP16 value in the low element of
VCVTTSH2USI r32, xmm1/m16 {sae} OR AVX10.12 xmm1/m16 to an unsigned integer and store the
result in r32 using truncation.
EVEX.LLIG.F3.MAP5.W1 78 /r A V/N.E. AVX512-FP16 Convert FP16 value in the low element of
VCVTTSH2USI r64, xmm1/m16 {sae} OR AVX10.12 xmm1/m16 to an unsigned integer and store the
result in r64 using truncation.
NOTES:
1. Outside of 64b mode, the EVEX.W field is ignored. The instruction behaves as if W=0 was used.
2. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts the low FP16 element in the source operand to an unsigned integer in the destination
general purpose register.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer indefinite value is returned.
Operation
VCVTTSH2USI dest, src
IF 64-mode and OperandSize == 64:
DEST.qword := Convert_fp16_to_unsigned_integer64_truncate(SRC.fp16[0])
ELSE:
DEST.dword := Convert_fp16_to_unsigned_integer32_truncate(SRC.fp16[0])
Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
VCVTTSS2USI—Convert With Truncation Scalar Single Precision Floating-Point Value to
Unsigned Integer
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.LLIG.F3.0F.W0 78 /r A V/V AVX512F Convert one single precision floating-point value
VCVTTSS2USI r32, xmm1/m32{sae} OR AVX10.11 from xmm1/m32 to one unsigned doubleword
integer in r32 using truncation.
EVEX.LLIG.F3.0F.W1 78 /r A V/N.E.2 AVX512F Convert one single precision floating-point value
VCVTTSS2USI r64, xmm1/m32{sae} OR AVX10.11 from xmm1/m32 to one unsigned quadword
integer in r64 using truncation.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
2. For this specific instruction, EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.
Description
Converts with truncation a single precision floating-point value in the source operand (the second operand) to an
unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the destination operand
(the first operand). The source operand can be an XMM register or a memory location. The destination operand is
a general-purpose register. When the source operand is an XMM register, the single precision floating-point value
is contained in the low doubleword of the register.
When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result cannot be
represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked,
the integer value 2^w – 1 is returned, where w represents the number of bits in the destination format.
EVEX.W1 version: promotes the instruction to produce 64-bit data in 64-bit mode.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTTSS2USI (EVEX Encoded Version)
IF 64-bit Mode and OperandSize = 64
THEN
DEST[63:0] := Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[31:0]);
ELSE
DEST[31:0] := Convert_Single_Precision_Floating_Point_To_UInteger_Truncate(SRC[31:0]);
FI;
Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
VCVTUDQ2PD—Convert Packed Unsigned Doubleword Integers to Packed Double Precision
Floating-Point Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.F3.0F.W0 7A /r A V/V (AVX512VL AND Convert two packed unsigned doubleword integers
VCVTUDQ2PD xmm1 {k1}{z}, AVX512F) OR from ymm2/m64/m32bcst to packed double
xmm2/m64/m32bcst AVX10.11 precision floating-point values in zmm1 with
writemask k1.
EVEX.256.F3.0F.W0 7A /r A V/V (AVX512VL AND Convert four packed unsigned doubleword integers
VCVTUDQ2PD ymm1 {k1}{z}, AVX512F) OR from xmm2/m128/m32bcst to packed double
xmm2/m128/m32bcst AVX10.11 precision floating-point values in ymm1 with
writemask k1.
EVEX.512.F3.0F.W0 7A /r A V/V AVX512F Convert eight packed unsigned doubleword integers
VCVTUDQ2PD zmm1 {k1}{z}, OR AVX10.11 from ymm2/m256/m32bcst to eight packed double
ymm2/m256/m32bcst precision floating-point values in zmm1 with
writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts packed unsigned doubleword integers in the source operand (second operand) to packed double precision
floating-point values in the destination operand (first operand).
The source operand is a YMM/XMM/XMM (low 64 bits) register, a 256/128/64-bit memory location or a
256/128/64-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM
register conditionally updated with writemask k1.
Attempt to encode this instruction with EVEX embedded rounding is ignored.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTUDQ2PD (EVEX Encoded Versions) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_UInteger_To_Double_Precision_Floating_Point(SRC[k+31:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTUDQ2PD (EVEX Encoded Versions) When SRC Operand is a Memory Source
FOR j := 0 TO KL-1
i := j * 64
k := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
Convert_UInteger_To_Double_Precision_Floating_Point(SRC[31:0])
ELSE
DEST[i+63:i] :=
Convert_UInteger_To_Double_Precision_Floating_Point(SRC[k+31:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-53, “Type E5 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTUDQ2PH—Convert Packed Unsigned Doubleword Integers to Packed FP16 Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.F2.MAP5.W0 7A /r A V/V (AVX512-FP16 Convert four packed unsigned doubleword
VCVTUDQ2PH xmm1{k1}{z}, AND AVX512VL) integers from xmm2/m128/m32bcst to packed
xmm2/m128/m32bcst OR AVX10.11 FP16 values, and store the result in xmm1
subject to writemask k1.
EVEX.256.F2.MAP5.W0 7A /r A V/V (AVX512-FP16 Convert eight packed unsigned doubleword
VCVTUDQ2PH xmm1{k1}{z}, AND AVX512VL) integers from ymm2/m256/m32bcst to packed
ymm2/m256/m32bcst OR AVX10.11 FP16 values, and store the result in xmm1
subject to writemask k1.
EVEX.512.F2.MAP5.W0 7A /r A V/V AVX512-FP16 Convert sixteen packed unsigned doubleword
VCVTUDQ2PH ymm1{k1}{z}, OR AVX10.11 integers from zmm2/m512/m32bcst to packed
zmm2/m512/m32bcst {er} FP16 values, and store the result in ymm1
subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed unsigned doubleword integers in the source operand to packed FP16 values in the
destination operand. The destination elements are updated according to the writemask.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
If the result of the convert operation overflows and MXCSR.OM=0, then a SIMD exception will be raised with OE=1,
PE=1.
Operation
VCVTUDQ2PH dest, src
VL = 128, 256 or 512
KL := VL / 32
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.dword[0]
ELSE
tsrc := SRC.dword[j]
DEST.fp16[j] := Convert_unsigned_integer32_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL/2] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VCVTUDQ2PS—Convert Packed Unsigned Doubleword Integers to Packed Single Precision
Floating-Point Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.F2.0F.W0 7A /r A V/V (AVX512VL AND Convert four packed unsigned doubleword integers
VCVTUDQ2PS xmm1 {k1}{z}, AVX512F) OR from xmm2/m128/m32bcst to packed single
xmm2/m128/m32bcst AVX10.11 precision floating-point values in xmm1 with
writemask k1.
EVEX.256.F2.0F.W0 7A /r A V/V (AVX512VL AND Convert eight packed unsigned doubleword integers
VCVTUDQ2PS ymm1 {k1}{z}, AVX512F) OR from ymm2/m256/m32bcst to packed single
ymm2/m256/m32bcst AVX10.11 precision floating-point values in ymm1 with
writemask k1.
EVEX.512.F2.0F.W0 7A /r A V/V AVX512F Convert sixteen packed unsigned doubleword
VCVTUDQ2PS zmm1 {k1}{z}, OR AVX10.11 integers from zmm2/m512/m32bcst to sixteen
zmm2/m512/m32bcst {er} packed single precision floating-point values in
zmm1 with writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts packed unsigned doubleword integers in the source operand (second operand) to single precision
floating-point values in the destination operand (first operand).
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTUDQ2PS (EVEX Encoded Version) When SRC Operand is a Register
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_UInteger_To_Single_Precision_Floating_Point(SRC[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTUDQ2PS (EVEX Encoded Version) When SRC Operand is a Memory Source
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_UInteger_To_Single_Precision_Floating_Point(SRC[31:0])
ELSE
DEST[i+31:i] :=
Convert_UInteger_To_Single_Precision_Floating_Point(SRC[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
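An informative usage sketch follows (not part of the SDM text); the intrinsic names _mm512_set1_epi32, _mm512_cvtepu32_ps, and _mm512_storeu_ps are not listed on this page and are given as the usual AVX-512F names, i.e., as an assumption.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512i u = _mm512_set1_epi32((int)0xFFFFFFFFu);  /* 4294967295 in every dword */
    __m512  f = _mm512_cvtepu32_ps(u);                /* VCVTUDQ2PS: unsigned conversion */
    float out[16];
    _mm512_storeu_ps(out, f);
    printf("%.1f\n", out[0]);                         /* rounds (per MXCSR.RC) to 4294967296.0 */
    return 0;
}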
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTUQQ2PD—Convert Packed Unsigned Quadword Integers to Packed Double Precision
Floating-Point Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.F3.0F.W1 7A /r A V/V (AVX512VL AND Convert two packed unsigned quadword integers from
VCVTUQQ2PD xmm1 {k1}{z}, AVX512DQ) OR xmm2/m128/m64bcst to two packed double precision
xmm2/m128/m64bcst AVX10.11 floating-point values in xmm1 with writemask k1.
EVEX.256.F3.0F.W1 7A /r A V/V (AVX512VL AND Convert four packed unsigned quadword integers from
VCVTUQQ2PD ymm1 {k1}{z}, AVX512DQ) OR ymm2/m256/m64bcst to packed double precision
ymm2/m256/m64bcst AVX10.11 floating-point values in ymm1 with writemask k1.
EVEX.512.F3.0F.W1 7A /r A V/V AVX512DQ Convert eight packed unsigned quadword integers
VCVTUQQ2PD zmm1 {k1}{z}, OR AVX10.11 from zmm2/m512/m64bcst to eight packed double
zmm2/m512/m64bcst {er} precision floating-point values in zmm1 with
writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts packed unsigned quadword integers in the source operand (second operand) to packed double precision
floating-point values in the destination operand (first operand).
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcasted from a 64-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTUQQ2PD (EVEX Encoded Version) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL == 512) AND (EVEX.b == 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
Convert_UQuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VCVTUQQ2PD (EVEX Encoded Version) When SRC Operand is a Memory Source
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b == 1)
THEN
DEST[i+63:i] :=
Convert_UQuadInteger_To_Double_Precision_Floating_Point(SRC[63:0])
ELSE
DEST[i+63:i] :=
Convert_UQuadInteger_To_Double_Precision_Floating_Point(SRC[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTUQQ2PH—Convert Packed Unsigned Quadword Integers to Packed FP16 Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.F2.MAP5.W1 7A /r A V/V (AVX512-FP16 Convert two packed unsigned quadword
VCVTUQQ2PH xmm1{k1}{z}, AND AVX512VL) integers from xmm2/m128/m64bcst to packed
xmm2/m128/m64bcst OR AVX10.11 FP16 values, and store the result in xmm1
subject to writemask k1.
EVEX.256.F2.MAP5.W1 7A /r A V/V (AVX512-FP16 Convert four packed unsigned quadword
VCVTUQQ2PH xmm1{k1}{z}, AND AVX512VL) integers from ymm2/m256/m64bcst to packed
ymm2/m256/m64bcst OR AVX10.11 FP16 values, and store the result in xmm1
subject to writemask k1.
EVEX.512.F2.MAP5.W1 7A /r A V/V AVX512-FP16 Convert eight packed unsigned quadword
VCVTUQQ2PH xmm1{k1}{z}, OR AVX10.11 integers from zmm2/m512/m64bcst to packed
zmm2/m512/m64bcst {er} FP16 values, and store the result in xmm1
subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed unsigned quadword integers in the source operand to packed FP16 values in the
destination operand. The destination elements are updated according to the writemask.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
If the result of the convert operation overflows and MXCSR.OM=0, then a SIMD exception will be raised with OE=1,
PE=1.
Operation
VCVTUQQ2PH dest, src
VL = 128, 256 or 512
KL := VL / 64
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.qword[0]
ELSE
tsrc := SRC.qword[j]
DEST.fp16[j] := Convert_unsigned_integer64_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL/4] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VCVTUQQ2PS—Convert Packed Unsigned Quadword Integers to Packed Single Precision
Floating-Point Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.F2.0F.W1 7A /r A V/V (AVX512VL AND Convert two packed unsigned quadword integers from
VCVTUQQ2PS xmm1 {k1}{z}, AVX512DQ) OR xmm2/m128/m64bcst to packed single precision
xmm2/m128/m64bcst AVX10.11 floating-point values in xmm1 with writemask k1.
EVEX.256.F2.0F.W1 7A /r A V/V (AVX512VL AND Convert four packed unsigned quadword integers from
VCVTUQQ2PS xmm1 {k1}{z}, AVX512DQ) OR ymm2/m256/m64bcst to packed single precision
ymm2/m256/m64bcst AVX10.11 floating-point values in xmm1 with writemask k1.
EVEX.512.F2.0F.W1 7A /r A V/V AVX512DQ Convert eight packed unsigned quadword integers from
VCVTUQQ2PS ymm1 {k1}{z}, OR AVX10.11 zmm2/m512/m64bcst to eight packed single precision
zmm2/m512/m64bcst {er} floating-point values in ymm1 with writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts packed unsigned quadword integers in the source operand (second operand) to single precision floating-
point values in the destination operand (first operand).
EVEX encoded versions: The source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The destination operand is a YMM/XMM/XMM (low 64 bits) register conditionally updated with writemask k1.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation
VCVTUQQ2PS (EVEX Encoded Version) When SRC Operand is a Register
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
Convert_UQuadInteger_To_Single_Precision_Floating_Point(SRC[k+63:k])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
VCVTUQQ2PS (EVEX Encoded Version) When SRC Operand is a Memory Source
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
Convert_UQuadInteger_To_Single_Precision_Floating_Point(SRC[63:0])
ELSE
DEST[i+31:i] :=
Convert_UQuadInteger_To_Single_Precision_Floating_Point(SRC[k+63:k])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VCVTUSI2SD—Convert Unsigned Integer to Scalar Double Precision Floating-Point Value
Opcode/ Op / 64/32 CPUID Description
Instruction En Bit Mode Feature Flag
Support
EVEX.LLIG.F2.0F.W0 7B /r A V/V AVX512F Convert one unsigned doubleword integer from
VCVTUSI2SD xmm1, xmm2, r/m32 OR AVX10.11 r/m32 to one double precision floating-point
value in xmm1.
EVEX.LLIG.F2.0F.W1 7B /r A V/N.E.2 AVX512F Convert one unsigned quadword integer from
VCVTUSI2SD xmm1, xmm2, r/m64{er} OR AVX10.11 r/m64 to one double precision floating-point
value in xmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
2. For this specific instruction, EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.
Description
Converts an unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the second
source operand to a double precision floating-point value in the destination operand. The result is stored in the low
quadword of the destination operand. When conversion is inexact, the value returned is rounded according to the
rounding control bits in the MXCSR register.
The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and
destination operands are XMM registers. Bits (127:64) of the XMM register destination are copied from corre-
sponding bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX.W1 version: promotes the instruction to use 64-bit input value in 64-bit mode.
EVEX.W0 version: attempt to encode this instruction with EVEX embedded rounding is ignored.
Operation
VCVTUSI2SD (EVEX Encoded Version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-Bit Mode And OperandSize = 64
THEN
DEST[63:0] := Convert_UInteger_To_Double_Precision_Floating_Point(SRC2[63:0]);
ELSE
DEST[63:0] := Convert_UInteger_To_Double_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VCVTUSI2SD __m128d _mm_cvtu32_sd( __m128d s, unsigned a);
VCVTUSI2SD __m128d _mm_cvtu64_sd( __m128d s, unsigned __int64 a);
VCVTUSI2SD __m128d _mm_cvt_roundu64_sd( __m128d s, unsigned __int64 a, int r);
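As an informative usage sketch of the intrinsics listed above (not part of the SDM text; build flags enabling AVX-512F are assumed):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128d upper = _mm_set_sd(0.0);                          /* bits 127:64 are copied from here */
    __m128d a = _mm_cvtu32_sd(upper, 4000000000u);            /* W0 form: exact in double */
    __m128d b = _mm_cvtu64_sd(upper, 0xFFFFFFFFFFFFFFFFull);  /* W1 form, 64-bit mode only */
    printf("%.1f %.1f\n", _mm_cvtsd_f64(a), _mm_cvtsd_f64(b));
    return 0;
}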
Other Exceptions
See Table 2-50, “Type E3NF Class Exception Conditions” if W1; otherwise, see Table 2-61, “Type E10NF Class
Exception Conditions.”
VCVTUSI2SS—Convert Unsigned Integer to Scalar Single Precision Floating-Point Value
Opcode/ Op / 64/32 CPUID Description
Instruction En Bit Mode Feature Flag
Support
EVEX.LLIG.F3.0F.W0 7B /r A V/V AVX512F Convert one unsigned doubleword integer from r/m32
VCVTUSI2SS xmm1, xmm2, r/m32{er} OR AVX10.11 to one single precision floating-point value in
xmm1.
EVEX.LLIG.F3.0F.W1 7B /r A V/N.E.2 AVX512F Convert one unsigned quadword integer from r/m64
VCVTUSI2SS xmm1, xmm2, r/m64{er} OR AVX10.11 to one single precision floating-point value in
xmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
2. For this specific instruction, EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.
Description
Converts an unsigned doubleword integer (or unsigned quadword integer if operand size is 64 bits) in the source
operand (second operand) to a single precision floating-point value in the destination operand (first operand). The
source operand can be a general-purpose register or a memory location. The destination operand is an XMM
register. The result is stored in the low doubleword of the destination operand. When a conversion is inexact, the
value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding
control bits.
The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and
destination operands are XMM registers. Bits (127:32) of the XMM register destination are copied from corre-
sponding bits in the first source operand. Bits (MAXVL-1:128) of the destination register are zeroed.
EVEX.W1 version: promotes the instruction to use 64-bit input value in 64-bit mode.
Operation
VCVTUSI2SS (EVEX Encoded Version)
IF (SRC2 *is register*) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF 64-Bit Mode And OperandSize = 64
THEN
DEST[31:0] := Convert_UInteger_To_Single_Precision_Floating_Point(SRC2[63:0]);
ELSE
DEST[31:0] := Convert_UInteger_To_Single_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VCVTUSI2SS __m128 _mm_cvtu32_ss( __m128 s, unsigned a);
VCVTUSI2SS __m128 _mm_cvt_roundu32_ss( __m128 s, unsigned a, int r);
VCVTUSI2SS __m128 _mm_cvtu64_ss( __m128 s, unsigned __int64 a);
VCVTUSI2SS __m128 _mm_cvt_roundu64_ss( __m128 s, unsigned __int64 a, int r);
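As an informative usage sketch of the intrinsics listed above (not part of the SDM text); the _MM_FROUND_TO_ZERO and _MM_FROUND_NO_EXC rounding constants are standard immintrin.h macros and are assumed here.

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 upper = _mm_set_ss(0.0f);                 /* bits 127:32 are copied from here */
    unsigned long long x = 0xFFFFFFFFFFFFFFFFull;    /* 2^64 - 1: not exactly representable in float */
    __m128 near = _mm_cvtu64_ss(upper, x);           /* rounds per MXCSR.RC */
    __m128 down = _mm_cvt_roundu64_ss(upper, x,
                      _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);  /* {er}: round toward zero */
    printf("%.1f %.1f\n", _mm_cvtss_f32(near), _mm_cvtss_f32(down));  /* results differ in the last rounded bit */
    return 0;
}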
Other Exceptions
See Table 2-50, “Type E3NF Class Exception Conditions.”
VCVTUW2PH—Convert Packed Unsigned Word Integers to FP16 Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.F2.MAP5.W0 7D /r A V/V (AVX512-FP16 Convert eight packed unsigned word integers
VCVTUW2PH xmm1{k1}{z}, AND AVX512VL) from xmm2/m128/m16bcst to FP16 values, and
xmm2/m128/m16bcst OR AVX10.11 store the result in xmm1 subject to writemask k1.
EVEX.256.F2.MAP5.W0 7D /r A V/V (AVX512-FP16 Convert sixteen packed unsigned word integers
VCVTUW2PH ymm1{k1}{z}, AND AVX512VL) from ymm2/m256/m16bcst to FP16 values, and
ymm2/m256/m16bcst OR AVX10.11 store the result in ymm1 subject to writemask k1.
EVEX.512.F2.MAP5.W0 7D /r A V/V AVX512-FP16 Convert thirty-two packed unsigned word
VCVTUW2PH zmm1{k1}{z}, OR AVX10.11 integers from zmm2/m512/m16bcst to FP16
zmm2/m512/m16bcst {er} values, and store the result in zmm1 subject to
writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed unsigned word integers in the source operand to FP16 values in the destination
operand. When conversion is inexact, the value returned is rounded according to the rounding control bits in the
MXCSR register or embedded rounding controls.
The destination elements are updated according to the writemask.
If the result of the convert operation overflows and MXCSR.OM=0, then a SIMD exception will be raised with OE=1,
PE=1.
Operation
VCVTUW2PH dest, src
VL = 128, 256 or 512
KL := VL / 16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.word[0]
ELSE
tsrc := SRC.word[j]
DEST.fp16[j] := Convert_unsigned_integer16_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts packed signed word integers in the source operand to FP16 values in the destination
operand. When conversion is inexact, the value returned is rounded according to the rounding control bits in the
MXCSR register or embedded rounding controls.
The destination elements are updated according to the writemask.
Operation
VCVTW2PH dest, src
VL = 128, 256 or 512
KL := VL / 16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *SRC is memory* and EVEX.b = 1:
tsrc := SRC.word[0]
ELSE
tsrc := SRC.word[j]
DEST.fp16[j] := Convert_integer16_to_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the processor at
run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector width and
as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compute packed SAD (sum of absolute differences) word results of unsigned bytes from two 32-bit dword
elements. Packed SAD word results are calculated in multiples of qword superblocks, producing 4 SAD word results
in each 64-bit superblock of the destination register.
Within each superblock of packed word results, the SAD results from two 32-bit dword elements are calculated as
follows:
• The lower two word results are each calculated from the SAD operation between a sliding dword element within
a qword superblock of an intermediate vector and a stationary dword element in the corresponding qword
superblock of the first source operand. The intermediate vector, see “Tmp1” in Figure 5-8, is constructed from
the second source operand, using the imm8 byte as shuffle control to select dword elements within a 128-bit
lane of the second source operand. The two sliding dword elements in a qword superblock of Tmp1 are located
at byte offsets 0 and 1 within the superblock, respectively. The stationary dword element in the qword
superblock of the first source operand is located at byte offset 0.
• The next two word results are each calculated from the SAD operation between a sliding dword element within
a qword superblock of the intermediate vector Tmp1 and a second stationary dword element in the
corresponding qword superblock of the first source operand. The two sliding dword elements in a qword
superblock of Tmp1 are located at byte offsets 2 and 3 within the superblock, respectively. The stationary dword
element in the qword superblock of the first source operand is located at byte offset 4.
• The intermediate vector is constructed in 128-bit lanes. Within each 128-bit lane, each dword element of the
intermediate vector is selected by a two-bit field within the imm8 byte, applied to the corresponding 128 bits of
the second source operand. The imm8 byte serves as dword shuffle control within each 128-bit lane of the
intermediate vector and the second source operand, similarly to PSHUFD.
[Figure 5-8: 64-bit superblock of the SAD operation in VDBPSADBW. Each two-bit imm8 field selects one dword of the second source operand for Tmp1: 00B selects DW0, 01B selects DW1, 10B selects DW2, 11B selects DW3.]
Operation
VDBPSADBW (EVEX Encoded Versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
Selection of quadruplets:
FOR I = 0 to VL step 128
TMP1[I+31:I] := select (SRC2[I+127: I], imm8[1:0])
TMP1[I+63: I+32] := select (SRC2[I+127: I], imm8[3:2])
TMP1[I+95: I+64] := select (SRC2[I+127: I], imm8[5:4])
TMP1[I+127: I+96] := select (SRC2[I+127: I], imm8[7:6])
END FOR
SAD of quadruplets:
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TMP_DEST[i+15:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
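The selection and SAD steps above can be modeled per 128-bit lane in plain C. The sketch below is illustrative only; the names are not from the SDM, and it covers a single lane (the full instruction repeats this for every 128-bit lane and then applies the word-granular writemask).

#include <stdint.h>

/* Illustrative model of one 128-bit lane of VDBPSADBW: imm8 shuffles the
   dwords of src2 into tmp1, then each 64-bit superblock yields four
   word-sized SAD results. */
static unsigned abs_diff(uint8_t a, uint8_t b) { return a > b ? a - b : b - a; }

static void vdbpsadbw_lane(uint16_t dst[8], const uint8_t src1[16],
                           const uint8_t src2[16], uint8_t imm8)
{
    uint8_t tmp1[16];
    for (int d = 0; d < 4; d++) {                  /* TMP1 dword d := select(SRC2, imm8[2d+1:2d]) */
        int sel = (imm8 >> (2 * d)) & 3;
        for (int b = 0; b < 4; b++)
            tmp1[4 * d + b] = src2[4 * sel + b];
    }
    for (int q = 0; q < 2; q++) {                  /* two qword superblocks per lane */
        const uint8_t *s1 = src1 + 8 * q, *t1 = tmp1 + 8 * q;
        for (int w = 0; w < 4; w++) {
            /* words 0/1: tmp1 slides by 0/1 bytes against the src1 dword at offset 0;
               words 2/3: tmp1 slides by 2/3 bytes against the src1 dword at offset 4. */
            int slide = w, stationary = (w < 2) ? 0 : 4;
            unsigned sad = 0;
            for (int b = 0; b < 4; b++)
                sad += abs_diff(s1[stationary + b], t1[slide + b]);
            dst[4 * q + w] = (uint16_t)sad;
        }
    }
}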
Other Exceptions
See Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction divides packed FP16 values from the first source operand by the corresponding elements in the
second source operand, storing the packed FP16 result in the destination operand. The destination elements are
updated according to the writemask.
Operation
VDIVPH (EVEX Encoded Versions) When SRC2 Operand is a Register
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.fp16[j] := SRC1.fp16[j] / SRC2.fp16[j]
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VDIVPH (EVEX Encoded Versions) When SRC2 Operand is a Memory Source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
DEST.fp16[j] := SRC1.fp16[j] / SRC2.fp16[0]
ELSE:
DEST.fp16[j] := SRC1.fp16[j] / SRC2.fp16[j]
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction divides the low FP16 value from the first source operand by the corresponding value in the second
source operand, storing the FP16 result in the destination operand. Bits 127:16 of the destination operand are
copied from the corresponding bits of the first source operand. Bits MAXVL-1:128 of the destination operand are
zeroed. The low FP16 element of the destination is updated according to the writemask.
Operation
VDIVSH (EVEX Encoded Versions)
IF EVEX.b = 1 and SRC2 is a register:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)
IF k1[0] OR *no writemask*:
DEST.fp16[0] := SRC1.fp16[0] / SRC2.fp16[0]
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// else dest.fp16[0] remains unchanged
DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0
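For illustration, the scalar merge behavior can be sketched in C as below. The function name is not from the SDM, and float stands in for FP16, so rounding and exception behavior are not modeled; the 128-bit register is viewed as eight 16-bit lanes.

#include <stdint.h>

/* Illustrative model of VDIVSH: only lane 0 is computed; lanes 7:1 of the
   destination come from the first source. Bits above 127 are zeroed by the
   hardware (not shown here). */
static void vdivsh_model(float dst[8], const float src1[8], const float src2[8],
                         int k1_bit0, int zeroing)
{
    for (int j = 1; j < 8; j++)
        dst[j] = src1[j];                 /* DEST[127:16] := SRC1[127:16] */
    if (k1_bit0)
        dst[0] = src1[0] / src2[0];
    else if (zeroing)
        dst[0] = 0.0f;
    /* else: merging-masking, dst[0] remains unchanged */
}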
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a SIMD dot-product of two BF16 pairs and accumulates into a packed single precision
register.
“Round to nearest even” rounding mode is used when doing each accumulation of the FMA. Output denormals are
always flushed to zero and input denormals are always treated as zero. MXCSR is not consulted nor updated.
Operation
Define make_fp32(x):
// The x parameter is bfloat16. Pack it in to upper 16b of a dword. The bit pattern is a legal fp32 value. Return that bit pattern.
dword := 0
dword[31:16] := x
RETURN dword
VDPBF16PS srcdest, src1, src2
VL = (128, 256, 512)
KL = VL/32
origdest := srcdest
FOR i := 0 to KL-1:
IF k1[ i ] or *no writemask*:
IF src2 is memory and evex.b == 1:
t := src2.dword[0]
ELSE:
t := src2.dword[ i ]
// FP32 FMA with daz in, ftz out and RNE rounding. MXCSR neither consulted nor updated.
srcdest.fp32[ i ] += make_fp32(src1.bfloat16[2*i+1]) * make_fp32(t.bfloat16[1])
srcdest.fp32[ i ] += make_fp32(src1.bfloat16[2*i+0]) * make_fp32(t.bfloat16[0])
ELSE IF *zeroing*:
srcdest.dword[ i ] := 0
ELSE: // merge masking, dest element unchanged
srcdest.dword[ i ] := origdest.dword[ i ]
srcdest[MAXVL-1:VL] := 0
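A minimal C sketch of one output element follows, assuming IEEE-754 binary32 floats; it shows why make_fp32 is just a 16-bit left shift of the BF16 bit pattern. The function names are illustrative, and the round-to-nearest-even, DAZ, and FTZ details of the real instruction are not modeled.

#include <stdint.h>
#include <string.h>

/* Place the BF16 bit pattern in the upper half of an FP32 pattern; the
   result is a valid float with the same sign/exponent and a truncated
   mantissa. */
static float make_fp32(uint16_t bf16)
{
    uint32_t bits = (uint32_t)bf16 << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* One dword lane: accumulate the dot product of a BF16 pair from each
   source into the FP32 accumulator. */
static float vdpbf16ps_element(float acc, const uint16_t a_pair[2],
                               const uint16_t b_pair[2])
{
    acc += make_fp32(a_pair[1]) * make_fp32(b_pair[1]);
    acc += make_fp32(a_pair[0]) * make_fp32(b_pair[0]);
    return acc;
}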
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
VEXPANDPD—Load Sparse Packed Double Precision Floating-Point Values From Dense Memory
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W1 88 /r VEXPANDPD xmm1 {k1}{z}, xmm2/m128 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Expand packed double precision floating-point values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.66.0F38.W1 88 /r VEXPANDPD ymm1 {k1}{z}, ymm2/m256 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Expand packed double precision floating-point values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.66.0F38.W1 88 /r VEXPANDPD zmm1 {k1}{z}, zmm2/m512 | A | V/V | AVX512F OR AVX10.1 (Note 1) | Expand packed double precision floating-point values from zmm2/m512 to zmm1 using writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Expand (load) up to 8/4/2, contiguous, double precision floating-point values of the input vector in the source
operand (the second operand) to sparse elements in the destination operand (the first operand) selected by the
writemask k1.
The destination operand is a ZMM/YMM/XMM register, the source operand can be a ZMM/YMM/XMM register or a
512/256/128-bit memory location.
The input vector starts from the lowest element in the source operand. The writemask register k1 selects the desti-
nation elements (a partial vector or sparse elements if less than 8 elements) to be replaced by the ascending
elements in the input vector. Destination elements not selected by the writemask k1 are either unmodified or
zeroed, depending on EVEX.z.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
Operation
VEXPANDPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
k := 0
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
DEST[i+63:i] := SRC[k+63:k];
k := k + 64
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
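The expand-load pattern above can be illustrated with a small C model (the name and the zmm-only shape are assumptions of this sketch): dense source elements are consumed in order and placed only in the destination lanes whose mask bit is set.

#include <stdint.h>

/* Illustrative model of VEXPANDPD with a 512-bit destination (KL = 8). */
static void vexpandpd_model(double dst[8], const double src[8],
                            uint8_t k1, int zeroing)
{
    int k = 0;                      /* index of the next dense source element */
    for (int j = 0; j < 8; j++) {
        if (k1 & (1u << j))
            dst[j] = src[k++];      /* next contiguous element goes to lane j */
        else if (zeroing)
            dst[j] = 0.0;
        /* else: merging-masking, dst[j] remains unchanged */
    }
}

For example, with k1 = 0x29 (bits 0, 3, and 5 set), src[0], src[1], and src[2] land in dst[0], dst[3], and dst[5].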
Other Exceptions
See Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VEXPANDPS—Load Sparse Packed Single Precision Floating-Point Values From Dense Memory
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.0F38.W0 88 /r VEXPANDPS xmm1 {k1}{z}, xmm2/m128 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Expand packed single precision floating-point values from xmm2/m128 to xmm1 using writemask k1.
EVEX.256.66.0F38.W0 88 /r VEXPANDPS ymm1 {k1}{z}, ymm2/m256 | A | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Expand packed single precision floating-point values from ymm2/m256 to ymm1 using writemask k1.
EVEX.512.66.0F38.W0 88 /r VEXPANDPS zmm1 {k1}{z}, zmm2/m512 | A | V/V | AVX512F OR AVX10.1 (Note 1) | Expand packed single precision floating-point values from zmm2/m512 to zmm1 using writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Expand (load) up to 16/8/4, contiguous, single precision floating-point values of the input vector in the source
operand (the second operand) to sparse elements of the destination operand (the first operand) selected by the
writemask k1.
The destination operand is a ZMM/YMM/XMM register, the source operand can be a ZMM/YMM/XMM register or a
512/256/128-bit memory location.
The input vector starts from the lowest element in the source operand. The writemask k1 selects the destination
elements (a partial vector or sparse elements if less than 16 elements) to be replaced by the ascending elements
in the input vector. Destination elements not selected by the writemask k1 are either unmodified or zeroed,
depending on EVEX.z.
EVEX.vvvv is reserved and must be 1111b; otherwise, the instruction will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
Operation
VEXPANDPS (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
k := 0
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
DEST[i+31:i] := SRC[k+31:k];
k := k + 32
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
See Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VEXTRACTF128/VEXTRACTF32x4/VEXTRACTF64x2/VEXTRACTF32x8/VEXTRACTF64x4—
Extract Packed Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
VEX.256.66.0F3A.W0 19 /r ib VEXTRACTF128 xmm1/m128, ymm2, imm8 | A | V/V | AVX | Extract 128 bits of packed floating-point values from ymm2 and store results in xmm1/m128.
EVEX.256.66.0F3A.W0 19 /r ib VEXTRACTF32X4 xmm1/m128 {k1}{z}, ymm2, imm8 | C | V/V | (AVX512VL AND AVX512F) OR AVX10.1 (Note 1) | Extract 128 bits of packed single precision floating-point values from ymm2 and store results in xmm1/m128 subject to writemask k1.
EVEX.512.66.0F3A.W0 19 /r ib VEXTRACTF32x4 xmm1/m128 {k1}{z}, zmm2, imm8 | C | V/V | AVX512F OR AVX10.1 (Note 1) | Extract 128 bits of packed single precision floating-point values from zmm2 and store results in xmm1/m128 subject to writemask k1.
EVEX.256.66.0F3A.W1 19 /r ib VEXTRACTF64X2 xmm1/m128 {k1}{z}, ymm2, imm8 | B | V/V | (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1) | Extract 128 bits of packed double precision floating-point values from ymm2 and store results in xmm1/m128 subject to writemask k1.
EVEX.512.66.0F3A.W1 19 /r ib VEXTRACTF64X2 xmm1/m128 {k1}{z}, zmm2, imm8 | B | V/V | AVX512DQ OR AVX10.1 (Note 1) | Extract 128 bits of packed double precision floating-point values from zmm2 and store results in xmm1/m128 subject to writemask k1.
EVEX.512.66.0F3A.W0 1B /r ib VEXTRACTF32X8 ymm1/m256 {k1}{z}, zmm2, imm8 | D | V/V | AVX512DQ OR AVX10.1 (Note 1) | Extract 256 bits of packed single precision floating-point values from zmm2 and store results in ymm1/m256 subject to writemask k1.
EVEX.512.66.0F3A.W1 1B /r ib VEXTRACTF64x4 ymm1/m256 {k1}{z}, zmm2, imm8 | C | V/V | AVX512F OR AVX10.1 (Note 1) | Extract 256 bits of packed double precision floating-point values from zmm2 and store results in ymm1/m256 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VEXTRACTF128/VEXTRACTF32x4 and VEXTRACTF64x2 extract 128 bits of packed floating-point values from the source operand (the second operand) and store them to the low 128 bits of the destination operand (the first operand). The 128-bit data extraction occurs at a 128-bit granular offset specified by imm8[0] (256-bit source) or imm8[1:0] (512-bit source) as the multiply factor. The destination may be either a vector register or a 128-bit memory location.
VEXTRACTF32x4: The low 128 bits of the destination operand are updated at 32-bit granularity according to the writemask.
VEXTRACTF64x2: The low 128 bits of the destination operand are updated at 64-bit granularity according to the writemask.
VEXTRACTF32x8 and VEXTRACTF64x4 extract 256 bits of packed floating-point values from the source operand (the second operand) and store them to the low 256 bits of the destination operand (the first operand). The 256-bit data extraction occurs at a 256-bit granular offset specified by imm8[0] as the multiply factor. The destination may be either a vector register or a 256-bit memory location.
Operation
VEXTRACTF32x4 (EVEX Encoded Versions) When Destination is a Register
VL = 256, 512
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC1[127:0]
1: TMP_DEST[127:0] := SRC1[255:128]
ESAC.
FI;
IF VL = 512
CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC1[127:0]
01: TMP_DEST[127:0] := SRC1[255:128]
10: TMP_DEST[127:0] := SRC1[383:256]
11: TMP_DEST[127:0] := SRC1[511:384]
ESAC.
FI;
FOR j := 0 TO 3
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:128] := 0
FOR j := 0 TO 3
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR
FOR j := 0 TO 1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:128] := 0
FOR j := 0 TO 1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
FI;
ENDFOR
FOR j := 0 TO 7
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:256] := 0
FOR j := 0 TO 7
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR
FOR j := 0 TO 3
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE ; merging-masking
*DEST[i+63:i] remains unchanged*
FI;
ENDFOR
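For illustration, the register-destination form for a 512-bit source can be modeled in C as below (the function name is an invention of this sketch): imm8[1:0] selects one of the four 128-bit chunks, after which the writemask is applied at 32-bit granularity.

#include <stdint.h>
#include <string.h>

/* Illustrative model of VEXTRACTF32x4 zmm -> xmm. */
static void vextractf32x4_model(float dst[4], const float src[16],
                                int imm8, uint8_t k1, int zeroing)
{
    float tmp[4];
    memcpy(tmp, &src[4 * (imm8 & 3)], sizeof tmp);  /* pick 128-bit chunk imm8[1:0] */
    for (int j = 0; j < 4; j++) {
        if (k1 & (1u << j))
            dst[j] = tmp[j];
        else if (zeroing)
            dst[j] = 0.0f;
        /* else: merging-masking, dst[j] remains unchanged */
    }
}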
Other Exceptions
VEX-encoded instructions, see Table 2-23, “Type 6 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-56, “Type E6NF Class Exception Conditions.”
Additionally:
#UD IF VEX.L = 0.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VEXTRACTI128/VEXTRACTI32x4 and VEXTRACTI64x2 extract 128 bits of packed integer values from the source operand (the second operand) and store them to the low 128 bits of the destination operand (the first operand). The 128-bit data extraction occurs at a 128-bit granular offset specified by imm8[0] (256-bit source) or imm8[1:0] (512-bit source) as the multiply factor. The destination may be either a vector register or a 128-bit memory location.
VEXTRACTI32x4: The low 128-bit of the destination operand is updated at 32-bit granularity according to the
writemask.
VEXTRACTI64x2: The low 128-bit of the destination operand is updated at 64-bit granularity according to the
writemask.
VEXTRACTI32x8 and VEXTRACTI64x4 extract 256 bits of packed integer values from the source operand (the second operand) and store them to the low 256 bits of the destination operand (the first operand). The 256-bit data extraction occurs at a 256-bit granular offset specified by imm8[0] as the multiply factor. The destination may be either a vector register or a 256-bit memory location.
Operation
VEXTRACTI32x4 (EVEX encoded versions) when destination is a register
VL = 256, 512
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC1[127:0]
1: TMP_DEST[127:0] := SRC1[255:128]
ESAC.
FI;
IF VL = 512
CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC1[127:0]
01: TMP_DEST[127:0] := SRC1[255:128]
10: TMP_DEST[127:0] := SRC1[383:256]
11: TMP_DEST[127:0] := SRC1[511:384]
ESAC.
FI;
FOR j := 0 TO 3
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:128] := 0
FOR j := 0 TO 3
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR
FOR j := 0 TO 1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:128] := 0
FOR j := 0 TO 1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE *DEST[i+63:i] remains unchanged* ; merging-masking
FI;
ENDFOR
FOR j := 0 TO 7
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:256] := 0
FOR j := 0 TO 7
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE *DEST[i+31:i] remains unchanged* ; merging-masking
FI;
ENDFOR
FOR j := 0 TO 3
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:256] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-23, “Type 6 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-56, “Type E6NF Class Exception Conditions.”
Additionally:
#UD IF VEX.L = 0.
#UD If VEX.vvvv != 1111B or EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a complex multiply and accumulate operation. There are normal and complex conjugate
forms of the operation.
The broadcasting and masking for this operation is done on 32-bit quantities representing a pair of FP16 values.
Rounding is performed at every FMA (fused multiply and add) boundary. Execution occurs as if all MXCSR excep-
tions are masked. MXCSR status bits are updated to reflect exceptional conditions.
Operation
VFCMADDCPH dest{k1}, src1, src2 (AVX512)
VL = 128, 256 or 512
KL := VL/32
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF broadcasting and src2 is memory:
tsrc2.fp16[2*i+0] := src2.fp16[0]
tsrc2.fp16[2*i+1] := src2.fp16[1]
ELSE:
tsrc2.fp16[2*i+0] := src2.fp16[2*i+0]
tsrc2.fp16[2*i+1] := src2.fp16[2*i+1]
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
tmp[2*i+0] := dest.fp16[2*i+0] + src1.fp16[2*i+0] * tsrc2.fp16[2*i+0]
tmp[2*i+1] := dest.fp16[2*i+1] + src1.fp16[2*i+1] * tsrc2.fp16[2*i+0]
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
// conjugate version subtracts odd final term
dest.fp16[2*i+0] := tmp[2*i+0] + src1.fp16[2*i+1] * tsrc2.fp16[2*i+1]
dest.fp16[2*i+1] := tmp[2*i+1] - src1.fp16[2*i+0] * tsrc2.fp16[2*i+1]
ELSE IF *zeroing*:
dest.fp16[2*i+0] := 0
dest.fp16[2*i+1] := 0
DEST[MAXVL-1:VL] := 0
VFMADDCPH dest{k1}, src1, src2 (AVX512)
VL = 128, 256 or 512
KL := VL/32
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF broadcasting and src2 is memory:
tsrc2.fp16[2*i+0] := src2.fp16[0]
tsrc2.fp16[2*i+1] := src2.fp16[1]
ELSE:
tsrc2.fp16[2*i+0] := src2.fp16[2*i+0]
tsrc2.fp16[2*i+1] := src2.fp16[2*i+1]
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
tmp[2*i+0] := dest.fp16[2*i+0] + src1.fp16[2*i+0] * tsrc2.fp16[2*i+0]
tmp[2*i+1] := dest.fp16[2*i+1] + src1.fp16[2*i+1] * tsrc2.fp16[2*i+0]
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
// non-conjugate version subtracts even term
dest.fp16[2*i+0] := tmp[2*i+0] - src1.fp16[2*i+1] * tsrc2.fp16[2*i+1]
dest.fp16[2*i+1] := tmp[2*i+1] + src1.fp16[2*i+0] * tsrc2.fp16[2*i+1]
ELSE IF *zeroing*:
dest.fp16[2*i+0] := 0
dest.fp16[2*i+1] := 0
DEST[MAXVL-1:VL] := 0
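The two forms differ only in whether the second source is treated as conjugated. A scalar C sketch of one complex element (a {real, imaginary} pair) follows; float stands in for FP16 and the per-FMA rounding of the real instruction is not modeled, so this is illustrative only.

/* Illustrative model of one complex element of VFMADDCPH (conjugate = 0)
   and VFCMADDCPH (conjugate = 1). Each array holds {real, imaginary}. */
static void cmadd_pair(float dst[2], const float src1[2], const float src2[2],
                       int conjugate)
{
    float tmp_re = dst[0] + src1[0] * src2[0];
    float tmp_im = dst[1] + src1[1] * src2[0];
    if (conjugate) {            /* dst += src1 * conj(src2) */
        dst[0] = tmp_re + src1[1] * src2[1];
        dst[1] = tmp_im - src1[0] * src2[1];
    } else {                    /* dst += src1 * src2 */
        dst[0] = tmp_re - src1[1] * src2[1];
        dst[1] = tmp_im + src1[0] * src2[1];
    }
}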
Other Exceptions
EVEX-encoded instructions, see Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If (dest_reg == src1_reg) or (dest_reg == src2_reg).
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a complex multiply and accumulate operation. There are normal and complex conjugate
forms of the operation.
The masking for this operation is done on 32-bit quantities representing a pair of FP16 values.
Bits 127:32 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
Rounding is performed at every FMA (fused multiply and add) boundary. Execution occurs as if all MXCSR excep-
tions are masked. MXCSR status bits are updated to reflect exceptional conditions.
Operation
VFCMADDCSH dest{k1}, src1, src2 (AVX512)
IF k1[0] or *no writemask*:
tmp[0] := dest.fp16[0] + src1.fp16[0] * src2.fp16[0]
tmp[1] := dest.fp16[1] + src1.fp16[1] * src2.fp16[0]
Other Exceptions
EVEX-encoded instructions, see Table 2-60, “Type E10 Class Exception Conditions.”
Additionally:
#UD If (dest_reg == src1_reg) or (dest_reg == src2_reg).
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a complex multiply operation. There are normal and complex conjugate forms of the oper-
ation. The broadcasting and masking for this operation is done on 32-bit quantities representing a pair of FP16
values.
Rounding is performed at every FMA (fused multiply and add) boundary. Execution occurs as if all MXCSR excep-
tions are masked. MXCSR status bits are updated to reflect exceptional conditions.
Operation
VFCMULCPH dest{k1}, src1, src2 (AVX512)
VL = 128, 256 or 512
KL := VL/32
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF broadcasting and src2 is memory:
tsrc2.fp16[2*i+0] := src2.fp16[0]
tsrc2.fp16[2*i+1] := src2.fp16[1]
ELSE:
tsrc2.fp16[2*i+0] := src2.fp16[2*i+0]
tsrc2.fp16[2*i+1] := src2.fp16[2*i+1]
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
tmp.fp16[2*i+0] := src1.fp16[2*i+0] * tsrc2.fp16[2*i+0]
tmp.fp16[2*i+1] := src1.fp16[2*i+1] * tsrc2.fp16[2*i+0]
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
// conjugate version subtracts odd final term
dest.fp16[2*i+0] := tmp.fp16[2*i+0] + src1.fp16[2*i+1] * tsrc2.fp16[2*i+1]
dest.fp16[2*i+1] := tmp.fp16[2*i+1] - src1.fp16[2*i+0] * tsrc2.fp16[2*i+1]
ELSE IF *zeroing*:
dest.fp16[2*i+0] := 0
dest.fp16[2*i+1] := 0
DEST[MAXVL-1:VL] := 0
VFMULCPH dest{k1}, src1, src2 (AVX512)
VL = 128, 256 or 512
KL := VL/32
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF broadcasting and src2 is memory:
tsrc2.fp16[2*i+0] := src2.fp16[0]
tsrc2.fp16[2*i+1] := src2.fp16[1]
ELSE:
tsrc2.fp16[2*i+0] := src2.fp16[2*i+0]
tsrc2.fp16[2*i+1] := src2.fp16[2*i+1]
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
tmp.fp16[2*i+0] := src1.fp16[2*i+0] * tsrc2.fp16[2*i+0]
tmp.fp16[2*i+1] := src1.fp16[2*i+1] * tsrc2.fp16[2*i+0]
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
// non-conjugate version subtracts last even term
dest.fp16[2*i+0] := tmp.fp16[2*i+0] - src1.fp16[2*i+1] * tsrc2.fp16[2*i+1]
dest.fp16[2*i+1] := tmp.fp16[2*i+1] + src1.fp16[2*i+0] * tsrc2.fp16[2*i+1]
ELSE IF *zeroing*:
dest.fp16[2*i+0] := 0
dest.fp16[2*i+1] := 0
DEST[MAXVL-1:VL] := 0
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a complex multiply operation. There are normal and complex conjugate forms of the oper-
ation. The masking for this operation is done on 32-bit quantities representing a pair of FP16 values.
Bits 127:32 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
Rounding is performed at every FMA (fused multiply and add) boundary. Execution occurs as if all MXCSR excep-
tions are masked. MXCSR status bits are updated to reflect exceptional conditions.
Operation
VFCMULCSH dest{k1}, src1, src2 (AVX512)
KL := VL / 32
Other Exceptions
EVEX-encoded instructions, see Table 2-60, “Type E10 Class Exception Conditions.”
Additionally:
#UD If (dest_reg == src1_reg) or (dest_reg == src2_reg).
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Perform fix-up of quad-word elements encoded in double precision floating-point format in the first source operand
(the second operand) using a 32-bit, two-level look-up table specified in the corresponding quadword element of
the second source operand (the third operand) with exception reporting specifier imm8. The elements that are
fixed-up are selected by mask bits of 1 specified in the opmask k1. Mask bits of 0 in the opmask k1 or table
response action of 0000b preserves the corresponding element of the first operand. The fixed-up elements from
the first source operand and the preserved element in the first operand are combined as the final results in the
destination operand (the first operand).
The destination and the first source operands are ZMM/YMM/XMM registers. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-
bit memory location.
The two-level look-up table performs a fix-up of each double precision floating-point input in the first source
operand by decoding the input data encoding into 8 token types. A response table is defined for each token type
that converts the input encoding in the first source operand with one of 16 response actions.
This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source
so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction
sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF
according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x=approx(1/0), yields an incor-
rect result. To deal with this, VFIXUPIMMPD can be used after the N-R reciprocal sequence to set the result to the
correct value (i.e., INF when the input is 0).
If MXCSR.DAZ is not set, denormal input elements in the first source operand are considered as normal inputs and
do not trigger any fixup nor fault reporting.
Imm8 is used to set the required flags reporting. It supports #ZE and #IE fault reporting (see details below).
MXCSR mask bits are ignored and are treated as if all mask bits are set (masked response). If any of the imm8 bits is set and the condition is met for fault reporting, MXCSR.IE or MXCSR.ZE might be updated.
Operation
enum TOKEN_TYPE
{
QNAN_TOKEN := 0,
SNAN_TOKEN := 1,
ZERO_VALUE_TOKEN := 2,
POS_ONE_VALUE_TOKEN := 3,
NEG_INF_TOKEN := 4,
POS_INF_TOKEN := 5,
NEG_VALUE_TOKEN := 6,
POS_VALUE_TOKEN := 7
}
CASE(token_response[3:0]) {
0000: dest[63:0] := dest[63:0]; ; preserve content of DEST
0001: dest[63:0] := tsrc[63:0]; ; pass through src1 normal input value, denormal as zero
0010: dest[63:0] := QNaN(tsrc[63:0]);
0011: dest[63:0] := QNAN_Indefinite;
0100: dest[63:0] := -INF;
0101: dest[63:0] := +INF;
0110: dest[63:0] := tsrc.sign? -INF : +INF;
0111: dest[63:0] := -0;
1000: dest[63:0] := +0;
1001: dest[63:0] := -1;
1010: dest[63:0] := +1;
1011: dest[63:0] := ½;
1100: dest[63:0] := 90.0;
1101: dest[63:0] := PI/2;
1110: dest[63:0] := MAX_FLOAT;
1111: dest[63:0] := -MAX_FLOAT;
} ; end of token_response CASE
VFIXUPIMMPD
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+63:i] := FIXUPIMM_DP(DEST[i+63:i], SRC1[i+63:i], SRC2[63:0], imm8 [7:0])
ELSE
DEST[i+63:i] := FIXUPIMM_DP(DEST[i+63:i], SRC1[i+63:i], SRC2[i+63:i], imm8 [7:0])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
imm8 fault reporting control (one bit per condition; #ZE = zero-divide exception, #IE = invalid exception):
bit 0: ZERO #ZE; bit 1: ZERO #IE; bit 2: ONE #ZE; bit 3: ONE #IE; bit 4: SNaN #IE; bit 5: -INF #IE; bit 6: -VE (negative) #IE; bit 7: +INF #IE.
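The control flow of the per-element fix-up can be sketched in C as follows. This is illustrative only: the names are not from the SDM, SNaN vs. QNaN classification, MXCSR.DAZ handling, and imm8 fault reporting are omitted, and only a few of the sixteen response actions are spelled out.

#include <math.h>
#include <stdint.h>

/* Token types in the order used to index the 32-bit table element. */
enum token { QNAN_T, SNAN_T, ZERO_T, POS_ONE_T, NEG_INF_T, POS_INF_T,
             NEG_T, POS_T };

static enum token classify(double x)
{
    if (isnan(x)) return QNAN_T;                 /* SNaN not distinguished here */
    if (x == 0.0) return ZERO_T;
    if (x == 1.0) return POS_ONE_T;
    if (isinf(x)) return x < 0 ? NEG_INF_T : POS_INF_T;
    return x < 0 ? NEG_T : POS_T;
}

/* dst = DEST element, src1 = element being classified, table = SRC2 element. */
static double fixupimm_dp_model(double dst, double src1, uint32_t table)
{
    int response = (table >> (4 * classify(src1))) & 0xF;   /* 4-bit action */
    switch (response) {
    case 0x0: return dst;          /* preserve DEST */
    case 0x1: return src1;         /* pass through the input */
    case 0x5: return INFINITY;     /* +INF */
    case 0x8: return +0.0;
    case 0xA: return +1.0;
    default:  return dst;          /* remaining response actions omitted */
    }
}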
Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Perform fix-up of doubleword elements encoded in single precision floating-point format in the first source operand
(the second operand) using a 32-bit, two-level look-up table specified in the corresponding doubleword element of
the second source operand (the third operand) with exception reporting specifier imm8. The elements that are
fixed-up are selected by mask bits of 1 specified in the opmask k1. Mask bits of 0 in the opmask k1 or table
response action of 0000b preserves the corresponding element of the first operand. The fixed-up elements from
the first source operand and the preserved element in the first operand are combined as the final results in the
destination operand (the first operand).
The destination and the first source operands are ZMM/YMM/XMM registers. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-
bit memory location.
The two-level look-up table performs a fix-up of each single precision floating-point input in the first source
operand by decoding the input data encoding into 8 token types. A response table is defined for each token type
that converts the input encoding in the first source operand with one of 16 response actions.
This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source
so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction
sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF
according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x=approx(1/0), yields an incor-
rect result. To deal with this, VFIXUPIMMPS can be used after the N-R reciprocal sequence to set the result to the
correct value (i.e., INF when the input is 0).
If MXCSR.DAZ is not set, denormal input elements in the first source operand are considered as normal inputs and
do not trigger any fixup nor fault reporting.
Imm8 is used to set the required flags reporting. It supports #ZE and #IE fault reporting (see details below).
MXCSR.DAZ is used and refers to zmm2 only (i.e., zmm1 is not considered as zero in case MXCSR.DAZ is set). MXCSR mask bits are ignored and are treated as if all mask bits are set (masked response). If any of the imm8 bits is set and the condition is met for fault reporting, MXCSR.IE or MXCSR.ZE might be updated.
CASE(token_response[3:0]) {
0000: dest[31:0] := dest[31:0]; ; preserve content of DEST
0001: dest[31:0] := tsrc[31:0]; ; pass through src1 normal input value, denormal as zero
0010: dest[31:0] := QNaN(tsrc[31:0]);
0011: dest[31:0] := QNAN_Indefinite;
0100: dest[31:0] := -INF;
0101: dest[31:0] := +INF;
0110: dest[31:0] := tsrc.sign? -INF : +INF;
0111: dest[31:0] := -0;
1000: dest[31:0] := +0;
1001: dest[31:0] := -1;
1010: dest[31:0] := +1;
1011: dest[31:0] := ½;
1100: dest[31:0] := 90.0;
1101: dest[31:0] := PI/2;
1110: dest[31:0] := MAX_FLOAT;
1111: dest[31:0] := -MAX_FLOAT;
} ; end of token_response CASE
VFIXUPIMMPS (EVEX)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN
DEST[i+31:i] := FIXUPIMM_SP(DEST[i+31:i], SRC1[i+31:i], SRC2[31:0], imm8 [7:0])
ELSE
DEST[i+31:i] := FIXUPIMM_SP(DEST[i+31:i], SRC1[i+31:i], SRC2[i+31:i], imm8 [7:0])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0 ; zeroing-masking
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
imm8 fault reporting control (one bit per condition; #ZE = zero-divide exception, #IE = invalid exception):
bit 0: ZERO #ZE; bit 1: ZERO #IE; bit 2: ONE #ZE; bit 3: ONE #IE; bit 4: SNaN #IE; bit 5: -INF #IE; bit 6: -VE (negative) #IE; bit 7: +INF #IE.
Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Perform a fix-up of the low quadword element encoded in double precision floating-point format in the first source
operand (the second operand) using a 32-bit, two-level look-up table specified in the low quadword element of the
second source operand (the third operand) with exception reporting specifier imm8. The element that is fixed-up
is selected by mask bit of 1 specified in the opmask k1. Mask bit of 0 in the opmask k1 or table response action of
0000b preserves the corresponding element of the first operand. The fixed-up element from the first source
operand or the preserved element in the first operand becomes the low quadword element of the destination
operand (the first operand). Bits 127:64 of the destination operand are copied from the corresponding bits of the first source operand. The destination and first source operands are XMM registers. The second source operand can be a XMM register or a 64-bit memory location.
The two-level look-up table performs a fix-up of each double precision floating-point input in the first source
operand by decoding the input data encoding into 8 token types. A response table is defined for each token type
that converts the input encoding in the first source operand with one of 16 response actions.
This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source
so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction
sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF
according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x=approx(1/0), yields an incor-
rect result. To deal with this, VFIXUPIMMPD can be used after the N-R reciprocal sequence to set the result to the
correct value (i.e., INF when the input is 0).
If MXCSR.DAZ is not set, denormal input elements in the first source operand are considered as normal inputs and
do not trigger any fixup nor fault reporting.
Imm8 is used to set the required flags reporting. It supports #ZE and #IE fault reporting (see details below).
MXCSR.DAZ is used and refers to zmm2 only (i.e., zmm1 is not considered as zero in case MXCSR.DAZ is set). MXCSR mask bits are ignored and are treated as if all mask bits are set (masked response). If any of the imm8 bits is set and the condition is met for fault reporting, MXCSR.IE or MXCSR.ZE might be updated.
CASE(token_response[3:0]) {
0000: dest[63:0] := dest[63:0] ; preserve content of DEST
0001: dest[63:0] := tsrc[63:0]; ; pass through src1 normal input value, denormal as zero
0010: dest[63:0] := QNaN(tsrc[63:0]);
0011: dest[63:0] := QNAN_Indefinite;
0100: dest[63:0] := -INF;
0101: dest[63:0] := +INF;
0110: dest[63:0] := tsrc.sign? -INF : +INF;
0111: dest[63:0] := -0;
1000: dest[63:0] := +0;
1001: dest[63:0] := -1;
1010: dest[63:0] := +1;
1011: dest[63:0] := ½;
1100: dest[63:0] := 90.0;
1101: dest[63:0] := PI/2;
1110: dest[63:0] := MAX_FLOAT;
1111: dest[63:0] := -MAX_FLOAT;
} ; end of token_response CASE
imm8 fault reporting control (one bit per condition; #ZE = zero-divide exception, #IE = invalid exception):
bit 0: ZERO #ZE; bit 1: ZERO #IE; bit 2: ONE #ZE; bit 3: ONE #IE; bit 4: SNaN #IE; bit 5: -INF #IE; bit 6: -VE (negative) #IE; bit 7: +INF #IE.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Perform a fix-up of the low doubleword element encoded in single precision floating-point format in the first source
operand (the second operand) using a 32-bit, two-level look-up table specified in the low doubleword element of
the second source operand (the third operand) with exception reporting specifier imm8. The element that is fixed-
up is selected by mask bit of 1 specified in the opmask k1. Mask bit of 0 in the opmask k1 or table response action
of 0000b preserves the corresponding element of the first operand. The fixed-up element from the first source
operand or the preserved element in the first operand becomes the low doubleword element of the destination operand (the first operand). Bits 127:32 of the destination operand are copied from the corresponding bits of the first source operand. The destination and first source operands are XMM registers. The second source operand can be a XMM register or a 32-bit memory location.
The two-level look-up table performs a fix-up of each single precision floating-point input in the first source
operand by decoding the input data encoding into 8 token types. A response table is defined for each token type
that converts the input encoding in the first source operand with one of 16 response actions.
This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source
so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction
sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF
according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x=approx(1/0), yields an incor-
rect result. To deal with this, VFIXUPIMMPD can be used after the N-R reciprocal sequence to set the result to the
correct value (i.e., INF when the input is 0).
If MXCSR.DAZ is not set, denormal input elements in the first source operand are considered as normal inputs and
do not trigger any fixup nor fault reporting.
Imm8 is used to set the required flags reporting. It supports #ZE and #IE fault reporting (see details below).
MXCSR.DAZ is used and refers to zmm2 only (i.e., zmm1 is not considered as zero in case MXCSR.DAZ is set). MXCSR mask bits are ignored and are treated as if all mask bits are set (masked response). If any of the imm8 bits is set and the condition is met for fault reporting, MXCSR.IE or MXCSR.ZE might be updated.
CASE(token_response[3:0]) {
0000: dest[31:0] := dest[31:0]; ; preserve content of DEST
0001: dest[31:0] := tsrc[31:0]; ; pass through src1 normal input value, denormal as zero
0010: dest[31:0] := QNaN(tsrc[31:0]);
0011: dest[31:0] := QNAN_Indefinite;
0100: dest[31:0] := -INF;
0101: dest[31:0] := +INF;
0110: dest[31:0] := tsrc.sign? -INF : +INF;
0111: dest[31:0] := -0;
1000: dest[31:0] := +0;
1001: dest[31:0] := -1;
1010: dest[31:0] := +1;
1011: dest[31:0] := ½;
1100: dest[31:0] := 90.0;
1101: dest[31:0] := PI/2;
1110: dest[31:0] := MAX_FLOAT;
1111: dest[31:0] := -MAX_FLOAT;
} ; end of token_response CASE
imm8 fault reporting control (one bit per condition; #ZE = zero-divide exception, #IE = invalid exception):
bit 0: ZERO #ZE; bit 1: ZERO #IE; bit 2: ONE #ZE; bit 3: ONE #IE; bit 4: SNaN #IE; bit 5: -INF #IE; bit 6: -VE (negative) #IE; bit 7: +INF #IE.
Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”
Description
Performs a set of SIMD multiply-add computation on packed double precision floating-point values using three
source operands and writes the multiply-add results in the destination operand. The destination operand is also the
first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD
register or a memory location.
VFMADD132PD: Multiplies the two, four or eight packed double precision floating-point values from the first source
operand to the two, four or eight packed double precision floating-point values in the third source operand, adds
the infinite precision intermediate result to the two, four or eight packed double precision floating-point values in
the second source operand, performs rounding and stores the resulting two, four or eight packed double precision
floating-point values to the destination operand (first source operand).
VFMADD213PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source operand to the two, four or eight packed double precision floating-point values in the first source operand,
adds the infinite precision intermediate result to the two, four or eight packed double precision floating-point
values in the third source operand, performs rounding and stores the resulting two, four or eight packed double
precision floating-point values to the destination operand (first source operand).
VFMADD231PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source to the two, four or eight packed double precision floating-point values in the third source operand, adds the
infinite precision intermediate result to the two, four or eight packed double precision floating-point values in the
first source operand, performs rounding and stores the resulting two, four or eight packed double precision
floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) is a ZMM register and encoded in
reg_field. The second source operand is a ZMM register and encoded in EVEX.vvvv. The third source operand is a
ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-bit memory location. The
destination operand is conditionally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).
VFMADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] + SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[63:0])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] + DEST[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
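The only difference between the 132, 213, and 231 forms is which operands are multiplied and which is added. Per element this can be summarized with C's fused fma() (which, like the hardware, rounds once), where a is the destination/first source, b the second source, and c the third source. The function names are illustrative, not from the SDM.

#include <math.h>

/* One element of each packed-FMA form (a = DEST/first source, b = second
   source, c = third source). */
static double vfmadd132pd_elem(double a, double b, double c) { return fma(a, c, b); }
static double vfmadd213pd_elem(double a, double b, double c) { return fma(b, a, c); }
static double vfmadd231pd_elem(double a, double b, double c) { return fma(b, c, a); }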
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Description
Performs a set of SIMD multiply-add computation on packed single precision floating-point values using three
source operands and writes the multiply-add results in the destination operand. The destination operand is also the
first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD
register or a memory location.
VFMADD132PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the first
source operand to the four, eight or sixteen packed single precision floating-point values in the third source
operand, adds the infinite precision intermediate result to the four, eight or sixteen packed single precision floating-
point values in the second source operand, performs rounding and stores the resulting four, eight or sixteen packed
single precision floating-point values to the destination operand (first source operand).
VFMADD213PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source operand to the four, eight or sixteen packed single precision floating-point values in the first source
operand, adds the infinite precision intermediate result to the four, eight or sixteen packed single precision floating-
point values in the third source operand, performs rounding and stores the resulting four, eight or sixteen
packed single precision floating-point values to the destination operand (first source operand).
VFMADD231PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source operand to the four, eight or sixteen packed single precision floating-point values in the third source
operand, adds the infinite precision intermediate result to the four, eight or sixteen packed single precision floating-
point values in the first source operand, performs rounding and stores the resulting four, eight or sixteen packed
single precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) is a ZMM register and encoded in
reg_field. The second source operand is a ZMM register and encoded in EVEX.vvvv. The third source operand is a
ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 32-bit memory location. The
destination operand is conditionally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).
VFMADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] + SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[31:0])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] + DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD multiply-add computation on the low double precision floating-point values using three source
operands and writes the multiply-add result in the destination operand. The destination operand is also the first
source operand. The first and second operand are XMM registers. The third source operand can be an XMM register
or a 64-bit memory location.
VFMADD132SD: Multiplies the low double precision floating-point value from the first source operand to the low
double precision floating-point value in the third source operand, adds the infinite precision intermediate result to
the low double precision floating-point values in the second source operand, performs rounding and stores the
resulting double precision floating-point value to the destination operand (first source operand).
VFMADD213SD: Multiplies the low double precision floating-point value from the second source operand to the low
double precision floating-point value in the first source operand, adds the infinite precision intermediate result to
the low double precision floating-point value in the third source operand, performs rounding and stores the
resulting double precision floating-point value to the destination operand (first source operand).
VFMADD231SD: Multiplies the low double precision floating-point value from the second source to the low double
precision floating-point value in the third source operand, adds the infinite precision intermediate result to the low
double precision floating-point value in the first source operand, performs rounding and stores the resulting double
precision floating-point value to the destination operand (first source operand).
Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD multiply-add computation on single precision floating-point values using three source operands
and writes the multiply-add results in the destination operand. The destination operand is also the first source
operand. The first and second operands are XMM registers. The third source operand can be a XMM register or a
32-bit memory location.
VFMADD132SS: Multiplies the low single precision floating-point value from the first source operand to the low
single precision floating-point value in the third source operand, adds the infinite precision intermediate result to
the low single precision floating-point value in the second source operand, performs rounding and stores the
resulting single precision floating-point value to the destination operand (first source operand).
VFMADD213SS: Multiplies the low single precision floating-point value from the second source operand to the low
single precision floating-point value in the first source operand, adds the infinite precision intermediate result to
the low single precision floating-point value in the third source operand, performs rounding and stores the resulting
single precision floating-point value to the destination operand (first source operand).
VFMADD231SS: Multiplies the low single precision floating-point value from the second source operand to the low
single precision floating-point value in the third source operand, adds the infinite precision intermediate result to
the low single precision floating-point value in the first source operand, performs rounding and stores the resulting
single precision floating-point value to the destination operand (first source operand).
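As a further illustrative sketch (not part of the manual text), the EVEX-encoded scalar form can be reached with an explicit rounding override through _mm_fmadd_round_ss, assuming an AVX512F-capable compiler (e.g., gcc -mavx512f):

/* Sketch: low element = a*b + c, rounded to nearest with exceptions
   suppressed, analogous to the {er} embedded-rounding syntax. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128 a = _mm_set_ss(1.5f);
    __m128 b = _mm_set_ss(2.0f);
    __m128 c = _mm_set_ss(0.25f);
    __m128 r = _mm_fmadd_round_ss(a, b, c,
                                  _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    printf("%f\n", _mm_cvtss_f32(r));       /* 1.5*2.0 + 0.25 = 3.25 */
    return 0;
}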
Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VFMADDSUB132PD: Multiplies the two, four, or eight packed double precision floating-point values from the first
source operand to the two, four, or eight packed double precision floating-point values in the third source operand.
From the infinite precision intermediate result, adds the odd double precision floating-point elements and subtracts
the even double precision floating-point values in the second source operand, performs rounding and stores the
resulting two, four, or eight packed double precision floating-point values to the destination operand (first source
operand).
VFMADDSUB213PD: Multiplies the two, four, or eight packed double precision floating-point values from the second
source operand to the two, four, or eight packed double precision floating-point values in the first source operand.
From the infinite precision intermediate result, adds the odd double precision floating-point elements and subtracts
the even double precision floating-point values in the third source operand, performs rounding and stores the
resulting two, four, or eight packed double precision floating-point values to the destination operand (first source
operand).
VFMADDSUB231PD: Multiplies the two, four, or eight packed double precision floating-point values from the second
source operand to the two, four, or eight packed double precision floating-point values in the third source operand.
From the infinite precision intermediate result, adds the odd double precision floating-point elements and subtracts
the even double precision floating-point values in the first source operand, performs rounding and stores the
resulting two, four, or eight packed double precision floating-point values to the destination operand (first source
operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
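For illustration only, the add-odd/subtract-even behavior described above corresponds to the _mm256_fmaddsub_pd intrinsic; a minimal sketch follows (assuming a compiler with FMA and AVX support, e.g., gcc -mfma).

/* Sketch: even-indexed elements compute a*b - c, odd-indexed elements a*b + c. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);   /* elements 3..0 */
    __m256d b = _mm256_set1_pd(10.0);
    __m256d c = _mm256_set1_pd(1.0);
    __m256d r = _mm256_fmaddsub_pd(a, b, c);
    double out[4];
    _mm256_storeu_pd(out, r);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);   /* 9 21 29 41 */
    return 0;
}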
Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).
VFMADDSUB132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] - SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] + SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADDSUB213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i])
ELSE DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADDSUB213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[63:0])
ELSE
VFMADDSUB231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i])
ELSE DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADDSUB231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] - DEST[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] + DEST[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Operation
VFMADDSUB132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *j is even*:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * SRC3.fp16[j] - SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * SRC3.fp16[j] + SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VFMADDSUB132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *j is even*:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * t3 - SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * t3 + SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
DEST[MAXVL-1:VL] := 0
VFMADDSUB213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*DEST.fp16[j] - SRC3.fp16[j])
ELSE
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*DEST.fp16[j] + SRC3.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VFMADDSUB213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * DEST.fp16[j] - t3)
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * DEST.fp16[j] + t3)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VFMADDSUB231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * SRC3.fp16[j] - DEST.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * SRC3.fp16[j] + DEST.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VFMADDSUB231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * t3 - DEST.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * t3 + DEST.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VFMADDSUB132PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the first
source operand to the corresponding packed single precision floating-point values in the third source operand.
From the infinite precision intermediate result, adds the odd single precision floating-point elements and subtracts
the even single precision floating-point values in the second source operand, performs rounding and stores the
resulting packed single precision floating-point values to the destination operand (first source operand).
VFMADDSUB213PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the
second source operand to the corresponding packed single precision floating-point values in the first source
operand. From the infinite precision intermediate result, adds the odd single precision floating-point elements and
subtracts the even single precision floating-point values in the third source operand, performs rounding and stores
the resulting packed single precision floating-point values to the destination operand (first source operand).
VFMADDSUB231PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the
second source operand to the corresponding packed single precision floating-point values in the third source
operand. From the infinite precision intermediate result, adds the odd single precision floating-point elements and
subtracts the even single precision floating-point values in the first source operand, performs rounding and stores
the resulting packed single precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
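As a cross-check of the element alternation just described, the following scalar C reference model may be helpful; the function name fmaddsub_ps_ref is hypothetical and not an Intel API, and it performs two ordinary roundings per element rather than the instruction's single fused rounding.

#include <stdio.h>

/* Hypothetical scalar reference: odd-indexed elements get a*b + c,
   even-indexed elements get a*b - c. */
static void fmaddsub_ps_ref(const float *a, const float *b, const float *c,
                            float *dst, int n)
{
    for (int j = 0; j < n; j++)
        dst[j] = (j & 1) ? (a[j] * b[j] + c[j]) : (a[j] * b[j] - c[j]);
}

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 10, 10, 10}, c[4] = {1, 1, 1, 1}, r[4];
    fmaddsub_ps_ref(a, b, c, r, 4);
    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);   /* 9 21 29 41 */
    return 0;
}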
Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).
VFMADDSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+31:i] :=
RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i])
ELSE DEST[i+31:i] :=
RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADDSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] - SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] + SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADDSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i])
ELSE DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADDSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
VFMADDSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
ELSE DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMADDSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] - DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] + DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a set of SIMD multiply-subtract computations on packed double precision floating-point values using three
source operands and writes the multiply-subtract results in the destination operand. The destination operand is
also the first source operand. The second operand must be a SIMD register. The third source operand can be a
SIMD register or a memory location.
VFMSUB132PD: Multiplies the two, four or eight packed double precision floating-point values from the first source
operand to the two, four or eight packed double precision floating-point values in the third source operand. From
the infinite precision intermediate result, subtracts the two, four or eight packed double precision floating-point
values in the second source operand, performs rounding and stores the resulting two, four or eight packed double
precision floating-point values to the destination operand (first source operand).
VFMSUB213PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source operand to the two, four or eight packed double precision floating-point values in the first source operand.
From the infinite precision intermediate result, subtracts the two, four or eight packed double precision floating-
point values in the third source operand, performs rounding and stores the resulting two, four or eight packed
double precision floating-point values to the destination operand (first source operand).
VFMSUB231PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source to the two, four or eight packed double precision floating-point values in the third source operand. From the
infinite precision intermediate result, subtracts the two, four or eight packed double precision floating-point values
in the first source operand, performs rounding and stores the resulting two, four or eight packed double precision
floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
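For illustration only, the write-mask behavior described above can be exercised with the _mm512_mask_fmsub_pd intrinsic (assuming an AVX512F-capable compiler, e.g., gcc -mavx512f); elements whose mask bit is 0 keep the value already in the first operand, matching merging-masking.

/* Sketch: r = a*b - c where k is set; other elements are copied from a. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m512d a = _mm512_set1_pd(2.0);
    __m512d b = _mm512_set1_pd(3.0);
    __m512d c = _mm512_set1_pd(1.0);
    __mmask8 k = 0x0F;                              /* update elements 0..3 */
    __m512d r = _mm512_mask_fmsub_pd(a, k, b, c);
    double out[8];
    _mm512_storeu_pd(out, r);
    printf("%g %g\n", out[0], out[7]);              /* 5 (updated), 2 (kept) */
    return 0;
}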
Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).
VFMSUB132PD DEST, SRC2, SRC3 (EVEX Encoded Versions, When SRC3 Operand is a Register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMSUB132PD DEST, SRC2, SRC3 (EVEX Encoded Versions, When SRC3 Operand is a Memory Source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] - SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMSUB213PD DEST, SRC2, SRC3 (EVEX Encoded Versions, When SRC3 Operand is a Memory Source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[63:0])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMSUB231PD DEST, SRC2, SRC3 (EVEX Encoded Versions, When SRC3 Operand is a Memory Source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[63:0] - DEST[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
Description
Performs a set of SIMD multiply-subtract computations on packed single precision floating-point values using three
source operands and writes the multiply-subtract results in the destination operand. The destination operand is
also the first source operand. The second operand must be a SIMD register. The third source operand can be a
SIMD register or a memory location.
VFMSUB132PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the first
source operand to the four, eight or sixteen packed single precision floating-point values in the third source
operand. From the infinite precision intermediate result, subtracts the four, eight or sixteen packed single precision
floating-point values in the second source operand, performs rounding and stores the resulting four, eight or
sixteen packed single precision floating-point values to the destination operand (first source operand).
VFMSUB213PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source operand to the four, eight or sixteen packed single precision floating-point values in the first source
operand. From the infinite precision intermediate result, subtracts the four, eight or sixteen packed single precision
floating-point values in the third source operand, performs rounding and stores the resulting four, eight or sixteen
packed single precision floating-point values to the destination operand (first source operand).
VFMSUB231PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source to the four, eight or sixteen packed single precision floating-point values in the third source operand. From
the infinite precision intermediate result, subtracts the four, eight or sixteen packed single precision floating-point
values in the first source operand, performs rounding and stores the resulting four, eight or sixteen packed single
precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
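A minimal illustrative sketch of the zeroing-masked form using _mm512_maskz_fmsub_ps follows (assuming gcc or clang with -mavx512f); elements whose mask bit is 0 are zeroed, matching zeroing-masking in the pseudocode below.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m512 a = _mm512_set1_ps(2.0f);
    __m512 b = _mm512_set1_ps(3.0f);
    __m512 c = _mm512_set1_ps(1.0f);
    __mmask16 k = 0x00FF;                           /* elements 0..7 active */
    __m512 r = _mm512_maskz_fmsub_ps(k, a, b, c);   /* a*b - c, else 0.0 */
    float out[16];
    _mm512_storeu_ps(out, r);
    printf("%g %g\n", out[0], out[15]);             /* 5, 0 */
    return 0;
}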
Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).
VFMSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] - SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[31:0])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] - DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD multiply-subtract computation on the low packed double precision floating-point values using
three source operands and writes the multiply-subtract result in the destination operand. The destination operand
is also the first source operand. The second operand must be a XMM register. The third source operand can be a
XMM register or a 64-bit memory location.
VFMSUB132SD: Multiplies the low packed double precision floating-point value from the first source operand to the
low packed double precision floating-point value in the third source operand. From the infinite precision interme-
diate result, subtracts the low packed double precision floating-point values in the second source operand,
performs rounding and stores the resulting packed double precision floating-point value to the destination operand
(first source operand).
VFMSUB213SD: Multiplies the low packed double precision floating-point value from the second source operand to
the low packed double precision floating-point value in the first source operand. From the infinite precision inter-
mediate result, subtracts the low packed double precision floating-point value in the third source operand,
performs rounding and stores the resulting packed double precision floating-point value to the destination operand
(first source operand).
VFMSUB231SD: Multiplies the low packed double precision floating-point value from the second source to the low
packed double precision floating-point value in the third source operand. From the infinite precision intermediate
result, subtracts the low packed double precision floating-point value in the first source operand, performs rounding
and stores the resulting packed double precision floating-point value to the destination operand (first source operand).
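For illustration only, the scalar multiply-subtract described above corresponds to the _mm_fmsub_sd intrinsic; a minimal sketch follows (assuming a compiler with FMA support, e.g., gcc -mfma).

/* Sketch: dst[0] = a[0]*b[0] - c[0]; bits 127:64 of the result come from a. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_sd(2.0);
    __m128d b = _mm_set_sd(3.0);
    __m128d c = _mm_set_sd(4.0);
    __m128d r = _mm_fmsub_sd(a, b, c);      /* 2.0*3.0 - 4.0 = 2.0 */
    printf("%f\n", _mm_cvtsd_f64(r));
    return 0;
}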
Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD multiply-subtract computation on the low packed single precision floating-point values using
three source operands and writes the multiply-subtract result in the destination operand. The destination operand
is also the first source operand. The second operand must be a XMM register. The third source operand can be a
XMM register or a 32-bit memory location.
VFMSUB132SS: Multiplies the low packed single precision floating-point value from the first source operand to the
low packed single precision floating-point value in the third source operand. From the infinite precision interme-
diate result, subtracts the low packed single precision floating-point values in the second source operand, performs
rounding and stores the resulting packed single precision floating-point value to the destination operand (first
source operand).
VFMSUB213SS: Multiplies the low packed single precision floating-point value from the second source operand to
the low packed single precision floating-point value in the first source operand. From the infinite precision interme-
diate result, subtracts the low packed single precision floating-point value in the third source operand, performs
rounding and stores the resulting packed single precision floating-point value to the destination operand (first
source operand).
VFMSUB231SS: Multiplies the low packed single precision floating-point value from the second source to the low
packed single precision floating-point value in the third source operand. From the infinite precision intermediate
result, subtracts the low packed single precision floating-point value in the first source operand, performs rounding
and stores the resulting packed single precision floating-point value to the destination operand (first source operand).
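A minimal illustrative sketch of the single precision counterpart using _mm_fmsub_ss follows (assuming a compiler with FMA support, e.g., gcc -mfma).

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128 a = _mm_set_ss(2.0f);
    __m128 b = _mm_set_ss(3.0f);
    __m128 c = _mm_set_ss(4.0f);
    __m128 r = _mm_fmsub_ss(a, b, c);       /* low element: 2.0*3.0 - 4.0 = 2.0 */
    printf("%f\n", _mm_cvtss_f32(r));
    return 0;
}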
Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VFMSUBADD132PD: Multiplies the two, four, or eight packed double precision floating-point values from the first
source operand to the two, four, or eight packed double precision floating-point values in the third source operand.
From the infinite precision intermediate result, subtracts the odd double precision floating-point elements and adds
the even double precision floating-point values in the second source operand, performs rounding and stores the
resulting two, four, or eight packed double precision floating-point values to the destination operand (first source
operand).
VFMSUBADD213PD: Multiplies the two, four, or eight packed double precision floating-point values from the second
source operand to the two, four, or eight packed double precision floating-point values in the first source operand.
From the infinite precision intermediate result, subtracts the odd double precision floating-point elements and adds
the even double precision floating-point values in the third source operand, performs rounding and stores the
resulting two, four, or eight packed double precision floating-point values to the destination operand (first source
operand).
VFMSUBADD231PD: Multiplies the two, four, or eight packed double precision floating-point values from the second
source operand to the two, four, or eight packed double precision floating-point values in the third source operand.
From the infinite precision intermediate result, subtracts the odd double precision floating-point elements and adds
the even double precision floating-point values in the first source operand, performs rounding and stores the
resulting two, four, or eight packed double precision floating-point values to the destination operand (first source
operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
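For illustration only, the subtract-odd/add-even behavior described above corresponds to the _mm256_fmsubadd_pd intrinsic; a minimal sketch follows (assuming a compiler with FMA and AVX support, e.g., gcc -mfma).

/* Sketch: even-indexed elements compute a*b + c, odd-indexed elements a*b - c. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);   /* elements 3..0 */
    __m256d b = _mm256_set1_pd(10.0);
    __m256d c = _mm256_set1_pd(1.0);
    __m256d r = _mm256_fmsubadd_pd(a, b, c);
    double out[4];
    _mm256_storeu_pd(out, r);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);   /* 11 19 31 39 */
    return 0;
}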
Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).
VFMSUBADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
VFMSUBADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] + SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] + SRC2[i+63:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[63:0] - SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(DEST[i+63:i]*SRC3[i+63:i] - SRC2[i+63:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMSUBADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i])
ELSE DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*DEST[i+63:i] - SRC3[i+63:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMSUBADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[63:0])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*DEST[i+63:i] + SRC3[i+63:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
VFMSUBADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] + DEST[i+63:i])
ELSE DEST[i+63:i] :=
RoundFPControl(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMSUBADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(SRC2[i+63:i]*SRC3[i+63:i] - DEST[i+63:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Operation
VFMSUBADD132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *j is even*:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j]*SRC3.fp16[j] + SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j]*SRC3.fp16[j] - SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VFMSUBADD132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *j is even*:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * t3 + SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * t3 - SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
DEST[MAXVL-1:VL] := 0
VFMSUBADD213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*DEST.fp16[j] + SRC3.fp16[j])
ELSE
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*DEST.fp16[j] - SRC3.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VFMSUBADD213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * DEST.fp16[j] + t3 )
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * DEST.fp16[j] - t3 )
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VFMSUBADD231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a register
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*SRC3.fp16[j] + DEST.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*SRC3.fp16[j] - DEST.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VFMSUBADD231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *j is even*:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * t3 + DEST.fp16[j] )
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * t3 - DEST.fp16[j] )
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VFMSUBADD132PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the first
source operand to the corresponding packed single precision floating-point values in the third source operand.
From the infinite precision intermediate result, subtracts the odd single precision floating-point elements and adds
the even single precision floating-point values in the second source operand, performs rounding and stores the
resulting packed single precision floating-point values to the destination operand (first source operand).
VFMSUBADD213PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the
second source operand to the corresponding packed single precision floating-point values in the first source
operand. From the infinite precision intermediate result, subtracts the odd single precision floating-point elements
and adds the even single precision floating-point values in the third source operand, performs rounding and stores
the resulting packed single precision floating-point values to the destination operand (first source operand).
VFMSUBADD231PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the
second source operand to the corresponding packed single precision floating-point values in the third source
operand. From the infinite precision intermediate result, subtracts the odd single precision floating-point elements
and adds the even single precision floating-point values in the first source operand, performs rounding and stores
the resulting packed single precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
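A minimal illustrative sketch using _mm512_fmsubadd_ps follows (assuming gcc or clang with -mavx512f); even-indexed elements receive the addition and odd-indexed elements the subtraction, as described above.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m512 a = _mm512_set1_ps(2.0f);
    __m512 b = _mm512_set1_ps(3.0f);
    __m512 c = _mm512_set1_ps(1.0f);
    __m512 r = _mm512_fmsubadd_ps(a, b, c);
    float out[16];
    _mm512_storeu_ps(out, r);
    printf("%g %g\n", out[0], out[1]);      /* 7 (even: 2*3+1), 5 (odd: 2*3-1) */
    return 0;
}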
Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).
VFMSUBADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+31:i] :=
RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
ELSE DEST[i+31:i] :=
RoundFPControl(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMSUBADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] + SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] + SRC2[i+31:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[31:0] - SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(DEST[i+31:i]*SRC3[i+31:i] - SRC2[i+31:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMSUBADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] + SRC3[i+31:i])
ELSE DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*DEST[i+31:i] - SRC3[i+31:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMSUBADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
VFMSUBADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
ELSE DEST[i+31:i] :=
RoundFPControl(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFMSUBADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF j *is even*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] + DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] + DEST[i+31:i])
FI;
ELSE
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[31:0] - DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(SRC2[i+31:i]*SRC3[i+31:i] - DEST[i+31:i])
FI;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VFNMADD132PD/VFNMADD213PD/VFNMADD231PD—Fused Negative Multiply-Add of Packed Double Precision Floating-Point Values
Opcode/Instruction: EVEX.512.66.0F38.W1 9C /r
VFNMADD132PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Description: Multiply packed double precision floating-point values from zmm1 and zmm3/m512/m64bcst, negate the
multiplication result and add to zmm2 and put result in zmm1.
Opcode/Instruction: EVEX.512.66.0F38.W1 AC /r
VFNMADD213PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Description: Multiply packed double precision floating-point values from zmm1 and zmm2, negate the multiplication
result and add to zmm3/m512/m64bcst and put result in zmm1.
Opcode/Instruction: EVEX.512.66.0F38.W1 BC /r
VFNMADD231PD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1).
Description: Multiply packed double precision floating-point values from zmm2 and zmm3/m512/m64bcst, negate the
multiplication result and add to zmm1 and put result in zmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VFNMADD132PD: Multiplies the two, four or eight packed double precision floating-point values from the first
source operand to the two, four or eight packed double precision floating-point values in the third source operand,
adds the negated infinite precision intermediate result to the two, four or eight packed double precision floating-
point values in the second source operand, performs rounding and stores the resulting two, four or eight packed
double precision floating-point values to the destination operand (first source operand).
VFNMADD213PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source operand to the two, four or eight packed double precision floating-point values in the first source operand,
adds the negated infinite precision intermediate result to the two, four or eight packed double precision floating-
point values in the third source operand, performs rounding and stores the resulting two, four or eight packed
double precision floating-point values to the destination operand (first source operand).
VFNMADD231PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source to the two, four or eight packed double precision floating-point values in the third source operand, adds the
negated infinite precision intermediate result to the two, four or eight packed double precision floating-point values
in the first source operand, performs rounding and stores the resulting two, four or eight packed double precision
floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM register. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).
VFNMADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(-(DEST[i+63:i]*SRC3[i+63:i]) + SRC2[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMADD132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(DEST[i+63:i]*SRC3[63:0]) + SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(DEST[i+63:i]*SRC3[i+63:i]) + SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(-(SRC2[i+63:i]*DEST[i+63:i]) + SRC3[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMADD213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*DEST[i+63:i]) + SRC3[63:0])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*DEST[i+63:i]) + SRC3[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] :=
RoundFPControl(-(SRC2[i+63:i]*SRC3[i+63:i]) + DEST[i+63:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMADD231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*SRC3[63:0]) + DEST[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*SRC3[i+63:i]) + DEST[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VFNMADDxxxPD __m512d _mm512_fnmadd_pd(__m512d a, __m512d b, __m512d c);
VFNMADDxxxPD __m512d _mm512_fnmadd_round_pd(__m512d a, __m512d b, __m512d c, int r);
VFNMADDxxxPD __m512d _mm512_mask_fnmadd_pd(__m512d a, __mmask8 k, __m512d b, __m512d c);
VFNMADDxxxPD __m512d _mm512_maskz_fnmadd_pd(__mmask8 k, __m512d a, __m512d b, __m512d c);
VFNMADDxxxPD __m512d _mm512_mask3_fnmadd_pd(__m512d a, __m512d b, __m512d c, __mmask8 k);
VFNMADDxxxPD __m512d _mm512_mask_fnmadd_round_pd(__m512d a, __mmask8 k, __m512d b, __m512d c, int r);
VFNMADDxxxPD __m512d _mm512_maskz_fnmadd_round_pd(__mmask8 k, __m512d a, __m512d b, __m512d c, int r);
VFNMADDxxxPD __m512d _mm512_mask3_fnmadd_round_pd(__m512d a, __m512d b, __m512d c, __mmask8 k, int r);
VFNMADDxxxPD __m256d _mm256_mask_fnmadd_pd(__m256d a, __mmask8 k, __m256d b, __m256d c);
VFNMADDxxxPD __m256d _mm256_maskz_fnmadd_pd(__mmask8 k, __m256d a, __m256d b, __m256d c);
VFNMADDxxxPD __m256d _mm256_mask3_fnmadd_pd(__m256d a, __m256d b, __m256d c, __mmask8 k);
VFNMADDxxxPD __m128d _mm_mask_fnmadd_pd(__m128d a, __mmask8 k, __m128d b, __m128d c);
VFNMADDxxxPD __m128d _mm_maskz_fnmadd_pd(__mmask8 k, __m128d a, __m128d b, __m128d c);
VFNMADDxxxPD __m128d _mm_mask3_fnmadd_pd(__m128d a, __m128d b, __m128d c, __mmask8 k);
VFNMADDxxxPD __m128d _mm_fnmadd_pd (__m128d a, __m128d b, __m128d c);
VFNMADDxxxPD __m256d _mm256_fnmadd_pd (__m256d a, __m256d b, __m256d c);
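As an informal usage sketch of the 512-bit intrinsics listed above (the function names and build environment with AVX-512 enabled are assumptions of the example):

    #include <immintrin.h>

    /* Unmasked form: -(a*b) + c in every lane of eight packed doubles. */
    __m512d fnmadd_all(__m512d a, __m512d b, __m512d c)
    {
        return _mm512_fnmadd_pd(a, b, c);
    }

    /* Merging-masked form: lanes whose k bit is 0 keep the value already in a. */
    __m512d fnmadd_masked(__m512d a, __mmask8 k, __m512d b, __m512d c)
    {
        return _mm512_mask_fnmadd_pd(a, k, b, c);
    }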
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VF[,N]MADD[132,213,231]PH—Fused Multiply-Add of Packed FP16 Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.128.66.MAP6.W0 98 /r VFMADD132PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from xmm1 and xmm3/m128/m16bcst, add to xmm2, and store the result in xmm1.
EVEX.256.66.MAP6.W0 98 /r VFMADD132PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from ymm1 and ymm3/m256/m16bcst, add to ymm2, and store the result in ymm1.
EVEX.512.66.MAP6.W0 98 /r VFMADD132PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply packed FP16 values from zmm1 and zmm3/m512/m16bcst, add to zmm2, and store the result in zmm1.
EVEX.128.66.MAP6.W0 A8 /r VFMADD213PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from xmm1 and xmm2, add to xmm3/m128/m16bcst, and store the result in xmm1.
EVEX.256.66.MAP6.W0 A8 /r VFMADD213PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from ymm1 and ymm2, add to ymm3/m256/m16bcst, and store the result in ymm1.
EVEX.512.66.MAP6.W0 A8 /r VFMADD213PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply packed FP16 values from zmm1 and zmm2, add to zmm3/m512/m16bcst, and store the result in zmm1.
EVEX.128.66.MAP6.W0 B8 /r VFMADD231PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from xmm2 and xmm3/m128/m16bcst, add to xmm1, and store the result in xmm1.
EVEX.256.66.MAP6.W0 B8 /r VFMADD231PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from ymm2 and ymm3/m256/m16bcst, add to ymm1, and store the result in ymm1.
EVEX.512.66.MAP6.W0 B8 /r VFMADD231PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply packed FP16 values from zmm2 and zmm3/m512/m16bcst, add to zmm1, and store the result in zmm1.
EVEX.128.66.MAP6.W0 9C /r VFNMADD132PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from xmm1 and xmm3/m128/m16bcst, and negate the value. Add this value to xmm2, and store the result in xmm1.
EVEX.256.66.MAP6.W0 9C /r VFNMADD132PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from ymm1 and ymm3/m256/m16bcst, and negate the value. Add this value to ymm2, and store the result in ymm1.
EVEX.512.66.MAP6.W0 9C /r VFNMADD132PH zmm1{k1}{z}, zmm2, zmm3/m512/m16bcst {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply packed FP16 values from zmm1 and zmm3/m512/m16bcst, and negate the value. Add this value to zmm2, and store the result in zmm1.
EVEX.128.66.MAP6.W0 AC /r VFNMADD213PH xmm1{k1}{z}, xmm2, xmm3/m128/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from xmm1 and xmm2, and negate the value. Add this value to xmm3/m128/m16bcst, and store the result in xmm1.
EVEX.256.66.MAP6.W0 AC /r VFNMADD213PH ymm1{k1}{z}, ymm2, ymm3/m256/m16bcst | A | V/V | (AVX512-FP16 AND AVX512VL) OR AVX10.1¹ | Multiply packed FP16 values from ymm1 and ymm2, and negate the value. Add this value to ymm3/m256/m16bcst, and store the result in ymm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a packed multiply-add or negated multiply-add computation on FP16 values using three
source operands and writes the results in the destination operand. The destination operand is also the first source
operand. The “N” (negated) forms of this instruction add the negated infinite precision intermediate product to the
corresponding remaining operand. The notations “132”, “213”, and “231” indicate the use of the operands in ±A * B
+ C, where each digit corresponds to the operand number, with the destination being operand 1; see Table 5-5.
The destination elements are updated according to the writemask.
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-DEST.fp16[j]*SRC3.fp16[j] + SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j]*SRC3.fp16[j] + SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VF[,N]MADD132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-DEST.fp16[j] * t3 + SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * t3 + SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j]*DEST.fp16[j] + SRC3.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*DEST.fp16[j] + SRC3.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VF[,N]MADD213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j] * DEST.fp16[j] + t3 )
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * DEST.fp16[j] + t3 )
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j]*SRC3.fp16[j] + DEST.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*SRC3.fp16[j] + DEST.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VF[,N]MADD231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j] * t3 + DEST.fp16[j] )
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * t3 + DEST.fp16[j] )
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
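The 132/213/231 operand orderings used in the loops above can be summarized with an informal C model (float stands in for FP16, per-instruction rounding control is ignored, and the helper name is illustrative):

    #include <math.h>

    /* form 132 -> ±(dest*src3) + src2
       form 213 -> ±(src2*dest) + src3
       form 231 -> ±(src2*src3) + dest
       negate != 0 selects the VFNMADD (negated-product) variants. */
    static float fmadd_form(int form, int negate, float dest, float src2, float src3)
    {
        float a, b, c;
        switch (form) {
        case 132: a = dest; b = src3; c = src2; break;
        case 213: a = src2; b = dest; c = src3; break;
        default:  a = src2; b = src3; c = dest; break;   /* 231 */
        }
        return fmaf(negate ? -a : a, b, c);   /* single rounding, like the fused op */
    }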
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VFNMADD132PS/VFNMADD213PS/VFNMADD231PS—Fused Negative Multiply-Add of Packed Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.512.66.0F38.W0 9C /r VFNMADD132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed single precision floating-point values from zmm1 and zmm3/m512/m32bcst, negate the multiplication result and add to zmm2 and put result in zmm1.
EVEX.512.66.0F38.W0 AC /r VFNMADD213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply packed single precision floating-point values from zmm1 and zmm2, negate the multiplication result and add to zmm3/m512/m32bcst and put result in zmm1.
EVEX.512.66.0F38.W0 BC /r VFNMADD231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply packed single precision floating-point values from zmm2 and zmm3/m512/m32bcst, negate the multiplication result and add to zmm1 and put result in zmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VFNMADD132PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the first
source operand to the four, eight or sixteen packed single precision floating-point values in the third source
operand, adds the negated infinite precision intermediate result to the four, eight or sixteen packed single precision
floating-point values in the second source operand, performs rounding and stores the resulting four, eight or
sixteen packed single precision floating-point values to the destination operand (first source operand).
VFNMADD213PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source operand to the four, eight or sixteen packed single precision floating-point values in the first source
operand, adds the negated infinite precision intermediate result to the four, eight or sixteen packed single precision
floating-point values in the third source operand, performs rounding and stores the resulting four, eight or
sixteen packed single precision floating-point values to the destination operand (first source operand).
VFNMADD231PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source operand to the four, eight or sixteen packed single precision floating-point values in the third source
operand, adds the negated infinite precision intermediate result to the four, eight or sixteen packed single precision
floating-point values in the first source operand, performs rounding and stores the resulting four, eight or sixteen
packed single precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM registers. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).
VFNMADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(-(DEST[i+31:i]*SRC3[i+31:i]) + SRC2[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMADD132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(DEST[i+31:i]*SRC3[31:0]) + SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(DEST[i+31:i]*SRC3[i+31:i]) + SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(-(SRC2[i+31:i]*DEST[i+31:i]) + SRC3[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMADD213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*DEST[i+31:i]) + SRC3[31:0])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*DEST[i+31:i]) + SRC3[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(-(SRC2[i+31:i]*SRC3[i+31:i]) + DEST[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMADD231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*SRC3[31:0]) + DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*SRC3[i+31:i]) + DEST[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VFNMADDxxxPS __m512 _mm512_fnmadd_ps(__m512 a, __m512 b, __m512 c);
VFNMADDxxxPS __m512 _mm512_fnmadd_round_ps(__m512 a, __m512 b, __m512 c, int r);
VFNMADDxxxPS __m512 _mm512_mask_fnmadd_ps(__m512 a, __mmask16 k, __m512 b, __m512 c);
VFNMADDxxxPS __m512 _mm512_maskz_fnmadd_ps(__mmask16 k, __m512 a, __m512 b, __m512 c);
VFNMADDxxxPS __m512 _mm512_mask3_fnmadd_ps(__m512 a, __m512 b, __m512 c, __mmask16 k);
VFNMADDxxxPS __m512 _mm512_mask_fnmadd_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r);
VFNMADDxxxPS __m512 _mm512_maskz_fnmadd_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r);
VFNMADDxxxPS __m512 _mm512_mask3_fnmadd_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r);
VFNMADDxxxPS __m256 _mm256_mask_fnmadd_ps(__m256 a, __mmask8 k, __m256 b, __m256 c);
VFNMADDxxxPS __m256 _mm256_maskz_fnmadd_ps(__mmask8 k, __m256 a, __m256 b, __m256 c);
VFNMADDxxxPS __m256 _mm256_mask3_fnmadd_ps(__m256 a, __m256 b, __m256 c, __mmask8 k);
VFNMADDxxxPS __m128 _mm_mask_fnmadd_ps(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFNMADDxxxPS __m128 _mm_maskz_fnmadd_ps(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFNMADDxxxPS __m128 _mm_mask3_fnmadd_ps(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFNMADDxxxPS __m128 _mm_fnmadd_ps (__m128 a, __m128 b, __m128 c);
VFNMADDxxxPS __m256 _mm256_fnmadd_ps (__m256 a, __m256 b, __m256 c);
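An informal usage sketch of the zeroing-masked form listed above (function name and build settings with AVX-512 enabled are assumptions of the example):

    #include <immintrin.h>

    /* -(a*b) + c on sixteen packed floats; lanes whose k bit is 0 are set to
       0.0f rather than preserved (zeroing-masking). */
    __m512 fnmadd_ps_zeroed(__mmask16 k, __m512 a, __m512 b, __m512 c)
    {
        return _mm512_maskz_fnmadd_ps(k, a, b, c);
    }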
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VFNMADD132SD/VFNMADD213SD/VFNMADD231SD—Fused Negative Multiply-Add of Scalar
Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
VEX.LIG.66.0F38.W1 9D /r VFNMADD132SD xmm1, xmm2, xmm3/m64 | A | V/V | FMA | Multiply scalar double precision floating-point value from xmm1 and xmm3/mem, negate the multiplication result and add to xmm2 and put result in xmm1.
VEX.LIG.66.0F38.W1 AD /r VFNMADD213SD xmm1, xmm2, xmm3/m64 | A | V/V | FMA | Multiply scalar double precision floating-point value from xmm1 and xmm2, negate the multiplication result and add to xmm3/mem and put result in xmm1.
VEX.LIG.66.0F38.W1 BD /r VFNMADD231SD xmm1, xmm2, xmm3/m64 | A | V/V | FMA | Multiply scalar double precision floating-point value from xmm2 and xmm3/mem, negate the multiplication result and add to xmm1 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 9D /r VFNMADD132SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply scalar double precision floating-point value from xmm1 and xmm3/m64, negate the multiplication result and add to xmm2 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 AD /r VFNMADD213SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply scalar double precision floating-point value from xmm1 and xmm2, negate the multiplication result and add to xmm3/m64 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 BD /r VFNMADD231SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply scalar double precision floating-point value from xmm2 and xmm3/m64, negate the multiplication result and add to xmm1 and put result in xmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VFNMADD132SD: Multiplies the low packed double precision floating-point value from the first source operand to
the low packed double precision floating-point value in the third source operand, adds the negated infinite precision
intermediate result to the low packed double precision floating-point values in the second source operand,
performs rounding and stores the resulting packed double precision floating-point value to the destination operand
(first source operand).
VFNMADD213SD: Multiplies the low packed double precision floating-point value from the second source operand
to the low packed double precision floating-point value in the first source operand, adds the negated infinite preci-
sion intermediate result to the low packed double precision floating-point value in the third source operand,
performs rounding and stores the resulting packed double precision floating-point value to the destination operand
(first source operand).
VFNMADD231SD: Multiplies the low packed double precision floating-point value from the second source to the low
packed double precision floating-point value in the third source operand, adds the negated infinite precision inter-
mediate result to the low packed double precision floating-point value in the first source operand, performs
rounding and stores the resulting packed double precision floating-point value to the destination operand (first
source operand).
VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field.
The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field.
Bits 127:64 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination is updated according to the writemask.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column.
Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).
VFNMADD231SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(-(SRC2[63:0]*SRC3[63:0]) + DEST[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0
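Informally, the low-element computation above is the single-rounded value of -(src2*src3) + dest, which C99 code can express with fma() (scalar model only; masking and the preserved upper bits are not shown, and the function name is illustrative):

    #include <math.h>

    /* Scalar double model of the low-element computation above:
       result = -(src2*src3) + dest, evaluated with a single rounding. */
    double vfnmadd231sd_low(double dest, double src2, double src3)
    {
        return fma(-src2, src3, dest);
    }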
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
VF[,N]MADD[132,213,231]SH—Fused Multiply-Add of Scalar FP16 Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.LLIG.66.MAP6.W0 99 /r VFMADD132SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply FP16 values from xmm1 and xmm3/m16, add to xmm2, and store the result in xmm1.
EVEX.LLIG.66.MAP6.W0 A9 /r VFMADD213SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply FP16 values from xmm1 and xmm2, add to xmm3/m16, and store the result in xmm1.
EVEX.LLIG.66.MAP6.W0 B9 /r VFMADD231SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply FP16 values from xmm2 and xmm3/m16, add to xmm1, and store the result in xmm1.
EVEX.LLIG.66.MAP6.W0 9D /r VFNMADD132SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply FP16 values from xmm1 and xmm3/m16, and negate the value. Add this value to xmm2, and store the result in xmm1.
EVEX.LLIG.66.MAP6.W0 AD /r VFNMADD213SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply FP16 values from xmm1 and xmm2, and negate the value. Add this value to xmm3/m16, and store the result in xmm1.
EVEX.LLIG.66.MAP6.W0 BD /r VFNMADD231SH xmm1{k1}{z}, xmm2, xmm3/m16 {er} | A | V/V | AVX512-FP16 OR AVX10.1¹ | Multiply FP16 values from xmm2 and xmm3/m16, and negate the value. Add this value to xmm1, and store the result in xmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a scalar multiply-add or negated multiply-add computation on the low FP16 values using three source
operands and writes the result in the destination operand. The destination operand is also the first source operand.
The “N” (negated) forms of this instruction add the negated infinite precision intermediate product to the corre-
sponding remaining operand. The notations “132”, “213”, and “231” indicate the use of the operands in ±A * B + C,
where each digit corresponds to the operand number, with the destination being operand 1; see Table 5-6.
Bits 127:16 of the destination operand are preserved. Bits MAXVL-1:128 of the destination operand are zeroed. The
low FP16 element of the destination is updated according to the writemask.
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VFNMADD132SS: Multiplies the low packed single precision floating-point value from the first source operand to
the low packed single precision floating-point value in the third source operand, adds the negated infinite precision
intermediate result to the low packed single precision floating-point value in the second source operand, performs
rounding and stores the resulting packed single precision floating-point value to the destination operand (first
source operand).
VFNMADD213SS: Multiplies the low packed single precision floating-point value from the second source operand to
the low packed single precision floating-point value in the first source operand, adds the negated infinite precision
intermediate result to the low packed single precision floating-point value in the third source operand, performs
rounding and stores the resulting packed single precision floating-point value to the destination operand (first
source operand).
VFNMADD231SS: Multiplies the low packed single precision floating-point value from the second source operand to
the low packed single precision floating-point value in the third source operand, adds the negated infinite precision
intermediate result to the low packed single precision floating-point value in the first source operand, performs
rounding and stores the resulting packed single precision floating-point value to the destination operand (first
source operand).
VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field.
The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field.
Bits 127:32 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination is updated according to the writemask.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column.
Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no
rounding).
VFNMADD231SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(-(SRC2[31:0]*SRC3[31:0]) + DEST[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0
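An informal scalar model of the masked low-element update above (float stands in for the low single precision element; the upper elements and rounding-mode control are not modeled, and the names are illustrative):

    #include <math.h>

    /* Low-element update: -(src2*src3) + dest with one rounding when the
       mask bit is set; otherwise zeroing- or merging-masking applies. */
    float vfnmadd231ss_low(float dest, float src2, float src3,
                           int k1_bit0, int zeroing)
    {
        if (k1_bit0)
            return fmaf(-src2, src3, dest);
        return zeroing ? 0.0f : dest;
    }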
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
VFNMSUB132PD/VFNMSUB213PD/VFNMSUB231PD—Fused Negative Multiply-Subtract of
Packed Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
VEX.128.66.0F38.W1 9E /r VFNMSUB132PD xmm1, xmm2, xmm3/m128 | A | V/V | FMA | Multiply packed double precision floating-point values from xmm1 and xmm3/mem, negate the multiplication result and subtract xmm2 and put result in xmm1.
VEX.128.66.0F38.W1 AE /r VFNMSUB213PD xmm1, xmm2, xmm3/m128 | A | V/V | FMA | Multiply packed double precision floating-point values from xmm1 and xmm2, negate the multiplication result and subtract xmm3/mem and put result in xmm1.
VEX.128.66.0F38.W1 BE /r VFNMSUB231PD xmm1, xmm2, xmm3/m128 | A | V/V | FMA | Multiply packed double precision floating-point values from xmm2 and xmm3/mem, negate the multiplication result and subtract xmm1 and put result in xmm1.
VEX.256.66.0F38.W1 9E /r VFNMSUB132PD ymm1, ymm2, ymm3/m256 | A | V/V | FMA | Multiply packed double precision floating-point values from ymm1 and ymm3/mem, negate the multiplication result and subtract ymm2 and put result in ymm1.
VEX.256.66.0F38.W1 AE /r VFNMSUB213PD ymm1, ymm2, ymm3/m256 | A | V/V | FMA | Multiply packed double precision floating-point values from ymm1 and ymm2, negate the multiplication result and subtract ymm3/mem and put result in ymm1.
VEX.256.66.0F38.W1 BE /r VFNMSUB231PD ymm1, ymm2, ymm3/m256 | A | V/V | FMA | Multiply packed double precision floating-point values from ymm2 and ymm3/mem, negate the multiplication result and subtract ymm1 and put result in ymm1.
EVEX.128.66.0F38.W1 9E /r VFNMSUB132PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed double precision floating-point values from xmm1 and xmm3/m128/m64bcst, negate the multiplication result and subtract xmm2 and put result in xmm1.
EVEX.128.66.0F38.W1 AE /r VFNMSUB213PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed double precision floating-point values from xmm1 and xmm2, negate the multiplication result and subtract xmm3/m128/m64bcst and put result in xmm1.
EVEX.128.66.0F38.W1 BE /r VFNMSUB231PD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed double precision floating-point values from xmm2 and xmm3/m128/m64bcst, negate the multiplication result and subtract xmm1 and put result in xmm1.
EVEX.256.66.0F38.W1 9E /r VFNMSUB132PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed double precision floating-point values from ymm1 and ymm3/m256/m64bcst, negate the multiplication result and subtract ymm2 and put result in ymm1.
EVEX.256.66.0F38.W1 AE /r VFNMSUB213PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed double precision floating-point values from ymm1 and ymm2, negate the multiplication result and subtract ymm3/m256/m64bcst and put result in ymm1.
EVEX.256.66.0F38.W1 BE /r VFNMSUB231PD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst | B | V/V | (AVX512VL AND AVX512F) OR AVX10.1¹ | Multiply packed double precision floating-point values from ymm2 and ymm3/m256/m64bcst, negate the multiplication result and subtract ymm1 and put result in ymm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VFNMSUB132PD: Multiplies the two, four or eight packed double precision floating-point values from the first
source operand to the two, four or eight packed double precision floating-point values in the third source operand.
From negated infinite precision intermediate results, subtracts the two, four or eight packed double precision
floating-point values in the second source operand, performs rounding and stores the resulting two, four or eight
packed double precision floating-point values to the destination operand (first source operand).
VFNMSUB213PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source operand to the two, four or eight packed double precision floating-point values in the first source operand.
From negated infinite precision intermediate results, subtracts the two, four or eight packed double precision
floating-point values in the third source operand, performs rounding and stores the resulting two, four or eight
packed double precision floating-point values to the destination operand (first source operand).
VFNMSUB231PD: Multiplies the two, four or eight packed double precision floating-point values from the second
source to the two, four or eight packed double precision floating-point values in the third source operand. From
negated infinite precision intermediate results, subtracts the two, four or eight packed double precision floating-
point values in the first source operand, performs rounding and stores the resulting two, four or eight packed
double precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM registers. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
VFNMSUB132PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(DEST[i+63:i]*SRC3[63:0]) - SRC2[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(DEST[i+63:i]*SRC3[i+63:i]) - SRC2[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMSUB213PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*DEST[i+63:i]) - SRC3[63:0])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*DEST[i+63:i]) - SRC3[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMSUB231PD DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*SRC3[63:0]) - DEST[i+63:i])
ELSE
DEST[i+63:i] :=
RoundFPControl_MXCSR(-(SRC2[i+63:i]*SRC3[i+63:i]) - DEST[i+63:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
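Informally, each element above computes -(a*b) - c with one rounding; a C99 sketch using fma() makes the identity explicit (element-level model only, illustrative name, no masking or rounding-mode control shown):

    #include <math.h>

    /* -(a*b) - c equals (-a)*b + (-c); fma() rounds that sum exactly once,
       matching the fused negative multiply-subtract of one element. */
    double fnmsub_elem(double a, double b, double c)
    {
        return fma(-a, b, -c);
    }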
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a packed multiply-subtract or a negated multiply-subtract computation on FP16 values
using three source operands and writes the results in the destination operand. The destination operand is also the
first source operand. The “N” (negated) forms of this instruction subtract the remaining operand from the negated
infinite precision intermediate product. The notations “132”, “213”, and “231” indicate the use of the operands in ±A
* B − C, where each digit corresponds to the operand number, with the destination being operand 1; see Table 5-7.
The destination elements are updated according to the writemask.
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-DEST.fp16[j]*SRC3.fp16[j] - SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j]*SRC3.fp16[j] - SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VF[,N]MSUB132PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-DEST.fp16[j] * t3 - SRC2.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(DEST.fp16[j] * t3 - SRC2.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j]*DEST.fp16[j] - SRC3.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*DEST.fp16[j] - SRC3.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VF[,N]MSUB213PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j] * DEST.fp16[j] - t3 )
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * DEST.fp16[j] - t3 )
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j]*SRC3.fp16[j] - DEST.fp16[j])
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j]*SRC3.fp16[j] - DEST.fp16[j])
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VF[,N]MSUB231PH DEST, SRC2, SRC3 (EVEX encoded versions) when src3 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
t3 := SRC3.fp16[0]
ELSE:
t3 := SRC3.fp16[j]
IF *negative form*:
DEST.fp16[j] := RoundFPControl(-SRC2.fp16[j] * t3 - DEST.fp16[j] )
ELSE:
DEST.fp16[j] := RoundFPControl(SRC2.fp16[j] * t3 - DEST.fp16[j] )
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
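The memory-source loops above, including the EVEX.b broadcast case, can be modeled informally in C (float stands in for FP16, rounding-mode control is ignored, the 231 form is shown, and all names are illustrative):

    #include <math.h>

    /* When EVEX.b = 1 the single scalar loaded from memory (src3_bcst) is
       reused for every element; otherwise src3[j] is read per element. */
    static void fnmsub231ph_mem_model(float *dest, const float *src2,
                                      const float *src3, float src3_bcst,
                                      int kl, unsigned k1, int evex_b, int zeroing)
    {
        for (int j = 0; j < kl; j++) {
            if (k1 & (1u << j)) {
                float t3 = evex_b ? src3_bcst : src3[j];
                dest[j] = fmaf(-src2[j], t3, -dest[j]);  /* -(src2*t3) - dest */
            } else if (zeroing) {
                dest[j] = 0.0f;          /* zeroing-masking */
            }                            /* else merging-masking: unchanged */
        }
    }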
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VFNMSUB132PS/VFNMSUB213PS/VFNMSUB231PS—Fused Negative Multiply-Subtract of Packed Single Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
EVEX.512.66.0F38.W0 9E /r VFNMSUB132PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply packed single-precision floating-point values from zmm1 and zmm3/m512/m32bcst, negate the multiplication result and subtract zmm2 and put result in zmm1.
EVEX.512.66.0F38.W0 AE /r VFNMSUB213PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply packed single-precision floating-point values from zmm1 and zmm2, negate the multiplication result and subtract zmm3/m512/m32bcst and put result in zmm1.
EVEX.512.66.0F38.W0 BE /r VFNMSUB231PS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply packed single-precision floating-point values from zmm2 and zmm3/m512/m32bcst, negate the multiplication result and subtract zmm1 and put result in zmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VFNMSUB132PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the first
source operand to the four, eight or sixteen packed single precision floating-point values in the third source
operand. From negated infinite precision intermediate results, subtracts the four, eight or sixteen packed single
precision floating-point values in the second source operand, performs rounding and stores the resulting four, eight
or sixteen packed single precision floating-point values to the destination operand (first source operand).
VFNMSUB213PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source operand to the four, eight or sixteen packed single precision floating-point values in the first source
operand. From negated infinite precision intermediate results, subtracts the four, eight or sixteen packed single
precision floating-point values in the third source operand, performs rounding and stores the resulting four, eight
or sixteen packed single precision floating-point values to the destination operand (first source operand).
VFNMSUB231PS: Multiplies the four, eight or sixteen packed single precision floating-point values from the second
source to the four, eight or sixteen packed single precision floating-point values in the third source operand. From
negated infinite precision intermediate results, subtracts the four, eight or sixteen packed single precision floating-
point values in the first source operand, performs rounding and stores the resulting four, eight or sixteen packed
single precision floating-point values to the destination operand (first source operand).
EVEX encoded versions: The destination operand (also first source operand) and the second source operand are
ZMM/YMM/XMM registers. The third source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory loca-
tion or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is condition-
ally updated with write mask k1.
VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in
reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a
YMM register or a 256-bit memory location and encoded in rm_field.
VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in
reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a
XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination
register are zeroed.
Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).
VFNMSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(-(DEST[i+31:i]*SRC3[i+31:i]) - SRC2[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMSUB132PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(DEST[i+31:i]*SRC3[31:0]) - SRC2[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(DEST[i+31:i]*SRC3[i+31:i]) - SRC2[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(-(SRC2[i+31:i]*DEST[i+31:i]) - SRC3[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMSUB213PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*DEST[i+31:i]) - SRC3[31:0])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*DEST[i+31:i]) - SRC3[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a register)
(KL, VL) = (4, 128), (8, 256), (16, 512)
IF (VL = 512) AND (EVEX.b = 1)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
RoundFPControl(-(SRC2[i+31:i]*SRC3[i+31:i]) - DEST[i+31:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VFNMSUB231PS DEST, SRC2, SRC3 (EVEX encoded version, when src3 operand is a memory source)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
IF (EVEX.b = 1)
THEN
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*SRC3[31:0]) - DEST[i+31:i])
ELSE
DEST[i+31:i] :=
RoundFPControl_MXCSR(-(SRC2[i+31:i]*SRC3[i+31:i]) - DEST[i+31:i])
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VFNMSUBxxxPS __m512 _mm512_fnmsub_ps(__m512 a, __m512 b, __m512 c);
VFNMSUBxxxPS __m512 _mm512_fnmsub_round_ps(__m512 a, __m512 b, __m512 c, int r);
VFNMSUBxxxPS __m512 _mm512_mask_fnmsub_ps(__m512 a, __mmask16 k, __m512 b, __m512 c);
VFNMSUBxxxPS __m512 _mm512_maskz_fnmsub_ps(__mmask16 k, __m512 a, __m512 b, __m512 c);
VFNMSUBxxxPS __m512 _mm512_mask3_fnmsub_ps(__m512 a, __m512 b, __m512 c, __mmask16 k);
VFNMSUBxxxPS __m512 _mm512_mask_fnmsub_round_ps(__m512 a, __mmask16 k, __m512 b, __m512 c, int r);
VFNMSUBxxxPS __m512 _mm512_maskz_fnmsub_round_ps(__mmask16 k, __m512 a, __m512 b, __m512 c, int r);
VFNMSUBxxxPS __m512 _mm512_mask3_fnmsub_round_ps(__m512 a, __m512 b, __m512 c, __mmask16 k, int r);
VFNMSUBxxxPS __m256 _mm256_mask_fnmsub_ps(__m256 a, __mmask8 k, __m256 b, __m256 c);
VFNMSUBxxxPS __m256 _mm256_maskz_fnmsub_ps(__mmask8 k, __m256 a, __m256 b, __m256 c);
VFNMSUBxxxPS __m256 _mm256_mask3_fnmsub_ps(__m256 a, __m256 b, __m256 c, __mmask8 k);
VFNMSUBxxxPS __m128 _mm_mask_fnmsub_ps(__m128 a, __mmask8 k, __m128 b, __m128 c);
VFNMSUBxxxPS __m128 _mm_maskz_fnmsub_ps(__mmask8 k, __m128 a, __m128 b, __m128 c);
VFNMSUBxxxPS __m128 _mm_mask3_fnmsub_ps(__m128 a, __m128 b, __m128 c, __mmask8 k);
VFNMSUBxxxPS __m128 _mm_fnmsub_ps (__m128 a, __m128 b, __m128 c);
VFNMSUBxxxPS __m256 _mm256_fnmsub_ps (__m256 a, __m256 b, __m256 c);
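An informal usage sketch of the embedded-rounding form listed above (function name and build settings with AVX-512 enabled are assumptions of the example):

    #include <immintrin.h>

    /* -(a*b) - c on sixteen packed floats with round-to-nearest-even selected
       per instruction (EVEX.RC) instead of the current MXCSR setting. */
    __m512 fnmsub_ps_rne(__m512 a, __m512 b, __m512 c)
    {
        return _mm512_fnmsub_round_ps(a, b, c,
                                      _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    }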
Other Exceptions
VEX-encoded instructions, see Table 2-19, “Type 2 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VFNMSUB132SD/VFNMSUB213SD/VFNMSUB231SD—Fused Negative Multiply-Subtract of
Scalar Double Precision Floating-Point Values
Opcode/Instruction | Op/En | 64/32 Bit Mode Support | CPUID Feature Flag | Description
VEX.LIG.66.0F38.W1 9F /r VFNMSUB132SD xmm1, xmm2, xmm3/m64 | A | V/V | FMA | Multiply scalar double precision floating-point value from xmm1 and xmm3/mem, negate the multiplication result and subtract xmm2 and put result in xmm1.
VEX.LIG.66.0F38.W1 AF /r VFNMSUB213SD xmm1, xmm2, xmm3/m64 | A | V/V | FMA | Multiply scalar double precision floating-point value from xmm1 and xmm2, negate the multiplication result and subtract xmm3/mem and put result in xmm1.
VEX.LIG.66.0F38.W1 BF /r VFNMSUB231SD xmm1, xmm2, xmm3/m64 | A | V/V | FMA | Multiply scalar double precision floating-point value from xmm2 and xmm3/mem, negate the multiplication result and subtract xmm1 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 9F /r VFNMSUB132SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply scalar double precision floating-point value from xmm1 and xmm3/m64, negate the multiplication result and subtract xmm2 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 AF /r VFNMSUB213SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply scalar double precision floating-point value from xmm1 and xmm2, negate the multiplication result and subtract xmm3/m64 and put result in xmm1.
EVEX.LLIG.66.0F38.W1 BF /r VFNMSUB231SD xmm1 {k1}{z}, xmm2, xmm3/m64{er} | B | V/V | AVX512F OR AVX10.1¹ | Multiply scalar double precision floating-point value from xmm2 and xmm3/m64, negate the multiplication result and subtract xmm1 and put result in xmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VFNMSUB132SD: Multiplies the low packed double precision floating-point value from the first source operand to
the low packed double precision floating-point value in the third source operand. From negated infinite precision
intermediate result, subtracts the low double precision floating-point value in the second source operand, performs
rounding and stores the resulting packed double precision floating-point value to the destination operand (first
source operand).
VFNMSUB213SD: Multiplies the low packed double precision floating-point value from the second source operand
to the low packed double precision floating-point value in the first source operand. From negated infinite precision
intermediate result, subtracts the low double precision floating-point value in the third source operand, performs
rounding and stores the resulting packed double precision floating-point value to the destination operand (first
source operand).
VFNMSUB231SD: Multiplies the low packed double precision floating-point value from the second source to the low
packed double precision floating-point value in the third source operand. From negated infinite precision interme-
diate result, subtracts the low double precision floating-point value in the first source operand, performs rounding
and stores the resulting packed double precision floating-point value to the destination operand (first source
operand).
VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field.
The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field.
Bits 127:64 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.
EVEX encoded version: The low quadword element of the destination is updated according to the writemask.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction
column.
Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).
VFNMSUB231SD DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[63:0] := RoundFPControl(-(SRC2[63:0]*SRC3[63:0]) - DEST[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[63:0] := 0
FI;
FI;
DEST[127:64] := DEST[127:64]
DEST[MAXVL-1:128] := 0
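Because the negated product and the subtraction are performed with a single rounding, the scalar result equals the negation of the C99 fma() of the same operands under the default round-to-nearest mode. A minimal sketch (illustrative only; assumes an FMA-capable processor and FMA intrinsics enabled):

#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.5, b = -2.25, c = 0.125;
    __m128d r = _mm_fnmsub_sd(_mm_set_sd(a), _mm_set_sd(b), _mm_set_sd(c));
    double hw = _mm_cvtsd_f64(r);   // -(a*b) - c with one rounding
    double sw = -fma(a, b, c);      // -(a*b + c), also a single rounding
    printf("%g %g\n", hw, sw);      // both print 3.25
    return 0;
}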
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
VF[,N]MSUB[132,213,231]SH—Fused Multiply-Subtract of Scalar FP16 Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.LLIG.66.MAP6.W0 9B /r A V/V AVX512-FP16 Multiply FP16 values from xmm1 and
VFMSUB132SH xmm1{k1}{z}, xmm2, OR AVX10.11 xmm3/m16, subtract xmm2, and store the result
xmm3/m16 {er} in xmm1 subject to writemask k1.
EVEX.LLIG.66.MAP6.W0 AB /r A V/V AVX512-FP16 Multiply FP16 values from xmm1 and xmm2,
VFMSUB213SH xmm1{k1}{z}, xmm2, OR AVX10.11 subtract xmm3/m16, and store the result in
xmm3/m16 {er} xmm1 subject to writemask k1.
EVEX.LLIG.66.MAP6.W0 BB /r A V/V AVX512-FP16 Multiply FP16 values from xmm2 and
VFMSUB231SH xmm1{k1}{z}, xmm2, OR AVX10.11 xmm3/m16, subtract xmm1, and store the result
xmm3/m16 {er} in xmm1 subject to writemask k1.
EVEX.LLIG.66.MAP6.W0 9F /r A V/V AVX512-FP16 Multiply FP16 values from xmm1 and
VFNMSUB132SH xmm1{k1}{z}, OR AVX10.11 xmm3/m16, and negate the value. Subtract
xmm2, xmm3/m16 {er} xmm2 from this value, and store the result in
xmm1 subject to writemask k1.
EVEX.LLIG.66.MAP6.W0 AF /r A V/V AVX512-FP16 Multiply FP16 values from xmm1 and xmm2, and
VFNMSUB213SH xmm1{k1}{z}, OR AVX10.11 negate the value. Subtract xmm3/m16 from this
xmm2, xmm3/m16 {er} value, and store the result in xmm1 subject to
writemask k1.
EVEX.LLIG.66.MAP6.W0 BF /r A V/V AVX512-FP16 Multiply FP16 values from xmm2 and
VFNMSUB231SH xmm1{k1}{z}, OR AVX10.11 xmm3/m16, and negate the value. Subtract
xmm2, xmm3/m16 {er} xmm1 from this value, and store the result in
xmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a scalar multiply-subtract or negated multiply-subtract computation on the low FP16
values using three source operands and writes the result in the destination operand. The destination operand is also
the first source operand. The “N” (negated) forms of this instruction subtract the remaining operand from the
negated infinite precision intermediate product. The notations “132”, “213”, and “231” indicate the use of the oper-
ands in ±A * B − C, where each digit corresponds to the operand number, with the destination being operand 1;
see Table 5-8.
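For orientation (this restatement is not taken from the manual's tables), with op1 the destination register (xmm1), op2 the second operand (xmm2), and op3 the third operand (xmm3/m16), the digit forms correspond to:
VFMSUB132SH: dest := (op1 * op3) - op2 // "132": A = op1, B = op3, C = op2
VFMSUB213SH: dest := (op2 * op1) - op3 // "213": A = op2, B = op1, C = op3
VFMSUB231SH: dest := (op2 * op3) - op1 // "231": A = op2, B = op3, C = op1
The VFNMSUB forms negate the product first, e.g., VFNMSUB231SH: dest := -(op2 * op3) - op1.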
Bits 127:16 of the destination operand are preserved. Bits MAXVL-1:128 of the destination operand are zeroed. The
low FP16 element of the destination is updated according to the writemask.
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VFNMSUB132SS: Multiplies the low packed single precision floating-point value from the first source operand to
the low packed single precision floating-point value in the third source operand. From negated infinite precision
intermediate result, subtracts the low single precision floating-point value in the second source operand, performs rounding
and stores the resulting packed single precision floating-point value to the destination operand (first source
operand).
VFNMSUB213SS: Multiplies the low packed single precision floating-point value from the second source operand to
the low packed single precision floating-point value in the first source operand. From negated infinite precision
intermediate result, subtracts the low single precision floating-point value in the third source operand, performs rounding
and stores the resulting packed single precision floating-point value to the destination operand (first source
operand).
VFNMSUB231SS: Multiplies the low packed single precision floating-point value from the second source to the low
packed single precision floating-point value in the third source operand. From negated infinite precision interme-
diate result, subtracts the low single precision floating-point value in the first source operand, performs rounding and stores
the resulting packed single precision floating-point value to the destination operand (first source operand).
VEX.128 and EVEX encoded version: The destination operand (also first source operand) is encoded in reg_field.
The second source operand is encoded in VEX.vvvv/EVEX.vvvv. The third source operand is encoded in rm_field.
Bits 127:32 of the destination are unchanged. Bits MAXVL-1:128 of the destination register are zeroed.
EVEX encoded version: The low doubleword element of the destination is updated according to the writemask.
Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the
opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations
involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction
column.
Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no
rounding).
VFNMSUB231SS DEST, SRC2, SRC3 (EVEX encoded version)
IF (EVEX.b = 1) and SRC3 *is a register*
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
IF k1[0] or *no writemask*
THEN DEST[31:0] := RoundFPControl(-(SRC2[31:0]*SRC3[31:0]) - DEST[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[31:0] := 0
FI;
FI;
DEST[127:32] := DEST[127:32]
DEST[MAXVL-1:128] := 0
Other Exceptions
VEX-encoded instructions, see Table 2-20, “Type 3 Class Exception Conditions.”
VFPCLASSPD—Tests Types of Packed Float64 Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.66.0F3A.W1 66 /r ib A V/V (AVX512VL AND Tests the input for the following categories: NaN, +0, -
VFPCLASSPD k2 {k1}, AVX512DQ) OR 0, +Infinity, -Infinity, denormal, finite negative. The
xmm2/m128/m64bcst, imm8 AVX10.11 immediate field provides a mask bit for each of these
category tests. The masked test results are OR-ed
together to form a mask result.
EVEX.256.66.0F3A.W1 66 /r ib A V/V (AVX512VL AND Tests the input for the following categories: NaN, +0, -
VFPCLASSPD k2 {k1}, AVX512DQ) OR 0, +Infinity, -Infinity, denormal, finite negative. The
ymm2/m256/m64bcst, imm8 AVX10.11 immediate field provides a mask bit for each of these
category tests. The masked test results are OR-ed
together to form a mask result.
EVEX.512.66.0F3A.W1 66 /r ib A V/V AVX512DQ Tests the input for the following categories: NaN, +0, -
VFPCLASSPD k2 {k1}, OR AVX10.11 0, +Infinity, -Infinity, denormal, finite negative. The
zmm2/m512/m64bcst, imm8 immediate field provides a mask bit for each of these
category tests. The masked test results are OR-ed
together to form a mask result.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
The FPCLASSPD instruction checks the packed double precision floating-point values for special categories, speci-
fied by the set bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point values that the
input data element is classified against. The classified results of all specified categories of an input value are ORed
together to form the final boolean result for the input element. The result of each element is written to the corre-
sponding bit in a mask register k2 according to the writemask k1. Bits [MAX_KL-1:8/4/2] of the destination are
cleared.
The classification categories specified by imm8 are shown in Figure 5-13. The classification test for each category
is listed in Table 5-11.
Figure 5-13. Imm8 Byte Specifier of Special Case Floating-Point Values for VFPCLASSPD/SD/PS/SS
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcasted from a 64-bit memory location.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
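A usage sketch through the corresponding AVX-512DQ intrinsic (illustrative only; the imm8 category bit assignments used below, bit 3 = +INF and bit 4 = -INF, should be verified against Figure 5-13):

#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    __m512d v = _mm512_set_pd(1.0, -2.5, 0.0, INFINITY, -INFINITY, 3.0, 4.0, 5.0);
    __mmask8 m = _mm512_fpclass_pd_mask(v, 0x18);   // 0x18 selects +Infinity and -Infinity
    printf("mask = 0x%02x\n", (unsigned)m);          // bits set for the two infinite elements (here 0x18)
    return 0;
}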
Operation
CheckFPClassDP (tsrc[63:0], imm8[7:0]){
//* Start checking the source operand for special type *//
NegNum := tsrc[63];
IF (tsrc[62:52]=07FFh) Then ExpAllOnes := 1; FI;
IF (tsrc[62:52]=0h) Then ExpAllZeros := 1;
IF (ExpAllZeros AND MXCSR.DAZ) Then
MantAllZeros := 1;
ELSIF (tsrc[51:0]=0h) Then
MantAllZeros := 1;
FI;
ZeroNumber := ExpAllZeros AND MantAllZeros
SignalingBit := tsrc[51];
SIMD Floating-Point Exceptions
None.
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction checks the packed FP16 values in the source operand for special categories, specified by the set
bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point values that the input data element
is classified against; see Table 5-12 for the categories. The classified results of all specified categories of an input
value are ORed together to form the final boolean result for the input element. The result is written to the corre-
sponding bits in the destination mask register according to the writemask.
FOR i := 0 to KL-1:
IF k2[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := SRC.fp16[0]
ELSE:
tsrc := SRC.fp16[i]
DEST.bit[i] := check_fp_class_fp16(tsrc, imm8)
ELSE:
DEST.bit[i] := 0
DEST[MAXKL-1:kl] := 0
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
The FPCLASSPS instruction checks the packed single precision floating-point values for special categories, specified
by the set bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point values that the input
data element is classified against. The classified results of all specified categories of an input value are ORed
together to form the final boolean result for the input element. The result of each element is written to the corre-
sponding bit in a mask register k2 according to the writemask k1. Bits [MAX_KL-1:16/8/4] of the destination are
cleared.
The classification categories specified by imm8 are shown in Figure 5-13. The classification test for each category
is listed in Table 5-11.
The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector
broadcasted from a 32-bit memory location.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation
CheckFPClassSP (tsrc[31:0], imm8[7:0]){
//* Start checking the source operand for special type *//
NegNum := tsrc[31];
IF (tsrc[30:23]=0FFh) Then ExpAllOnes := 1; FI;
IF (tsrc[30:23]=0h) Then ExpAllZeros := 1;
IF (ExpAllZeros AND MXCSR.DAZ) Then
MantAllZeros := 1;
ELSIF (tsrc[22:0]=0h) Then
SIMD Floating-Point Exceptions
None.
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
The FPCLASSSD instruction checks the low double precision floating-point value in the source operand for special
categories, specified by the set bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point
values that the input data element is classified against. The classified results of all specified categories of an input
value are ORed together to form the final boolean result for the input element. The result is written to the low bit
in a mask register k2 according to the writemask k1. Bits MAX_KL-1: 1 of the destination are cleared.
The classification categories specified by imm8 are shown in Figure 5-13. The classification test for each category
is listed in Table 5-11.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation
CheckFPClassDP (tsrc[63:0], imm8[7:0]){
NegNum := tsrc[63];
IF (tsrc[62:52]=07FFh) Then ExpAllOnes := 1; FI;
IF (tsrc[62:52]=0h) Then ExpAllZeros := 1;
IF (ExpAllZeros AND MXCSR.DAZ) Then
MantAllZeros := 1;
ELSIF (tsrc[51:0]=0h) Then
MantAllZeros := 1;
FI;
ZeroNumber := ExpAllZeros AND MantAllZeros
SignalingBit := tsrc[51];
SIMD Floating-Point Exceptions
None.
Other Exceptions
See Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction checks the low FP16 value in the source operand for special categories, specified by the set bits in
the imm8 byte. Each set bit in imm8 specifies a category of floating-point values that the input data element is clas-
sified against; see Table 5-12 for the categories. The classified results of all specified categories of an input value
are ORed together to form the final boolean result for the input element. The result is written to the low bit in the
destination mask register according to the writemask. The other bits in the destination mask register are zeroed.
Operation
VFPCLASSSH dest{k2}, src, imm8
IF k2[0] or *no writemask*:
DEST.bit[0] := check_fp_class_fp16(src.fp16[0], imm8) // see VFPCLASSPH
ELSE:
DEST.bit[0] := 0
DEST[MAXKL-1:1] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-60, “Type E10 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
The FPCLASSSS instruction checks the low single precision floating-point value in the source operand for special
categories, specified by the set bits in the imm8 byte. Each set bit in imm8 specifies a category of floating-point
values that the input data element is classified against. The classified results of all specified categories of an input
value are ORed together to form the final boolean result for the input element. The result is written to the low bit
in a mask register k2 according to the writemask k1. Bits MAX_KL-1: 1 of the destination are cleared.
The classification categories specified by imm8 are shown in Figure 5-13. The classification test for each category
is listed in Table 5-11.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation
CheckFPClassSP (tsrc[31:0], imm8[7:0]){
//* Start checking the source operand for special type *//
NegNum := tsrc[31];
IF (tsrc[30:23]=0FFh) Then ExpAllOnes := 1; FI;
IF (tsrc[30:23]=0h) Then ExpAllZeros := 1;
IF (ExpAllZeros AND MXCSR.DAZ) Then
MantAllZeros := 1;
ELSIF (tsrc[22:0]=0h) Then
MantAllZeros := 1;
FI;
ZeroNumber := ExpAllZeros AND MantAllZeros
SignalingBit := tsrc[22];
SIMD Floating-Point Exceptions
None.
Other Exceptions
See Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
A set of single precision/double precision floating-point memory locations pointed to by base address BASE_ADDR
and index vector V_INDEX with scale SCALE are gathered. The result is written into a vector register. The elements
are specified via the VSIB (i.e., the index register is a vector register, holding packed indices). Elements will only
be loaded if their corresponding mask bit is one. If an element’s mask bit is not set, the corresponding element of
the destination register is left unchanged. The entire mask register will be set to zero by this instruction unless it
triggers an exception.
This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception
is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination
register and the mask register (k1) are partially updated; those elements that have been gathered are placed into
the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already
gathered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruc-
tion breakpoint is not re-triggered when the instruction is continued.
If the data element size is less than the index element size, the higher part of the destination register and the mask
register do not correspond to any elements being gathered. This instruction sets those higher parts to zero. It may
update these unused elements to one or both of those registers even if the instruction triggers an exception, and
even if the instruction triggers the exception before gathering any elements.
Note that:
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel-
64 memory-ordering model.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all
elements closer to the LSB of the destination zmm will be completed (and non-faulting). Individual elements
closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered
in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to
the left of a faulting one may be gathered before the fault is delivered. A given implementation of this
instruction is repeatable - given the same input values and architectural state, the same set of elements to the
left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
Note that the presence of VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if
ModRM.rm is different than 100b.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit
mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are
ignored.
The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX. The instruction
will #UD fault if the k0 mask register is specified.
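A usage sketch of the dword-index gather through the AVX-512F intrinsics (illustrative only; note that the AVX-512 gather intrinsics take the index vector first, and the scale must be a compile-time constant of 1, 2, 4, or 8):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float table[64];
    for (int i = 0; i < 64; i++) table[i] = (float)i;

    // Gather table[0], table[3], table[6], ... with dword indices and scale 4.
    __m512i idx = _mm512_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21,
                                    24, 27, 30, 33, 36, 39, 42, 45);
    __m512 g = _mm512_i32gather_ps(idx, table, 4);

    float out[16];
    _mm512_storeu_ps(out, g);
    printf("%f %f\n", out[0], out[15]);   // 0.000000 45.000000
    return 0;
}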
Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a vector register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1 or 4 byte displacement
DEST[MAXVL-1:VL] := 0
Other Exceptions
See Table 2-63, “Type E12 Class Exception Conditions.”
VGATHERQPS/VGATHERQPD—Gather Packed Single, Packed Double with Signed Qword Indices
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.66.0F38.W0 93 /vsib A V/V (AVX512VL AND Using signed qword indices, gather single-precision
VGATHERQPS xmm1 {k1}, vm64x AVX512F) OR floating-point values from memory using k1 as
AVX10.11 completion mask.
EVEX.256.66.0F38.W0 93 /vsib A V/V (AVX512VL AND Using signed qword indices, gather single-precision
VGATHERQPS xmm1 {k1}, vm64y AVX512F) OR floating-point values from memory using k1 as
AVX10.11 completion mask.
EVEX.512.66.0F38.W0 93 /vsib A V/V AVX512F Using signed qword indices, gather single-precision
VGATHERQPS ymm1 {k1}, vm64z OR AVX10.11 floating-point values from memory using k1 as
completion mask.
EVEX.128.66.0F38.W1 93 /vsib A V/V (AVX512VL AND Using signed qword indices, gather float64 vector into
VGATHERQPD xmm1 {k1}, vm64x AVX512F) OR float64 vector xmm1 using k1 as completion mask.
AVX10.11
EVEX.256.66.0F38.W1 93 /vsib A V/V (AVX512VL AND Using signed qword indices, gather float64 vector into
VGATHERQPD ymm1 {k1}, vm64y AVX512F) OR float64 vector ymm1 using k1 as completion mask.
AVX10.11
EVEX.512.66.0F38.W1 93 /vsib A V/V AVX512F Using signed qword indices, gather float64 vector into
VGATHERQPD zmm1 {k1}, vm64z OR AVX10.11 float64 vector zmm1 using k1 as completion mask.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
A set of 8 single precision/double precision floating-point memory locations pointed to by base address BASE_ADDR
and index vector V_INDEX with scale SCALE are gathered. The result is written into a vector register. The elements
are specified via the VSIB (i.e., the index register is a vector register, holding packed indices). Elements will only
be loaded if their corresponding mask bit is one. If an element’s mask bit is not set, the corresponding element of
the destination register is left unchanged. The entire mask register will be set to zero by this instruction unless it
triggers an exception.
This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception
is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination
register and the mask register (k1) are partially updated; those elements that have been gathered are placed into
the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already
gathered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruc-
tion breakpoint is not re-triggered when the instruction is continued.
If the data element size is less than the index element size, the higher part of the destination register and the mask
register do not correspond to any elements being gathered. This instruction sets those higher parts to zero. It may
update these unused elements to one or both of those registers even if the instruction triggers an exception, and
even if the instruction triggers the exception before gathering any elements.
Note that:
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel-
64 memory-ordering model.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all
elements closer to the LSB of the destination zmm will be completed (and non-faulting). Individual elements
closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered
in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to
the left of a faulting one may be gathered before the fault is delivered. A given implementation of this
instruction is repeatable - given the same input values and architectural state, the same set of elements to the
left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
Note that the presence of VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if
ModRM.rm is different than 100b.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit
mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are
ignored.
The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX. The instruction
will #UD fault if the k0 mask register is specified.
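A usage sketch of the qword-index form (illustrative only; the intrinsic name and its ymm-wide return type should be verified against the intrinsics guide): eight 64-bit indices gather only eight single precision elements, matching VGATHERQPS ymm1, vm64z.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float table[32];
    for (int i = 0; i < 32; i++) table[i] = (float)(i * 10);

    __m512i idx = _mm512_setr_epi64(1, 2, 4, 8, 16, 31, 0, 3);
    __m256  g   = _mm512_i64gather_ps(idx, table, 4);   // eight single precision elements only

    float out[8];
    _mm256_storeu_ps(out, g);
    printf("%f %f\n", out[0], out[5]);   // 10.000000 310.000000
    return 0;
}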
Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a ZMM register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1 or 4 byte displacement
FOR j := 0 TO KL-1
i := j * 32
k := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] :=
MEM[BASE_ADDR + (VINDEX[k+63:k]) * SCALE + DISP]
k1[j] := 0
ELSE *DEST[i+31:i] := remains unchanged*
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0
DEST[MAXVL-1:VL/2] := 0
DEST[MAXVL-1:VL] := 0
Other Exceptions
See Table 2-63, “Type E12 Class Exception Conditions.”
VGETEXPPD—Convert Exponents of Packed Double Precision Floating-Point Values to Double
Precision Floating-Point Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.66.0F38.W1 42 /r A V/V (AVX512VL Convert the exponent of packed double precision floating-
VGETEXPPD xmm1 {k1}{z}, AND AVX512F) point values in the source operand to double precision
xmm2/m128/m64bcst OR AVX10.11 floating-point results representing unbiased integer
exponents and stores the results in the destination register.
EVEX.256.66.0F38.W1 42 /r A V/V (AVX512VL Convert the exponent of packed double precision floating-
VGETEXPPD ymm1 {k1}{z}, AND AVX512F) point values in the source operand to double precision
ymm2/m256/m64bcst OR AVX10.11 floating-point results representing unbiased integer
exponents and stores the results in the destination register.
EVEX.512.66.0F38.W1 42 /r A V/V AVX512F Convert the exponent of packed double precision floating-
VGETEXPPD zmm1 {k1}{z}, OR AVX10.11 point values in the source operand to double precision
zmm2/m512/m64bcst{sae} floating-point results representing unbiased integer
exponents and stores the results in the destination under
writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Extracts the biased exponents from the normalized double precision floating-point representation of each qword
data element of the source operand (the second operand) as an unbiased signed integer value, or converts the
denormal representation of input data to unbiased negative integer values. Each integer value of the unbiased
exponent is converted to double precision floating-point value and written to the corresponding qword elements of
the destination operand (the first operand) as double precision floating-point numbers.
The destination operand is a ZMM/YMM/XMM register and updated under the writemask. The source operand can
be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from
a 64-bit memory location.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Each GETEXP operation converts the exponent value into a floating-point number (permitting input value in
denormal representation). Special cases of input values are listed in Table 5-13.
The formula is:
GETEXP(x) = floor(log2(|x|))
Notation floor(x) stands for the greatest integer not exceeding real number x.
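The definition can be checked against ordinary library math for normal, non-zero inputs (the special cases are defined in Table 5-13). An illustrative sketch:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x[] = { 1.0, 1.5, 4.0, 0.3, -6.0 };
    for (int i = 0; i < 5; i++)
        printf("GETEXP(%g) = %g\n", x[i], floor(log2(fabs(x[i]))));
    // prints 0, 0, 2, -2, 2
    return 0;
}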
Table 5-13. VGETEXPPD/SD Special Cases
Input Operand Result Comments
src1 = NaN QNaN(src1)
If (SRC = SNaN) then #IE
0 < |src1| < INF floor(log2(|src1|))
If (SRC = denormal) then #DE
| src1| = +INF +INF
| src1| = 0 -INF
Operation
NormalizeExpTinyDPFP(SRC[63:0])
{
// Jbit is the hidden integral bit of a floating-point number. In case of denormal number it has the value of ZERO.
Src.Jbit := 0;
Dst.exp := 1;
Dst.fraction := SRC[51:0];
WHILE(Src.Jbit = 0)
{
Src.Jbit := Dst.fraction[51]; // Get the fraction MSB
Dst.fraction := Dst.fraction << 1 ; // One bit shift left
Dst.exp-- ; // Decrement the exponent
}
Dst.fraction := 0; // zero out fraction bits
Dst.sign := 1; // Return negative sign
TMP[63:0] := MXCSR.DAZ? 0 : (Dst.sign << 63) OR (Dst.exp << 52) OR (Dst.fraction) ;
Return (TMP[63:0]);
}
ConvertExpDPFP(SRC[63:0])
{
Src.sign := 0; // Zero out sign bit
Src.exp := SRC[62:52];
Src.fraction := SRC[51:0];
// Check for NaN
IF (SRC = NaN)
{
IF ( SRC = SNAN ) SET IE;
Return QNAN(SRC);
}
// Check for +INF
IF (Src = +INF) RETURN (Src);
{
TMP[63:0] := (Src.sign << 63) OR (Src.exp << 52) OR (Src.fraction) ;
}
TMP := SAR(TMP, 52) ; // Shift Arithmetic Right
TMP := TMP - 1023; // Subtract Bias
Return CvtI2D(TMP); // Convert INT to double precision floating-point number
}
}
Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VGETEXPPH—Convert Exponents of Packed FP16 Values to FP16 Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.66.MAP6.W0 42 /r A V/V (AVX512-FP16 Convert the exponent of FP16 values in the source
VGETEXPPH xmm1{k1}{z}, AND AVX512VL) operand to FP16 results representing unbiased
xmm2/m128/m16bcst OR AVX10.11 integer exponents and stores the results in the
destination register subject to writemask k1.
EVEX.256.66.MAP6.W0 42 /r A V/V (AVX512-FP16 Convert the exponent of FP16 values in the source
VGETEXPPH ymm1{k1}{z}, AND AVX512VL) operand to FP16 results representing unbiased
ymm2/m256/m16bcst OR AVX10.11 integer exponents and stores the results in the
destination register subject to writemask k1.
EVEX.512.66.MAP6.W0 42 /r A V/V AVX512-FP16 Convert the exponent of FP16 values in the source
VGETEXPPH zmm1{k1}{z}, OR AVX10.11 operand to FP16 results representing unbiased
zmm2/m512/m16bcst {sae} integer exponents and stores the results in the
destination register subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction extracts the biased exponents from the normalized FP16 representation of each word element of
the source operand (the second operand) as an unbiased signed integer value, or converts the denormal representa-
tion of input data to unbiased negative integer values. Each integer value of the unbiased exponent is converted to
an FP16 value and written to the corresponding word elements of the destination operand (the first operand) as
FP16 numbers.
The destination elements are updated according to the writemask.
Each GETEXP operation converts the exponent value into a floating-point number (permitting input value in
denormal representation). Special cases of input values are listed in Table 5-7.
The formula is:
GETEXP(x) = floor(log2(|x|))
Notation floor(x) stands for maximal integer not exceeding real number x.
Software usage of VGETEXPxx and VGETMANTxx instructions generally involve a combination of GETEXP operation
and GETMANT operation (see VGETMANTPH). Thus, the VGETEXPPH instruction does not require software to
handle SIMD floating-point exceptions.
def getexp_fp16(src):
src.sign := 0 // make positive
exponent_all_ones := (src[14:10] == 0x1F)
exponent_all_zeros := (src[14:10] == 0)
mantissa_all_zeros := (src[9:0] == 0)
zero := exponent_all_zeros and mantissa_all_zeros
signaling_bit := src[9]
if nan:
if snan:
MXCSR.IE := 1
return qnan(src) // convert snan to a qnan
if positive_infinity:
return src
if zero:
return -INF
if denormal:
tmp := normalize_exponent_tiny_fp16(src)
MXCSR.DE := 1
else:
tmp := src
tmp := SAR(tmp, 10) // shift arithmetic right
tmp := tmp - 15 // subtract bias
return convert_integer_to_fp16(tmp)
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.fp16[0]
ELSE:
tsrc := src.fp16[i]
DEST.fp16[i] := getexp_fp16(tsrc)
ELSE IF *zeroing*:
DEST.fp16[i] := 0
//else DEST.fp16[i] remains unchanged
DEST[MAXVL-1:VL] := 0
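A minimal C model of the same computation on an FP16 bit pattern, consistent with the pseudocode above (illustrative only; it returns the unbiased exponent as a double rather than re-encoding it as FP16, and it ignores MXCSR.DAZ and the exception flags):

#include <stdint.h>
#include <math.h>
#include <stdio.h>

static double getexp_fp16_model(uint16_t x)
{
    unsigned exp  = (x >> 10) & 0x1F;    // biased exponent field
    unsigned frac = x & 0x3FF;           // 10-bit fraction
    if (exp == 0x1F)                     // NaN or infinity
        return frac ? NAN : INFINITY;    // GETEXP(|+/-Inf|) = +Inf
    if (exp != 0)
        return (double)((int)exp - 15);  // normal: unbiased exponent
    if (frac == 0)
        return -INFINITY;                // GETEXP(+/-0) = -Inf
    int p = 9;                           // denormal: value = frac * 2^-24
    while (((frac >> p) & 1) == 0) p--;  // position of the leading fraction bit
    return (double)(p - 24);             // floor(log2(|x|))
}

int main(void)
{
    printf("%g %g %g\n",
           getexp_fp16_model(0x3C00),    // 1.0               -> 0
           getexp_fp16_model(0x4500),    // 5.0               -> 2
           getexp_fp16_model(0x0001));   // smallest denormal -> -24
    return 0;
}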
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Extracts the biased exponents from the normalized single precision floating-point representation of each dword
element of the source operand (the second operand) as an unbiased signed integer value, or converts the denormal
representation of input data to unbiased negative integer values. Each integer value of the unbiased exponent is
converted to single precision floating-point value and written to the corresponding dword elements of the destina-
tion operand (the first operand) as single precision floating-point numbers.
The destination operand is a ZMM/YMM/XMM register and updated under the writemask. The source operand can
be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from
a 32-bit memory location.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Each GETEXP operation converts the exponent value into a floating-point number (permitting input value in
denormal representation). Special cases of input values are listed in Table 5-15.
The formula is:
GETEXP(x) = floor(log2(|x|))
Notation floor(x) stands for maximal integer not exceeding real number x.
Software usage of VGETEXPxx and VGETMANTxx instructions generally involve a combination of GETEXP operation
and GETMANT operation (see VGETMANTPD). Thus, VGETEXPxx instructions do not require software to handle SIMD
floating-point exceptions.
Table 5-15. VGETEXPPS/SS Special Cases
Input Operand Result Comments
src1 = NaN QNaN(src1)
If (SRC = SNaN) then #IE
0 < |src1| < INF floor(log2(|src1|))
If (SRC = denormal) then #DE
| src1| = +INF +INF
| src1| = 0 -INF
Figure 5-14 illustrates the VGETEXPPS functionality on input values with normalized representation.
[Figure 5-14: bit-level example of VGETEXPPS on a normalized input. For Src = 2^1, subtracting the bias from the exponent field gives 1, and Cvt_PI2PS(01h) produces the result 1.0 (2^0).]
Operation
NormalizeExpTinySPFP(SRC[31:0])
{
// Jbit is the hidden integral bit of a floating-point number. In case of denormal number it has the value of ZERO.
Src.Jbit := 0;
Dst.exp := 1;
Dst.fraction := SRC[22:0];
WHILE(Src.Jbit = 0)
{
Src.Jbit := Dst.fraction[22]; // Get the fraction MSB
Dst.fraction := Dst.fraction << 1 ; // One bit shift left
Dst.exp-- ; // Decrement the exponent
}
Dst.fraction := 0; // zero out fraction bits
Dst.sign := 1; // Return negative sign
TMP[31:0] := MXCSR.DAZ? 0 : (Dst.sign << 31) OR (Dst.exp << 23) OR (Dst.fraction) ;
Return (TMP[31:0]);
}
ConvertExpSPFP(SRC[31:0])
{
Src.sign := 0; // Zero out sign bit
Src.exp := SRC[30:23];
Src.fraction := SRC[22:0];
// Check for NaN
IF (SRC = NaN)
{
IF ( SRC = SNAN ) SET IE;
Return QNAN(SRC);
}
// Check for +INF
IF (Src = +INF) RETURN (Src);
Intel C/C++ Compiler Intrinsic Equivalent
VGETEXPPS __m512 _mm512_getexp_ps( __m512 a);
VGETEXPPS __m512 _mm512_mask_getexp_ps(__m512 s, __mmask16 k, __m512 a);
VGETEXPPS __m512 _mm512_maskz_getexp_ps( __mmask16 k, __m512 a);
VGETEXPPS __m512 _mm512_getexp_round_ps( __m512 a, int sae);
VGETEXPPS __m512 _mm512_mask_getexp_round_ps(__m512 s, __mmask16 k, __m512 a, int sae);
VGETEXPPS __m512 _mm512_maskz_getexp_round_ps( __mmask16 k, __m512 a, int sae);
VGETEXPPS __m256 _mm256_getexp_ps(__m256 a);
VGETEXPPS __m256 _mm256_mask_getexp_ps(__m256 s, __mmask8 k, __m256 a);
VGETEXPPS __m256 _mm256_maskz_getexp_ps( __mmask8 k, __m256 a);
VGETEXPPS __m128 _mm_getexp_ps(__m128 a);
VGETEXPPS __m128 _mm_mask_getexp_ps(__m128 s, __mmask8 k, __m128 a);
VGETEXPPS __m128 _mm_maskz_getexp_ps( __mmask8 k, __m128 a);
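A short usage sketch of the 128-bit form listed above (illustrative only; requires AVX-512VL support):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128 v = _mm_set_ps(0.3f, 4.0f, 1.5f, 1.0f);   // elements 3..0
    __m128 e = _mm_getexp_ps(v);                     // floor(log2(|x|)) per element
    float out[4];
    _mm_storeu_ps(out, e);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);   // 0 0 2 -2
    return 0;
}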
Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VGETEXPSD—Convert Exponents of Scalar Double Precision Floating-Point Value to Double
Precision Floating-Point Value
Opcode/Instruction:
EVEX.LLIG.66.0F38.W1 43 /r
VGETEXPSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}
Op/En: A
64/32 Bit Mode Support: V/V
CPUID Feature Flag: AVX512F OR AVX10.1 (see Note 1)
Description: Convert the biased exponent (bits 62:52) of the low double precision floating-point value in
xmm3/m64 to a double precision floating-point value representing the unbiased integer exponent. Stores the
result to the low 64 bits of xmm1 under the writemask k1 and merges with the other elements of xmm2.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Extracts the biased exponent from the normalized double precision floating-point representation of the low qword
data element of the source operand (the third operand) as an unbiased signed integer value, or converts the denormal
representation of input data to unbiased negative integer values. The integer value of the unbiased exponent is
converted to double precision floating-point value and written to the destination operand (the first operand) as
double precision floating-point numbers. Bits (127:64) of the XMM register destination are copied from corre-
sponding bits in the first source operand.
The destination must be a XMM register, the source operand can be a XMM register or a float64 memory location.
If writemasking is used, the low quadword element of the destination operand is conditionally updated depending
on the value of writemask register k1. If writemasking is not used, the low quadword element of the destination
operand is unconditionally updated.
Each GETEXP operation converts the exponent value into a floating-point number (permitting input value in
denormal representation). Special cases of input values are listed in Table 5-13.
The formula is:
GETEXP(x) = floor(log2(|x|))
Notation floor(x) stands for maximal integer not exceeding real number x.
Operation
// NormalizeExpTinyDPFP(SRC[63:0]) is defined in the Operation section of VGETEXPPD
VGETEXPSD (EVEX encoded version)
IF k1[0] OR *no writemask*
THEN DEST[63:0] :=
ConvertExpDPFP(SRC2[63:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0
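A usage sketch of the scalar intrinsic (illustrative only; the intrinsic name and argument order, with the first argument supplying bits 127:64 of the result, should be verified against the intrinsics guide):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_sd(0.0);       // supplies bits 127:64 of the result
    __m128d b = _mm_set_sd(12.0);      // low element whose exponent is extracted
    __m128d r = _mm_getexp_sd(a, b);
    printf("%g\n", _mm_cvtsd_f64(r));  // floor(log2(12.0)) = 3
    return 0;
}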
Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”
VGETEXPSH—Convert Exponents of Scalar FP16 Values to FP16 Values
Opcode/Instruction:
EVEX.LLIG.66.MAP6.W0 43 /r
VGETEXPSH xmm1{k1}{z}, xmm2, xmm3/m16 {sae}
Op/En: A
64/32 Bit Mode Support: V/V
CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see Note 1)
Description: Convert the exponent of the FP16 value in the low word of the source operand to an FP16 result
representing the unbiased integer exponent, and store the result in the low word of the destination register
subject to writemask k1. Bits 127:16 of xmm2 are copied to xmm1[127:16].
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction extracts the biased exponents from the normalized FP16 representation of the low word element of
the source operand (the second operand) as an unbiased signed integer value, or converts the denormal representa-
tion of input data to an unbiased negative integer value. The integer value of the unbiased exponent is converted
to an FP16 value and written to the low word element of the destination operand (the first operand) as an FP16
number.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
Each GETEXP operation converts the exponent value into a floating-point number (permitting input value in
denormal representation). Special cases of input values are listed in Table 5-14.
The formula is:
GETEXP(x) = floor(log2(|x|))
Notation floor(x) stands for maximal integer not exceeding real number x.
Software usage of VGETEXPxx and VGETMANTxx instructions generally involve a combination of GETEXP operation
and GETMANT operation (see VGETMANTSH). Thus, the VGETEXPSH instruction does not require software to
handle SIMD floating-point exceptions.
Operation
VGETEXPSH dest{k1}, src1, src2
IF k1[0] or *no writemask*:
DEST.fp16[0] := getexp_fp16(src2.fp16[0]) // see VGETEXPPH
ELSE IF *zeroing*:
DEST.fp16[0] := 0
//else DEST.fp16[0] remains unchanged
DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Extracts the biased exponent from the normalized single precision floating-point representation of the low double-
word data element of the source operand (the third operand) as an unbiased signed integer value, or converts the
denormal representation of input data to unbiased negative integer values. The integer value of the unbiased expo-
nent is converted to single precision floating-point value and written to the destination operand (the first operand)
as single precision floating-point numbers. Bits (127:32) of the XMM register destination are copied from corre-
sponding bits in the first source operand.
The destination must be a XMM register, the source operand can be a XMM register or a float32 memory location.
If writemasking is used, the low doubleword element of the destination operand is conditionally updated depending
on the value of writemask register k1. If writemasking is not used, the low doubleword element of the destination
operand is unconditionally updated.
Each GETEXP operation converts the exponent value into a floating-point number (permitting input value in
denormal representation). Special cases of input values are listed in Table 5-15.
The formula is:
GETEXP(x) = floor(log2(|x|))
Notation floor(x) stands for maximal integer not exceeding real number x.
Software usage of VGETEXPxx and VGETMANTxx instructions generally involve a combination of GETEXP operation
and GETMANT operation (see VGETMANTPD). Thus, VGETEXPxx instructions do not require software to handle SIMD
floating-point exceptions.
Operation
// NormalizeExpTinySPFP(SRC[31:0]) is defined in the Operation section of VGETEXPPS
// ConvertExpSPFP(SRC[31:0]) is defined in the Operation section of VGETEXPPS
VGETEXPSS (EVEX encoded version)
IF k1[0] OR *no writemask*
THEN DEST[31:0] :=
ConvertExpSPFP(SRC2[31:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0]:= 0
FI
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0
Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”
VGETMANTPD—Extract Float64 Vector of Normalized Mantissas From Float64 Vector
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
EVEX.128.66.0F3A.W1 26 /r ib A V/V (AVX512VL Get Normalized Mantissa from float64 vector
VGETMANTPD xmm1 {k1}{z}, AND AVX512F) xmm2/m128/m64bcst and store the result in xmm1,
xmm2/m128/m64bcst, imm8 OR AVX10.11 using imm8 for sign control and mantissa interval
normalization, under writemask.
EVEX.256.66.0F3A.W1 26 /r ib A V/V (AVX512VL Get Normalized Mantissa from float64 vector
VGETMANTPD ymm1 {k1}{z}, AND AVX512F) ymm2/m256/m64bcst and store the result in ymm1,
ymm2/m256/m64bcst, imm8 OR AVX10.11 using imm8 for sign control and mantissa interval
normalization, under writemask.
EVEX.512.66.0F3A.W1 26 /r ib A V/V AVX512F Get Normalized Mantissa from float64 vector
VGETMANTPD zmm1 {k1}{z}, OR AVX10.11 zmm2/m512/m64bcst and store the result in zmm1,
zmm2/m512/m64bcst{sae}, using imm8 for sign control and mantissa interval
imm8 normalization, under writemask.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Convert double precision floating-point values in the source operand (the second operand) to double precision floating-
point values with the mantissa normalization and sign control specified by the imm8 byte, see Figure 5-15. The
converted results are written to the destination operand (the first operand) using writemask k1. The normalized
mantissa is specified by interv (imm8[1:0]) and the sign control (sc) is specified by bits 3:2 of the immediate byte.
The destination operand is a ZMM/YMM/XMM register updated under the writemask. The source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 64-
bit memory location.
For each input double precision floating-point value x, the conversion operation is:
GetMant(x) = ±2^k |x.significand|
where:
1 <= |x.significand| < 2
Unbiased exponent k can be either 0 or -1, depending on the interval range defined by interv, the range of the
significand and whether the exponent of the source is even or odd. The sign of the final result is determined by sc
and the source sign. The encoded value of imm8[1:0] and sign control are shown in Figure 5-15.
Each converted double precision floating-point result is encoded according to the sign control, the unbiased expo-
nent k (adding bias) and a mantissa normalized to the range specified by interv.
The GetMant() function follows Table 5-16 when dealing with floating-point special numbers.
This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1
are computed and stored into the destination. Elements in zmm1 with the corresponding bit clear in k1 retain their
previous values.
Note: EVEX.vvvv is reserved and must be 1111b; otherwise instructions will #UD.
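A usage sketch through the packed intrinsic (illustrative only; the _MM_MANT_* enumerator names below are the commonly documented ones and should be checked against the compiler's immintrin.h):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m512d v = _mm512_set1_pd(-96.0);   // -96 = -1.5 * 2^6
    // Normalize each mantissa into [1, 2) and keep the source sign.
    __m512d m = _mm512_getmant_pd(v, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_src);
    double out[8];
    _mm512_storeu_pd(out, m);
    printf("%g\n", out[0]);              // -1.5
    return 0;
}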
Operation
def getmant_fp64(src, sign_control, normalization_interval):
bias := 1023
dst.sign := sign_control[0] ? 0 : src.sign
signed_one := sign_control[0] ? +1.0 : -1.0
dst.exp := src.exp
dst.fraction := src.fraction
zero := (dst.exp = 0) and ((dst.fraction = 0) or (MXCSR.DAZ=1))
denormal := (dst.exp = 0) and (dst.fraction != 0) and (MXCSR.DAZ=0)
infinity := (dst.exp = 0x7FF) and (dst.fraction = 0)
nan := (dst.exp = 0x7FF) and (dst.fraction != 0)
src_signaling := src.fraction[51]
snan := nan and (src_signaling = 0)
positive := (src.sign = 0)
negative := (src.sign = 1)
if nan:
if snan:
MXCSR.IE := 1
return qnan(src)
if denormal:
jbit := 0
dst.exp := bias
while jbit = 0:
jbit := dst.fraction[51]
dst.fraction := dst.fraction << 1
dst.exp := dst.exp - 1
MXCSR.DE := 1
VGETMANTPD (EVEX Encoded Versions)
VGETMANTPD dest{k1}, src, imm8
VL = 128, 256, or 512
KL := VL / 64
sign_control := imm8[3:2]
normalization_interval := imm8[1:0]
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.double[0]
ELSE:
tsrc := src.double[i]
DEST.double[i] := getmant_fp64(tsrc, sign_control, normalization_interval)
ELSE IF *zeroing*:
DEST.double[i] := 0
//else DEST.double[i] remains unchanged
DEST[MAX_VL-1:VL] := 0
Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VGETMANTPH—Extract FP16 Vector of Normalized Mantissas from FP16 Vector
EVEX.128.NP.0F3A.W0 26 /r /ib VGETMANTPH xmm1{k1}{z}, xmm2/m128/m16bcst, imm8
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1). Get normalized mantissa from FP16 vector xmm2/m128/m16bcst and store the result in xmm1, using imm8 for sign control and mantissa interval normalization, subject to writemask k1.
EVEX.256.NP.0F3A.W0 26 /r /ib VGETMANTPH ymm1{k1}{z}, ymm2/m256/m16bcst, imm8
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (Note 1). Get normalized mantissa from FP16 vector ymm2/m256/m16bcst and store the result in ymm1, using imm8 for sign control and mantissa interval normalization, subject to writemask k1.
EVEX.512.NP.0F3A.W0 26 /r /ib VGETMANTPH zmm1{k1}{z}, zmm2/m512/m16bcst {sae}, imm8
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (Note 1). Get normalized mantissa from FP16 vector zmm2/m512/m16bcst and store the result in zmm1, using imm8 for sign control and mantissa interval normalization, subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts the FP16 values in the source operand (the second operand) to FP16 values with the
mantissa normalization and sign control specified by the imm8 byte, see Table 5-17. The converted results are
written to the destination operand (the first operand) using writemask k1. The normalized mantissa is specified by
interv (imm8[1:0]) and the sign control (SC) is specified by bits 3:2 of the immediate byte.
The destination elements are updated according to the writemask.
GetMant(x) = ±2^k |x.significand|
where:
1 ≤ |x.significand| < 2
Unbiased exponent k depends on the interval range defined by interv and whether the exponent of the source is
even or odd. The sign of the final result is determined by the sign control and the source sign and the leading frac-
tion bit.
The encoded value of imm8[1:0] and sign control are shown in Table 5-17.
Each converted FP16 result is encoded according to the sign control, the unbiased exponent k (adding bias) and a
mantissa normalized to the range specified by interv.
The GetMant() function follows Table 5-18 when dealing with floating-point special numbers.
Operation
def getmant_fp16(src, sign_control, normalization_interval):
bias := 15
dst.sign := sign_control[0] ? 0 : src.sign
signed_one := sign_control[0] ? +1.0 : -1.0
dst.exp := src.exp
dst.fraction := src.fraction
zero := (dst.exp = 0) and (dst.fraction = 0)
denormal := (dst.exp = 0) and (dst.fraction != 0)
infinity := (dst.exp = 0x1F) and (dst.fraction = 0)
nan := (dst.exp = 0x1F) and (dst.fraction != 0)
src_signaling := src.fraction[9]
snan := nan and (src_signaling = 0)
positive := (src.sign = 0)
negative := (src.sign = 1)
if nan:
if snan:
MXCSR.IE := 1
return qnan(src)
return signed_one
if infinity:
if sign_control[1]:
MXCSR.IE := 1
return QNaN_Indefinite
return signed_one
if sign_control[1]:
MXCSR.IE := 1
return QNaN_Indefinite
if denormal:
jbit := 0
dst.exp := bias // set exponent to bias value
while jbit = 0:
jbit := dst.fraction[9]
dst.fraction := dst.fraction << 1
dst.exp := dst.exp - 1
MXCSR.DE := 1
VGETMANTPH (EVEX Encoded Versions)
VGETMANTPH dest{k1}, src, imm8
VL = 128, 256, or 512
KL := VL / 16
sign_control := imm8[3:2]
normalization_interval := imm8[1:0]
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.fp16[0]
ELSE:
tsrc := src.fp16[i]
DEST.fp16[i] := getmant_fp16(tsrc, sign_control, normalization_interval)
ELSE IF *zeroing*:
DEST.fp16[i] := 0
//else DEST.fp16[i] remains unchanged
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VGETMANTPH __m128h _mm_getmant_ph (__m128h a, _MM_MANTISSA_NORM_ENUM norm, _MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m128h _mm_mask_getmant_ph (__m128h src, __mmask8 k, __m128h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m128h _mm_maskz_getmant_ph (__mmask8 k, __m128h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m256h _mm256_getmant_ph (__m256h a, _MM_MANTISSA_NORM_ENUM norm, _MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m256h _mm256_mask_getmant_ph (__m256h src, __mmask16 k, __m256h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m256h _mm256_maskz_getmant_ph (__mmask16 k, __m256h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m512h _mm512_getmant_ph (__m512h a, _MM_MANTISSA_NORM_ENUM norm, _MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m512h _mm512_mask_getmant_ph (__m512h src, __mmask32 k, __m512h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m512h _mm512_maskz_getmant_ph (__mmask32 k, __m512h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign);
VGETMANTPH __m512h _mm512_getmant_round_ph (__m512h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign, const int sae);
VGETMANTPH __m512h _mm512_mask_getmant_round_ph (__m512h src, __mmask32 k, __m512h a, _MM_MANTISSA_NORM_ENUM
norm, _MM_MANTISSA_SIGN_ENUM sign, const int sae);
VGETMANTPH __m512h _mm512_maskz_getmant_round_ph (__mmask32 k, __m512h a, _MM_MANTISSA_NORM_ENUM norm,
_MM_MANTISSA_SIGN_ENUM sign, const int sae);
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
VGETMANTPS—Extract Float32 Vector of Normalized Mantissas From Float32 Vector
EVEX.128.66.0F3A.W0 26 /r ib VGETMANTPS xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1). Get normalized mantissa from float32 vector xmm2/m128/m32bcst and store the result in xmm1, using imm8 for sign control and mantissa interval normalization, under writemask.
EVEX.256.66.0F3A.W0 26 /r ib VGETMANTPS ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1). Get normalized mantissa from float32 vector ymm2/m256/m32bcst and store the result in ymm1, using imm8 for sign control and mantissa interval normalization, under writemask.
EVEX.512.66.0F3A.W0 26 /r ib VGETMANTPS zmm1 {k1}{z}, zmm2/m512/m32bcst{sae}, imm8
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1). Get normalized mantissa from float32 vector zmm2/m512/m32bcst and store the result in zmm1, using imm8 for sign control and mantissa interval normalization, under writemask.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Convert single precision floating-point values in the source operand (the second operand) to single precision floating-
point values with the mantissa normalization and sign control specified by the imm8 byte, see Figure 5-15. The
converted results are written to the destination operand (the first operand) using writemask k1. The normalized
mantissa is specified by interv (imm8[1:0]) and the sign control (sc) is specified by bits 3:2 of the immediate byte.
The destination operand is a ZMM/YMM/XMM register updated under the writemask. The source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-
bit memory location.
For each input single precision floating-point value x, the conversion operation is:
GetMant(x) = ±2^k |x.significand|
where:
1 <= |x.significand| < 2
Unbiased exponent k can be either 0 or -1, depending on the interval range defined by interv, the range of the
significand and whether the exponent of the source is even or odd. The sign of the final result is determined by sc
and the source sign. The encoded value of imm8[1:0] and sign control are shown in Figure 5-15.
Each converted single precision floating-point result is encoded according to the sign control, the unbiased expo-
nent k (adding bias) and a mantissa normalized to the range specified by interv.
The GetMant() function follows Table 5-16 when dealing with floating-point special numbers.
This instruction is writemasked, so only those elements with the corresponding bit set in vector mask register k1
are computed and stored into the destination. Elements in zmm1 with the corresponding bit clear in k1 retain their
previous values.
Note: EVEX.vvvv is reserved and must be 1111b, VEX.L must be 0; otherwise instructions will #UD.
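For the common case of a finite, normal input with interv = 00b and sc = 00b, the element operation reduces to forcing the unbiased exponent to zero while keeping the sign and fraction. A minimal scalar C model of that case (assuming the IEEE 754 binary32 layout; zeros, denormals, NaNs, and infinities are deliberately not handled here) is:

#include <stdint.h>
#include <string.h>

static float getmant_norm_1_2(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);              /* reinterpret the float as raw bits */
    bits = (bits & 0x807FFFFFu) | (127u << 23);  /* keep sign and fraction, set biased exponent to 127 (k = 0) */
    float r;
    memcpy(&r, &bits, sizeof r);
    return r;                                    /* ±1.fraction, i.e., |result| in [1, 2) */
}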
Operation
def getmant_fp32(src, sign_control, normalization_interval):
bias := 127
dst.sign := sign_control[0] ? 0 : src.sign
signed_one := sign_control[0] ? +1.0 : -1.0
dst.exp := src.exp
dst.fraction := src.fraction
zero := (dst.exp = 0) and ((dst.fraction = 0) or (MXCSR.DAZ=1))
denormal := (dst.exp = 0) and (dst.fraction != 0) and (MXCSR.DAZ=0)
infinity := (dst.exp = 0xFF) and (dst.fraction = 0)
nan := (dst.exp = 0xFF) and (dst.fraction != 0)
src_signaling := src.fraction[22]
snan := nan and (src_signaling = 0)
positive := (src.sign = 0)
negative := (src.sign = 1)
if nan:
if snan:
MXCSR.IE := 1
return qnan(src)
if denormal:
jbit := 0
dst.exp := bias
while jbit = 0:
jbit := dst.fraction[22]
dst.fraction := dst.fraction << 1
dst.exp := dst.exp - 1
MXCSR.DE := 1
return dst
VGETMANTPS (EVEX Encoded Versions)
VGETMANTPS dest{k1}, src, imm8
VL = 128, 256, or 512
KL := VL / 32
sign_control := imm8[3:2]
normalization_interval := imm8[1:0]
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.float[0]
ELSE:
tsrc := src.float[i]
DEST.float[i] := getmant_fp32(tsrc, sign_control, normalization_interval)
ELSE IF *zeroing*:
DEST.float[i] := 0
//else DEST.float[i] remains unchanged
DEST[MAX_VL-1:VL] := 0
Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VGETMANTSD—Extract Float64 of Normalized Mantissa From Float64 Scalar
EVEX.LLIG.66.0F3A.W1 27 /r ib VGETMANTSD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}, imm8
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1). Extract the normalized mantissa of the low float64 element in xmm3/m64 using imm8 for sign control and mantissa interval normalization. Store the mantissa to xmm1 under the writemask k1 and merge with the other elements of xmm2.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Convert the double precision floating-point value in the low quadword element of the second source operand (the third
operand) to a double precision floating-point value with the mantissa normalization and sign control specified by the
imm8 byte, see Figure 5-15. The converted result is written to the low quadword element of the destination
operand (the first operand) using writemask k1. Bits (127:64) of the XMM register destination are copied from
corresponding bits in the first source operand. The normalized mantissa is specified by interv (imm8[1:0]) and the
sign control (sc) is specified by bits 3:2 of the immediate byte.
The conversion operation is:
GetMant(x) = ±2^k |x.significand|
where:
1 <= |x.significand| < 2
Unbiased exponent k can be either 0 or -1, depending on the interval range defined by interv, the range of the
significand and whether the exponent of the source is even or odd. The sign of the final result is determined by sc
and the source sign. The encoded value of imm8[1:0] and sign control are shown in Figure 5-15.
The converted double precision floating-point result is encoded according to the sign control, the unbiased expo-
nent k (adding bias) and a mantissa normalized to the range specified by interv.
The GetMant() function follows Table 5-16 when dealing with floating-point special numbers.
If writemasking is used, the low quadword element of the destination operand is conditionally updated depending
on the value of writemask register k1. If writemasking is not used, the low quadword element of the destination
operand is unconditionally updated.
Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”
VGETMANTSH—Extract FP16 of Normalized Mantissa From FP16 Scalar
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction converts the FP16 value in the low element of the second source operand to an FP16 value with the
mantissa normalization and sign control specified by the imm8 byte, see Table 5-17. The converted result is written
to the low element of the destination operand using writemask k1. The normalized mantissa is specified by interv
(imm8[1:0]) and the sign control (SC) is specified by bits 3:2 of the immediate byte.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
For each input FP16 value x, the conversion operation is:
GetMant(x) = ±2^k |x.significand|
where:
1 ≤ |x.significand| < 2
Unbiased exponent k depends on the interval range defined by interv and whether the exponent of the source is
even or odd. The sign of the final result is determined by the sign control and the source sign and the leading frac-
tion bit.
The encoded value of imm8[1:0] and sign control are shown in Table 5-17.
Each converted FP16 result is encoded according to the sign control, the unbiased exponent k (adding bias) and a
mantissa normalized to the range specified by interv.
The GetMant() function follows Table 5-18 when dealing with floating-point special numbers.
DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
VGETMANTSS—Extract Float32 Vector of Normalized Mantissa From Float32 Scalar
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Convert the single precision floating-point value in the low doubleword element of the second source operand (the third
operand) to a single precision floating-point value with the mantissa normalization and sign control specified by the
imm8 byte, see Figure 5-15. The converted result is written to the low doubleword element of the destination
operand (the first operand) using writemask k1. Bits (127:32) of the XMM register destination are copied from
corresponding bits in the first source operand. The normalized mantissa is specified by interv (imm8[1:0]) and the
sign control (sc) is specified by bits 3:2 of the immediate byte.
The conversion operation is:
GetMant(x) = ±2^k |x.significand|
where:
1 <= |x.significand| < 2
Unbiased exponent k can be either 0 or -1, depending on the interval range defined by interv, the range of the
significand and whether the exponent of the source is even or odd. The sign of the final result is determined by sc
and the source sign. The encoded value of imm8[1:0] and sign control are shown in Figure 5-15.
The converted single precision floating-point result is encoded according to the sign control, the unbiased exponent
k (adding bias) and a mantissa normalized to the range specified by interv.
The GetMant() function follows Table 5-16 when dealing with floating-point special numbers.
If writemasking is used, the low doubleword element of the destination operand is conditionally updated depending
on the value of writemask register k1. If writemasking is not used, the low doubleword element of the destination
operand is unconditionally updated.
Operation
// getmant_fp32(src, sign_control, normalization_interval) is defined in the operation section of VGETMANTPS
Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”
VINSERTF128/VINSERTF32x4/VINSERTF64x2/VINSERTF32x8/VINSERTF64x4—Insert Packed
Floating-Point Values
VEX.256.66.0F3A.W0 18 /r ib VINSERTF128 ymm1, ymm2, xmm3/m128, imm8
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX. Insert 128 bits of packed floating-point values from xmm3/m128 and the remaining values from ymm2 into ymm1.
EVEX.256.66.0F3A.W0 18 /r ib VINSERTF32X4 ymm1 {k1}{z}, ymm2, xmm3/m128, imm8
  Op/En: C. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1). Insert 128 bits of packed single-precision floating-point values from xmm3/m128 and the remaining values from ymm2 into ymm1 under writemask k1.
EVEX.512.66.0F3A.W0 18 /r ib VINSERTF32X4 zmm1 {k1}{z}, zmm2, xmm3/m128, imm8
  Op/En: C. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1). Insert 128 bits of packed single-precision floating-point values from xmm3/m128 and the remaining values from zmm2 into zmm1 under writemask k1.
EVEX.256.66.0F3A.W1 18 /r ib VINSERTF64X2 ymm1 {k1}{z}, ymm2, xmm3/m128, imm8
  Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (Note 1). Insert 128 bits of packed double precision floating-point values from xmm3/m128 and the remaining values from ymm2 into ymm1 under writemask k1.
EVEX.512.66.0F3A.W1 18 /r ib VINSERTF64X2 zmm1 {k1}{z}, zmm2, xmm3/m128, imm8
  Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512DQ OR AVX10.1 (Note 1). Insert 128 bits of packed double precision floating-point values from xmm3/m128 and the remaining values from zmm2 into zmm1 under writemask k1.
EVEX.512.66.0F3A.W0 1A /r ib VINSERTF32X8 zmm1 {k1}{z}, zmm2, ymm3/m256, imm8
  Op/En: D. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512DQ OR AVX10.1 (Note 1). Insert 256 bits of packed single-precision floating-point values from ymm3/m256 and the remaining values from zmm2 into zmm1 under writemask k1.
EVEX.512.66.0F3A.W1 1A /r ib VINSERTF64X4 zmm1 {k1}{z}, zmm2, ymm3/m256, imm8
  Op/En: C. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1). Insert 256 bits of packed double precision floating-point values from ymm3/m256 and the remaining values from zmm2 into zmm1 under writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VINSERTF128/VINSERTF32x4 and VINSERTF64x2 insert 128 bits of packed floating-point values from the second
source operand (the third operand) into the destination operand (the first operand) at a 128-bit granular offset
specified by imm8[0] (256-bit destination) or imm8[1:0] (512-bit destination). The remaining portions of the destination operand are copied from
the corresponding fields of the first source operand (the second operand). The second source operand can be either
an XMM register or a 128-bit memory location. The destination and first source operands are vector registers.
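For example, the EVEX.512 single-precision form is exposed in C through the _mm512_insertf32x4 intrinsic (a mapping taken from the Intel Intrinsics Guide); the helper below is an illustrative sketch showing how the immediate selects which 128-bit lane of the destination is replaced.

#include <immintrin.h>

/* Overwrite 128-bit lane 2 (bits 383:256) of acc with chunk;
   the other three lanes are copied unchanged from acc. */
static __m512 replace_lane2(__m512 acc, __m128 chunk) {
    return _mm512_insertf32x4(acc, chunk, 2);
}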
Operation
VINSERTF32x4 (EVEX encoded versions)
(KL, VL) = (8, 256), (16, 512)
TMP_DEST[VL-1:0] := SRC1[VL-1:0]
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC2[127:0]
1: TMP_DEST[255:128] := SRC2[127:0]
ESAC.
FI;
IF VL = 512
CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC2[127:0]
01: TMP_DEST[255:128] := SRC2[127:0]
10: TMP_DEST[383:256] := SRC2[127:0]
11: TMP_DEST[511:384] := SRC2[127:0]
ESAC.
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO 15
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO 7
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
VEX-encoded instruction, see Table 2-23, “Type 6 Class Exception Conditions.”
Additionally:
#UD If VEX.L = 0.
EVEX-encoded instruction, see Table 2-56, “Type E6NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VINSERTI32x4 and VINSERTI64x2 insert 128 bits of packed integer values from the second source operand (the
third operand) into the destination operand (the first operand) at a 128-bit granular offset specified by imm8[0]
(256-bit destination) or imm8[1:0] (512-bit destination). The remaining portions of the destination are copied from the corresponding fields of the
first source operand (the second operand). The second source operand can be either an XMM register or a 128-bit
memory location. The high 6/7 bits of the immediate are ignored. The destination operand is a ZMM/YMM register
and updated at 32 and 64-bit granularity according to the writemask.
Operation
VINSERTI32x4 (EVEX encoded versions)
(KL, VL) = (8, 256), (16, 512)
TMP_DEST[VL-1:0] := SRC1[VL-1:0]
IF VL = 256
CASE (imm8[0]) OF
0: TMP_DEST[127:0] := SRC2[127:0]
1: TMP_DEST[255:128] := SRC2[127:0]
ESAC.
FI;
IF VL = 512
CASE (imm8[1:0]) OF
00: TMP_DEST[127:0] := SRC2[127:0]
01: TMP_DEST[255:128] := SRC2[127:0]
10: TMP_DEST[383:256] := SRC2[127:0]
11: TMP_DEST[511:384] := SRC2[127:0]
ESAC.
FI;
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO 15
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TMP_DEST[i+31:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FOR j := 0 TO 7
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VINSERTI128
TEMP[255:0] := SRC1[255:0]
CASE (imm8[0]) OF
0: TEMP[127:0] := SRC2[127:0]
1: TEMP[255:128] := SRC2[127:0]
ESAC
DEST := TEMP
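The VEX-encoded VINSERTI128 form corresponds to the AVX2 intrinsic _mm256_inserti128_si256; the sketch below is an illustrative assumption showing how a 256-bit value can be assembled from two 128-bit halves.

#include <immintrin.h>

/* Place lo in bits 127:0 and hi in bits 255:128 of the result. */
static __m256i concat128(__m128i lo, __m128i hi) {
    __m256i t = _mm256_castsi128_si256(lo);    /* lo in the low lane, upper lane undefined */
    return _mm256_inserti128_si256(t, hi, 1);  /* imm8 = 1 selects the upper 128-bit lane */
}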
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a SIMD compare of the packed FP16 values in the first source operand and the second
source operand and returns the maximum value for each pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source
operand (from either the first or second operand) be returned, the action of VMAXPH can be emulated using a
sequence of instructions, such as a comparison followed by AND, ANDN, and OR.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcast from a 16-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
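The per-element rule described above can be summarized by the following scalar C model (written with float for readability; the instruction itself operates on FP16 elements):

#include <math.h>

static float max_op(float src1, float src2) {
    if (src1 == 0.0f && src2 == 0.0f) return src2;   /* both zero (of either sign): second source */
    if (isnan(src1) || isnan(src2)) return src2;     /* any NaN: the second source is returned */
    return (src1 > src2) ? src1 : src2;              /* otherwise the larger value */
}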
Operation
def MAX(SRC1, SRC2):
IF (SRC1 = 0.0) and (SRC2 = 0.0):
DEST := SRC2
ELSE IF (SRC1 = NaN):
DEST := SRC2
ELSE IF (SRC2 = NaN):
DEST := SRC2
ELSE IF (SRC1 > SRC2):
DEST := SRC1
ELSE:
DEST := SRC2
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
tsrc2 := SRC2.fp16[0]
ELSE:
tsrc2 := SRC2.fp16[j]
DEST.fp16[j] := MAX(SRC1.fp16[j], tsrc2)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a compare of the low packed FP16 values in the first source operand and the second
source operand and returns the maximum value for the pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source
operand (from either the first or second operand) be returned, the action of VMAXSH can be emulated using a
sequence of instructions, such as a comparison followed by AND, ANDN, and OR.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
Operation
def MAX(SRC1, SRC2):
IF (SRC1 = 0.0) and (SRC2 = 0.0):
DEST := SRC2
ELSE IF (SRC1 = NaN):
DEST := SRC2
ELSE IF (SRC2 = NaN):
DEST := SRC2
ELSE IF (SRC1 > SRC2):
DEST := SRC1
ELSE:
DEST := SRC2
DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a SIMD compare of the packed FP16 values in the first source operand and the second
source operand and returns the minimum value for each pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source
operand (from either the first or second operand) be returned, the action of VMINPH can be emulated using a
sequence of instructions, such as a comparison followed by AND, ANDN, and OR.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcast from a 16-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
Operation
def MIN(SRC1, SRC2):
IF (SRC1 = 0.0) and (SRC2 = 0.0):
DEST := SRC2
ELSE IF (SRC1 = NaN):
DEST := SRC2
ELSE IF (SRC2 = NaN):
DEST := SRC2
ELSE IF (SRC1 < SRC2):
DEST := SRC1
ELSE:
DEST := SRC2
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
tsrc2 := SRC2.fp16[0]
ELSE:
tsrc2 := SRC2.fp16[j]
DEST.fp16[j] := MIN(SRC1.fp16[j], tsrc2)
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a compare of the low packed FP16 values in the first source operand and the second
source operand and returns the minimum value for the pair of values to the destination operand.
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is
returned. If a value in the second operand is an SNaN, then SNaN is forwarded unchanged to the destination (that
is, a QNaN version of the SNaN is not returned).
If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN
or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source
operand (from either the first or second operand) be returned, the action of VMINSH can be emulated using a
sequence of instructions, such as a comparison followed by AND, ANDN, and OR.
EVEX encoded versions: The first source operand (the second operand) is a ZMM/YMM/XMM register. The second
source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector
broadcast from a 16-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally
updated with writemask k1.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
Operation
def MIN(SRC1, SRC2):
IF (SRC1 = 0.0) and (SRC2 = 0.0):
DEST := SRC2
ELSE IF (SRC1 = NaN):
DEST := SRC2
ELSE IF (SRC2 = NaN):
DEST := SRC2
ELSE IF (SRC1 < SRC2):
DEST := SRC1
ELSE:
DEST := SRC2
DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction moves an FP16 value to a register or memory location.
The two register-only forms are aliases and differ only in where their operands are encoded; this is a side effect of
the encodings selected.
Operation
VMOVSH dest, src (two operand load)
IF k1[0] or no writemask:
DEST.fp16[0] := SRC.fp16[0]
ELSE IF *zeroing*:
DEST.fp16[0] := 0
// ELSE DEST.fp16[0] remains unchanged
DEST[MAXVL-1:16] := 0
DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instruction, see Table 2-53, “Type E5 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction either (a) copies one word element from an XMM register to a general-purpose register or memory
location or (b) copies one word element from a general-purpose register or memory location to an XMM register.
When writing a general-purpose register, the lower 16 bits of the register will contain the word value. The upper bits
of the general-purpose register are written with zeros.
Operation
VMOVW dest, src (two operand load)
DEST.word[0] := SRC.word[0]
DEST[MAXVL-1:16] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-59, “Type E9NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction multiplies packed FP16 values from source operands and stores the packed FP16 result in the desti-
nation operand. The destination elements are updated according to the writemask.
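In C, the 512-bit register form maps to the AVX512-FP16 intrinsic _mm512_mul_ph (an assumption based on the Intel Intrinsics Guide; building the sketch requires FP16 support, for example -mavx512fp16 with GCC or Clang).

#include <immintrin.h>

/* Element-wise FP16 multiply of two 32-element vectors. */
static __m512h mul_ph(__m512h a, __m512h b) {
    return _mm512_mul_ph(a, b);
}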
Operation
VMULPH (EVEX encoded versions) when src2 operand is a register
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.fp16[j] := SRC1.fp16[j] * SRC2.fp16[j]
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VMULPH (EVEX encoded versions) when src2 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
DEST.fp16[j] := SRC1.fp16[j] * SRC2.fp16[0]
ELSE:
DEST.fp16[j] := SRC1.fp16[j] * SRC2.fp16[j]
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction multiplies the low FP16 value from the source operands and stores the FP16 result in the destina-
tion operand. Bits 127:16 of the destination operand are copied from the corresponding bits of the first source
operand. Bits MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is
updated according to the writemask.
Operation
VMULSH (EVEX encoded versions)
IF EVEX.b = 1 and SRC2 is a register:
SET_RM(EVEX.RC)
ELSE:
SET_RM(MXCSR.RC)
DEST[127:16] := SRC1[127:16]
DEST[MAXVL-1:VL] := 0
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs an element-by-element blending of byte/word elements between the first source operand byte vector
register and the second source operand byte vector from memory or register, using the instruction mask as
selector. The result is written into the destination byte vector register.
The destination and first source operands are ZMM/YMM/XMM registers. The second source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The mask is not used as a writemask for this instruction. Instead, the mask is used as an element selector: every
element of the destination is conditionally selected between first source or second source using the value of the
related mask bit (0 for first source, 1 for second source).
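A hypothetical C example using the corresponding intrinsic _mm512_mask_blend_epi8 (per the Intel Intrinsics Guide): bit j of the mask selects byte j of the second source when set and byte j of the first source when clear.

#include <immintrin.h>

/* Take even-numbered bytes from a and odd-numbered bytes from b. */
static __m512i interleave_bytes(__m512i a, __m512i b) {
    __mmask64 k = 0xAAAAAAAAAAAAAAAAULL;   /* bits set at odd byte positions */
    return _mm512_mask_blend_epi8(k, a, b);
}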
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SRC2[i+7:i]
ELSE
IF *merging-masking* ; merging-masking
THEN DEST[i+7:i] := SRC1[i+7:i]
ELSE ; zeroing-masking
DEST[i+7:i] := 0
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0;
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs an element-by-element blending of dword/qword elements between the first source operand (the second
operand) and the elements of the second source operand (the third operand) using an opmask register as select
control. The blended result is written into the destination.
The destination and first source operands are ZMM registers. The second source operand can be a ZMM register, a
512-bit memory location or a 512-bit vector broadcasted from a 32-bit memory location.
The opmask register is not used as a writemask for this instruction. Instead, the mask is used as an element
selector: every element of the destination is conditionally selected between first source or second source using the
value of the related mask bit (0 for the first source operand, 1 for the second source operand).
If EVEX.z is set, the elements with corresponding mask bit value of 0 in the destination operand are zeroed.
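For the doubleword form, the same selection can be expressed in C with _mm512_mask_blend_epi32 (an illustrative assumption based on the Intel Intrinsics Guide).

#include <immintrin.h>

/* For each of the 16 dword elements, pick b where the mask bit is 1, else a. */
static __m512i blend_dwords(__mmask16 k, __m512i a, __m512i b) {
    return _mm512_mask_blend_epi32(k, a, b);
}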
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
2. EVEX.W in non-64 bit is ignored; the instruction behaves as if the W0 version is used.
VPBROADCASTB/W/D/Q—Load With Broadcast Integer Data From General Purpose Register
Description
Broadcasts an 8-bit, 16-bit, 32-bit, or 64-bit value from a general-purpose register (the second operand) to all the
locations in the destination vector register (the first operand) using the writemask k1.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
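Compilers typically generate VPBROADCASTB from a general-purpose register for the _mm512_set1_epi8 intrinsic when AVX-512BW code generation is enabled; the C sketch below relies on that assumption rather than on a dedicated intrinsic for this instruction.

#include <immintrin.h>

/* Replicate one byte value into all 64 byte lanes of a ZMM register. */
static __m512i splat_byte(char c) {
    return _mm512_set1_epi8(c);
}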
Operation
VPBROADCASTB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := SRC[7:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VPBROADCASTQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := SRC[63:0]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Exceptions
EVEX-encoded instructions, see Table 2-57, “Type E7NM Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VPBROADCAST—Load Integer and Broadcast
VEX.128.66.0F38.W0 78 /r VPBROADCASTB xmm1, xmm2/m8
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX2. Broadcast a byte integer in the source operand to sixteen locations in xmm1.
VEX.256.66.0F38.W0 78 /r VPBROADCASTB ymm1, xmm2/m8
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX2. Broadcast a byte integer in the source operand to thirty-two locations in ymm1.
EVEX.128.66.0F38.W0 78 /r VPBROADCASTB xmm1{k1}{z}, xmm2/m8
  Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512BW) OR AVX10.1 (Note 1). Broadcast a byte integer in the source operand to locations in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W0 78 /r VPBROADCASTB ymm1{k1}{z}, xmm2/m8
  Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512BW) OR AVX10.1 (Note 1). Broadcast a byte integer in the source operand to locations in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W0 78 /r VPBROADCASTB zmm1{k1}{z}, xmm2/m8
  Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512BW OR AVX10.1 (Note 1). Broadcast a byte integer in the source operand to 64 locations in zmm1 subject to writemask k1.
VEX.128.66.0F38.W0 79 /r VPBROADCASTW xmm1, xmm2/m16
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX2. Broadcast a word integer in the source operand to eight locations in xmm1.
VEX.256.66.0F38.W0 79 /r VPBROADCASTW ymm1, xmm2/m16
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX2. Broadcast a word integer in the source operand to sixteen locations in ymm1.
EVEX.128.66.0F38.W0 79 /r VPBROADCASTW xmm1{k1}{z}, xmm2/m16
  Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512BW) OR AVX10.1 (Note 1). Broadcast a word integer in the source operand to locations in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W0 79 /r VPBROADCASTW ymm1{k1}{z}, xmm2/m16
  Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512BW) OR AVX10.1 (Note 1). Broadcast a word integer in the source operand to locations in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W0 79 /r VPBROADCASTW zmm1{k1}{z}, xmm2/m16
  Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512BW OR AVX10.1 (Note 1). Broadcast a word integer in the source operand to 32 locations in zmm1 subject to writemask k1.
VEX.128.66.0F38.W0 58 /r VPBROADCASTD xmm1, xmm2/m32
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX2. Broadcast a dword integer in the source operand to four locations in xmm1.
VEX.256.66.0F38.W0 58 /r VPBROADCASTD ymm1, xmm2/m32
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX2. Broadcast a dword integer in the source operand to eight locations in ymm1.
EVEX.128.66.0F38.W0 58 /r VPBROADCASTD xmm1 {k1}{z}, xmm2/m32
  Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1). Broadcast a dword integer in the source operand to locations in xmm1 subject to writemask k1.
EVEX.256.66.0F38.W0 58 /r VPBROADCASTD ymm1 {k1}{z}, xmm2/m32
  Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1). Broadcast a dword integer in the source operand to locations in ymm1 subject to writemask k1.
EVEX.512.66.0F38.W0 58 /r VPBROADCASTD zmm1 {k1}{z}, xmm2/m32
  Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1). Broadcast a dword integer in the source operand to locations in zmm1 subject to writemask k1.
VEX.128.66.0F38.W0 59 /r VPBROADCASTQ xmm1, xmm2/m64
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX2. Broadcast a qword element in source operand to two locations in xmm1.
VEX.256.66.0F38.W0 59 /r VPBROADCASTQ ymm1, xmm2/m64
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX2. Broadcast a qword element in source operand to four locations in ymm1.
EVEX.128.66.0F38.W1 59 /r VPBROADCASTQ xmm1 {k1}{z}, xmm2/m64
  Op/En: B. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1). Broadcast a qword element in source operand to locations in xmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Load integer data from the source operand (the second operand) and broadcast to all elements of the destination
operand (the first operand).
VEX.256-encoded VPBROADCASTB/W/D/Q: The source operand is an 8-bit, 16-bit, 32-bit, or 64-bit memory location or
the low 8-bit, 16-bit, 32-bit, or 64-bit data in an XMM register. The destination operand is a YMM register. VPBROAD-
CASTI128 supports a source operand of a 128-bit memory location. Register source encodings for VPBROADCAS-
TI128 are reserved and will #UD. Bits (MAXVL-1:256) of the destination register are zeroed.
EVEX-encoded VPBROADCASTD/Q: The source operand is a 32-bit, 64-bit memory location or the low 32-bit, 64-
bit data in an XMM register. The destination operand is a ZMM/YMM/XMM register and updated according to the
writemask k1.
VPBROADCASTI32X4 and VPBROADCASTI64X4: The destination operand is a ZMM register and updated according
to the writemask k1. The source operand is 128-bit or 256-bit memory location. Register source encodings for
VBROADCASTI32X4 and VBROADCASTI64X4 are reserved and will #UD.
Note: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
An attempt to execute VPBROADCASTI128 encoded with VEX.L = 0 will cause an #UD exception.
Other Exceptions
VEX-encoded instructions, see Table 2-23, “Type 6 Class Exception Conditions.”
EVEX-encoded instructions, syntax with reg/mem operand, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If VEX.L = 0 for VPBROADCASTQ, VPBROADCASTI128.
If EVEX.L’L = 0 for VBROADCASTI32X4/VBROADCASTI64X2.
If EVEX.L’L < 10b for VBROADCASTI32X8/VBROADCASTI64X4.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Broadcasts the zero-extended 64/32 bit value of the low byte/word of the source operand (the second operand) to
each 64/32 bit element of the destination operand (the first operand). The source operand is an opmask register.
The destination operand is a ZMM register (EVEX.512), YMM register (EVEX.256), or XMM register (EVEX.128).
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation
VPBROADCASTMB2Q
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j*64
DEST[i+63:i] := ZeroExtend(SRC[7:0])
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instruction, see Table 2-56, “Type E6NF Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed byte values in the second source operand and the first source operand and
returns the results of the comparison to the mask destination operand. The comparison predicate operand (imme-
diate byte) specifies the type of comparison performed on each pair of packed values in the two source operands.
The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false).
VPCMPB performs a comparison between pairs of signed byte values.
VPCMPUB performs a comparison between pairs of unsigned byte values.
The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand (first operand) is a mask
register k1. Up to 64/32/16 comparisons are performed with results written to the destination operand under the
writemask k2.
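A hedged C example: the Intel Intrinsics Guide exposes this instruction as _mm512_cmp_epi8_mask (signed) and _mm512_cmp_epu8_mask (unsigned), with the comparison predicate passed as the immediate; the helper below assumes those mappings.

#include <immintrin.h>

/* Returns a 64-bit mask with bit i set where a.byte[i] <= b.byte[i]
   (signed compare, predicate LE = 2). Requires AVX-512BW. */
static __mmask64 bytes_le(__m512i a, __m512i b) {
    return _mm512_cmp_epi8_mask(a, b, _MM_CMPINT_LE);
}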
Operation
CASE (COMPARISON PREDICATE) OF
0: OP := EQ;
1: OP := LT;
2: OP := LE;
3: OP := FALSE;
4: OP := NEQ;
5: OP := NLT;
6: OP := NLE;
7: OP := TRUE;
ESAC;
Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed integer values in the second source operand and the first source operand
and returns the results of the comparison to the mask destination operand. The comparison predicate operand
(immediate byte) specifies the type of comparison performed on each pair of packed values in the two source oper-
ands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false).
VPCMPD/VPCMPUD performs a comparison between pairs of signed/unsigned doubleword integer values.
The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location or a 512-bit vector broadcasted from a 32-bit
memory location. The destination operand (first operand) is a mask register k1. Up to 16/8/4 comparisons are
performed with results written to the destination operand under the writemask k2.
Operation
CASE (COMPARISON PREDICATE) OF
0: OP := EQ;
1: OP := LT;
2: OP := LE;
3: OP := FALSE;
4: OP := NEQ;
5: OP := NLT;
6: OP := NLE;
7: OP := TRUE;
ESAC;
Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed integer values in the second source operand and the first source operand
and returns the results of the comparison to the mask destination operand. The comparison predicate operand
(immediate byte) specifies the type of comparison performed on each pair of packed values in the two source oper-
ands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false).
VPCMPQ/VPCMPUQ performs a comparison between pairs of signed/unsigned quadword integer values.
The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location or a 512-bit vector broadcasted from a 64-bit
memory location. The destination operand (first operand) is a mask register k1. Up to 8/4/2 comparisons are
performed with results written to the destination operand under the writemask k2.
The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed.
Bits 3 through 7 of the immediate are reserved. The compiler can implement the pseudo-op mnemonics listed in Table 5-19.
Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a SIMD compare of the packed word integers in the second source operand and the first source operand
and returns the results of the comparison to the mask destination operand. The comparison predicate operand
(immediate byte) specifies the type of comparison performed on each pair of packed values in the two source oper-
ands. The result of each comparison is a single mask bit result of 1 (comparison true) or 0 (comparison false).
VPCMPW performs a comparison between pairs of signed word values.
VPCMPUW performs a comparison between pairs of unsigned word values.
The first source operand (second operand) is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand (first operand) is a mask
register k1. Up to 32/16/8 comparisons are performed with results written to the destination operand under the
writemask k2.
The comparison predicate operand is an 8-bit immediate: bits 2:0 define the type of comparison to be performed.
Bits 3 through 7 of the immediate are reserved. The compiler can implement the pseudo-op mnemonics listed in Table 5-19.
Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
VPCOMPRESSB/VPCOMPRESSW—Store Sparse Packed Byte/Word Integer Values Into Dense Memory/Register
Instruction Operand Encoding
Op/En Tuple Operand 1 Operand 2 Operand 3 Operand 4
A Tuple1 Scalar ModRM:r/m (w) ModRM:reg (r) N/A N/A
B N/A ModRM:r/m (w) ModRM:reg (r) N/A N/A
Description
Compress (store) up to 64 byte values or 32 word values from the source operand (second operand) to the desti-
nation operand (first operand), based on the active elements determined by the writemask operand. Note:
EVEX.vvvv is reserved and must be 1111b; otherwise instructions will #UD.
Moves up to 512 bits of packed byte values from the source operand (second operand) to the destination operand
(first operand). This instruction is used to store partial contents of a vector register into a byte vector or a single
memory location, using the active elements selected by the writemask operand.
Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z
must be zero.
Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the
source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper
bits are zeroed.
This instruction supports memory fault suppression.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
Operation
VPCOMPRESSB store form
(KL, VL) = (16, 128), (32, 256), (64, 512)
k := 0
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.byte[k] := SRC.byte[j]
k := k +1
VPCOMPRESSW reg-reg form
(KL, VL) = (8, 128), (16, 256), (32, 512)
k := 0
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.word[k] := SRC.word[j]
k := k + 1
IF *merging-masking*:
*DEST[VL-1:k*16] remains unchanged*
ELSE DEST[VL-1:k*16] := 0
DEST[MAX_VL-1:VL] := 0
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
VPCOMPRESSD—Store Sparse Packed Doubleword Integer Values Into Dense Memory/Register
EVEX.128.66.0F38.W0 8B /r VPCOMPRESSD xmm1/m128 {k1}{z}, xmm2
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1). Compress packed doubleword integer values from xmm2 to xmm1/m128 using control mask k1.
EVEX.256.66.0F38.W0 8B /r VPCOMPRESSD ymm1/m256 {k1}{z}, ymm2
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (Note 1). Compress packed doubleword integer values from ymm2 to ymm1/m256 using control mask k1.
EVEX.512.66.0F38.W0 8B /r VPCOMPRESSD zmm1/m512 {k1}{z}, zmm2
  Op/En: A. 64/32 bit Mode Support: V/V. CPUID Feature Flag: AVX512F OR AVX10.1 (Note 1). Compress packed doubleword integer values from zmm2 to zmm1/m512 using control mask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compress (store) up to 16/8/4 doubleword integer values from the source operand (second operand) to the desti-
nation operand (first operand). The source operand is a ZMM/YMM/XMM register, the destination operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The opmask register k1 selects the active elements (partial vector or possibly non-contiguous if less than 16 active
elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the
destination starting from the low element of the destination operand.
Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z
must be zero.
Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the
source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper
bits are zeroed.
Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
Operation
VPCOMPRESSD (EVEX encoded versions) store form
(KL, VL) = (4, 128), (8, 256), (16, 512)
SIZE := 32
k := 0
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no controlmask*
THEN
DEST[k+SIZE-1:k] := SRC[i+31:i]
k := k + SIZE
FI;
ENDFOR;
Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
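As a usage illustration (not part of the SDM text), the following C sketch uses the corresponding AVX-512F compiler intrinsic, _mm512_mask_compressstoreu_epi32, to left-pack the elements of an array that satisfy a predicate; the function name, buffer names, and the predicate are illustrative assumptions.

#include <immintrin.h>
#include <stddef.h>

/* Copy the non-negative 32-bit elements of in[0..n-1] to out[], packed
   contiguously, and return how many were written. Handling of a tail when
   n is not a multiple of 16 is omitted for brevity. */
size_t compress_nonnegative(const int *in, int *out, size_t n)
{
    size_t written = 0;
    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m512i v = _mm512_loadu_si512(in + i);
        /* k: one bit per dword element that is >= 0 (the active elements). */
        __mmask16 k = _mm512_cmpge_epi32_mask(v, _mm512_setzero_si512());
        /* VPCOMPRESSD, memory destination form: only the active elements are
           stored, as a contiguous vector, starting at out + written. */
        _mm512_mask_compressstoreu_epi32(out + written, k, v);
        written += (size_t)_mm_popcnt_u32((unsigned)k);
    }
    return written;
}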
VPCOMPRESSQ—Store Sparse Packed Quadword Integer Values Into Dense Memory/Register
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W1 8B /r A V/V (AVX512VL AND Compress packed quadword integer values
VPCOMPRESSQ xmm1/m128 {k1}{z}, xmm2 AVX512F) OR from xmm2 to xmm1/m128 using control
AVX10.11 mask k1.
EVEX.256.66.0F38.W1 8B /r A V/V (AVX512VL AND Compress packed quadword integer values
VPCOMPRESSQ ymm1/m256 {k1}{z}, ymm2 AVX512F) OR from ymm2 to ymm1/m256 using control
AVX10.11 mask k1.
EVEX.512.66.0F38.W1 8B /r A V/V AVX512F Compress packed quadword integer values
VPCOMPRESSQ zmm1/m512 {k1}{z}, zmm2 OR AVX10.11 from zmm2 to zmm1/m512 using control
mask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Compress (stores) up to 8/4/2 quadword integer values from the source operand (second operand) to the destina-
tion operand (first operand). The source operand is a ZMM/YMM/XMM register, the destination operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location.
The opmask register k1 selects the active elements (partial vector or possibly non-contiguous if less than 8 active
elements) from the source operand to compress into a contiguous vector. The contiguous vector is written to the
destination starting from the low element of the destination operand.
Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z
must be zero.
Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the
source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper
bits are zeroed.
Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
Operation
VPCOMPRESSQ (EVEX encoded versions) store form
(KL, VL) = (2, 128), (4, 256), (8, 512)
SIZE := 64
k := 0
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no controlmask*
THEN
DEST[k+SIZE-1:k] := SRC[i+63:i]
k := k + SIZE
FI;
ENDFOR
Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
VPCONFLICTD/Q—Detect Conflicts Within a Vector of Packed Dword/Qword Values Into Dense
Memory/ Register
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W0 C4 /r A V/V (AVX512VL AND Detect duplicate double-word values in
VPCONFLICTD xmm1 {k1}{z}, AVX512CD) OR xmm2/m128/m32bcst using writemask k1.
xmm2/m128/m32bcst AVX10.11
EVEX.256.66.0F38.W0 C4 /r A V/V (AVX512VL AND Detect duplicate double-word values in
VPCONFLICTD ymm1 {k1}{z}, AVX512CD) OR ymm2/m256/m32bcst using writemask k1.
ymm2/m256/m32bcst AVX10.11
EVEX.512.66.0F38.W0 C4 /r A V/V AVX512CD Detect duplicate double-word values in
VPCONFLICTD zmm1 {k1}{z}, OR AVX10.11 zmm2/m512/m32bcst using writemask k1.
zmm2/m512/m32bcst
EVEX.128.66.0F38.W1 C4 /r A V/V (AVX512VL AND Detect duplicate quad-word values in
VPCONFLICTQ xmm1 {k1}{z}, AVX512CD) OR xmm2/m128/m64bcst using writemask k1.
xmm2/m128/m64bcst AVX10.11
EVEX.256.66.0F38.W1 C4 /r A V/V (AVX512VL AND Detect duplicate quad-word values in
VPCONFLICTQ ymm1 {k1}{z}, AVX512CD) OR ymm2/m256/m64bcst using writemask k1.
ymm2/m256/m64bcst AVX10.11
EVEX.512.66.0F38.W1 C4 /r A V/V AVX512CD Detect duplicate quad-word values in
VPCONFLICTQ zmm1 {k1}{z}, OR AVX10.11 zmm2/m512/m64bcst using writemask k1.
zmm2/m512/m64bcst
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Test each dword/qword element of the source operand (the second operand) for equality with all other elements in
the source operand closer to the least significant element. Each element’s comparison results form a bit vector,
which is then zero extended and written to the destination according to the writemask.
EVEX.512 encoded version: The source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector
broadcasted from a 32/64-bit memory location. The destination operand is a ZMM register, conditionally updated
using writemask k1.
EVEX.256 encoded version: The source operand is a YMM register, a 256-bit memory location, or a 256-bit vector
broadcasted from a 32/64-bit memory location. The destination operand is a YMM register, conditionally updated
using writemask k1.
EVEX.128 encoded version: The source operand is a XMM register, a 128-bit memory location, or a 128-bit vector
broadcasted from a 32/64-bit memory location. The destination operand is a XMM register, conditionally updated
using writemask k1.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation
VPCONFLICTD
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j*32
IF MaskBit(j) OR *no writemask* THEN
FOR k := 0 TO j-1
m := k*32
IF ((SRC[i+31:i] = SRC[m+31:m])) THEN
DEST[i+k] := 1
ELSE
DEST[i+k] := 0
FI
ENDFOR
DEST[i+31:i+j] := 0
ELSE
IF *merging-masking* THEN
*DEST[i+31:i] remains unchanged*
ELSE
DEST[i+31:i] := 0
FI
FI
ENDFOR
DEST[MAXVL-1:VL] := 0
VPCONFLICTQ
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j*64
IF MaskBit(j) OR *no writemask* THEN
FOR k := 0 TO j-1
m := k*64
IF ((SRC[i+63:i] = SRC[m+63:m])) THEN
DEST[i+k] := 1
ELSE
DEST[i+k] := 0
FI
ENDFOR
DEST[i+63:i+j] := 0
ELSE
IF *merging-masking* THEN
*DEST[i+63:i] remains unchanged*
ELSE
DEST[i+63:i] := 0
FI
FI
ENDFOR
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
Other Exceptions
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”
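For illustration only (this is not the SDM's intrinsic listing; the function name is an assumption and AVX-512CD support is required), the sketch below uses the _mm512_conflict_epi32 intrinsic to test whether a vector of 16 scatter indices contains intra-vector duplicates, which is the typical use of the per-element conflict bit vectors computed above.

#include <immintrin.h>
#include <stdbool.h>

/* Return true if any of the 16 dword indices in idx duplicates an index at a
   lower element position, i.e., a scatter through these indices would have
   intra-vector conflicts. */
bool has_conflicts(__m512i idx)
{
    /* VPCONFLICTD: each result element is a bit vector of equality matches
       against all less-significant elements of idx. */
    __m512i conflicts = _mm512_conflict_epi32(idx);
    /* Any non-zero element means at least one duplicate was found. */
    return _mm512_cmpneq_epi32_mask(conflicts, _mm512_setzero_si512()) != 0;
}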
VPDPBUSD—Multiply and Add Unsigned and Signed Bytes
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
VEX.128.66.0F38.W0 50 /r A V/V AVX-VNNI Multiply groups of 4 pairs of signed bytes in
VPDPBUSD xmm1, xmm2, xmm3/m128 with corresponding unsigned bytes of
xmm3/m128 xmm2, summing those products and adding them
to doubleword result in xmm1.
VEX.256.66.0F38.W0 50 /r A V/V AVX-VNNI Multiply groups of 4 pairs of signed bytes in
VPDPBUSD ymm1, ymm2, ymm3/m256 with corresponding unsigned bytes of
ymm3/m256 ymm2, summing those products and adding them
to doubleword result in ymm1.
EVEX.128.66.0F38.W0 50 /r B V/V (AVX512_VNNI Multiply groups of 4 pairs of signed bytes in
VPDPBUSD xmm1{k1}{z}, xmm2, AND AVX512VL) xmm3/m128/m32bcst with corresponding
xmm3/m128/m32bcst OR AVX10.11 unsigned bytes of xmm2, summing those products
and adding them to doubleword result in xmm1
under writemask k1.
EVEX.256.66.0F38.W0 50 /r B V/V (AVX512_VNNI Multiply groups of 4 pairs of signed bytes in
VPDPBUSD ymm1{k1}{z}, ymm2, AND AVX512VL) ymm3/m256/m32bcst with corresponding
ymm3/m256/m32bcst OR AVX10.11 unsigned bytes of ymm2, summing those products
and adding them to doubleword result in ymm1
under writemask k1.
EVEX.512.66.0F38.W0 50 /r B V/V AVX512_VNNI Multiply groups of 4 pairs of signed bytes in
VPDPBUSD zmm1{k1}{z}, zmm2, OR AVX10.11 zmm3/m512/m32bcst with corresponding
zmm3/m512/m32bcst unsigned bytes of zmm2, summing those products
and adding them to doubleword result in zmm1
under writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second
source operand, producing intermediate signed word results. The word results are then summed and accumulated
in the destination dword element size operand.
This instruction supports memory fault suppression.
ORIGDEST := DEST
FOR i := 0 TO KL-1:
// Extending to 16b
// src1extend := ZERO_EXTEND
// src2extend := SIGN_EXTEND
p1word := src1extend(SRC1.byte[4*i+0]) * src2extend(SRC2.byte[4*i+0])
p2word := src1extend(SRC1.byte[4*i+1]) * src2extend(SRC2.byte[4*i+1])
p3word := src1extend(SRC1.byte[4*i+2]) * src2extend(SRC2.byte[4*i+2])
p4word := src1extend(SRC1.byte[4*i+3]) * src2extend(SRC2.byte[4*i+3])
DEST.dword[i] := ORIGDEST.dword[i] + p1word + p2word + p3word + p4word
DEST[MAX_VL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
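As a hedged usage sketch (assuming AVX512_VNNI support; the buffer and function names are illustrative), the multiply-accumulate described above is exposed through the _mm512_dpbusd_epi32 intrinsic; each destination dword accumulates four unsigned-by-signed byte products.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Accumulate the dot product of n bytes (n a multiple of 64) of unsigned data
   u[] with signed weights w[] into 16 dword partial sums. The caller reduces
   the returned vector, e.g., with _mm512_reduce_add_epi32. */
__m512i dot_u8_s8(const uint8_t *u, const int8_t *w, size_t n)
{
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 64) {
        __m512i a = _mm512_loadu_si512(u + i);   /* unsigned bytes */
        __m512i b = _mm512_loadu_si512(w + i);   /* signed bytes   */
        /* VPDPBUSD: four u8*s8 word products per dword lane, summed into acc. */
        acc = _mm512_dpbusd_epi32(acc, a, b);
    }
    return acc;
}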
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second
source operand, producing intermediate signed word results. The word results are then summed and accumulated
in the destination dword element size operand. If the intermediate sum overflows a 32b signed number, the result
is saturated to either 0x7FFF_FFFF for positive numbers or 0x8000_0000 for negative numbers.
This instruction supports memory fault suppression.
Operation
VPDPBUSDS dest, src1, src2 (VEX encoded versions)
VL=(128, 256)
KL=VL/32
ORIGDEST := DEST
FOR i := 0 TO KL-1:
// Extending to 16b
// src1extend := ZERO_EXTEND
// src2extend := SIGN_EXTEND
p1word := src1extend(SRC1.byte[4*i+0]) * src2extend(SRC2.byte[4*i+0])
p2word := src1extend(SRC1.byte[4*i+1]) * src2extend(SRC2.byte[4*i+1])
p3word := src1extend(SRC1.byte[4*i+2]) * src2extend(SRC2.byte[4*i+2])
p4word := src1extend(SRC1.byte[4*i+3]) * src2extend(SRC2.byte[4*i+3])
DEST.dword[i] := SIGNED_DWORD_SATURATE(ORIGDEST.dword[i] + p1word + p2word + p3word + p4word)
DEST[MAX_VL-1:VL] := 0
SIMD Floating-Point Exceptions
None.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
VPDPWSSD—Multiply and Add Signed Word Integers
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
VEX.128.66.0F38.W0 52 /r A V/V AVX-VNNI Multiply groups of 2 pairs signed words in
VPDPWSSD xmm1, xmm2, xmm3/m128 with corresponding signed words
xmm3/m128 of xmm2, summing those products and adding
them to doubleword result in xmm1.
VEX.256.66.0F38.W0 52 /r A V/V AVX-VNNI Multiply groups of 2 pairs signed words in
VPDPWSSD ymm1, ymm2, ymm3/m256 with corresponding signed words
ymm3/m256 of ymm2, summing those products and adding
them to doubleword result in ymm1.
EVEX.128.66.0F38.W0 52 /r B V/V (AVX512_VNNI Multiply groups of 2 pairs signed words in
VPDPWSSD xmm1{k1}{z}, xmm2, AND AVX512VL) xmm3/m128/m32bcst with corresponding
xmm3/m128/m32bcst OR AVX10.11 signed words of xmm2, summing those
products and adding them to doubleword result
in xmm1, under writemask k1.
EVEX.256.66.0F38.W0 52 /r B V/V (AVX512_VNNI Multiply groups of 2 pairs signed words in
VPDPWSSD ymm1{k1}{z}, ymm2, AND AVX512VL) ymm3/m256/m32bcst with corresponding
ymm3/m256/m32bcst OR AVX10.11 signed words of ymm2, summing those
products and adding them to doubleword result
in ymm1, under writemask k1.
EVEX.512.66.0F38.W0 52 /r B V/V AVX512_VNNI Multiply groups of 2 pairs signed words in
VPDPWSSD zmm1{k1}{z}, zmm2, OR AVX10.11 zmm3/m512/m32bcst with corresponding
zmm3/m512/m32bcst signed words of zmm2, summing those
products and adding them to doubleword result
in zmm1, under writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Multiplies the individual signed words of the first source operand by the corresponding signed words of the second
source operand, producing intermediate signed, doubleword results. The adjacent doubleword results are then
summed and accumulated in the destination operand.
This instruction supports memory fault suppression.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Multiplies the individual signed words of the first source operand by the corresponding signed words of the second
source operand, producing intermediate signed, doubleword results. The adjacent doubleword results are then
summed and accumulated in the destination operand. If the intermediate sum overflows a 32b signed number, the
result is saturated to either 0x7FFF_FFFF for positive numbers or 0x8000_0000 for negative numbers.
This instruction supports memory fault suppression.
Operation
VPDPWSSDS dest, src1, src2 (VEX encoded versions)
VL=(128, 256)
KL=VL/32
ORIGDEST := DEST
FOR i := 0 TO KL-1:
p1dword := SIGN_EXTEND(SRC1.word[2*i+0]) * SIGN_EXTEND(SRC2.word[2*i+0])
p2dword := SIGN_EXTEND(SRC1.word[2*i+1]) * SIGN_EXTEND(SRC2.word[2*i+1])
DEST.dword[i] := SIGNED_DWORD_SATURATE(ORIGDEST.dword[i] + p1dword + p2dword)
DEST[MAX_VL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
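To make the saturation arithmetic above concrete, the following is a small scalar C reference model of one VPDPWSSDS dword lane (illustrative only, not the SDM's own pseudocode; names are assumptions).

#include <stdint.h>

/* Clamp a 64-bit intermediate sum to the signed 32-bit range, mirroring the
   SIGNED_DWORD_SATURATE step in the operation above. */
int32_t signed_dword_saturate(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;  /* 0x7FFF_FFFF */
    if (x < INT32_MIN) return INT32_MIN;  /* 0x8000_0000 */
    return (int32_t)x;
}

/* One destination dword lane: acc + s1w0*s2w0 + s1w1*s2w1, saturated. */
int32_t dpwssds_lane(int32_t acc, int16_t s1w0, int16_t s1w1,
                     int16_t s2w0, int16_t s2w1)
{
    int64_t sum = (int64_t)acc + (int32_t)s1w0 * s2w0 + (int32_t)s1w1 * s2w1;
    return signed_dword_saturate(sum);
}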
VPERMB—Permute Packed Bytes Elements
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W0 8D /r A V/V (AVX512VL AND Permute bytes in xmm3/m128 using byte indexes
VPERMB xmm1 {k1}{z}, xmm2, AVX512_VBMI) in xmm2 and store the result in xmm1 using
xmm3/m128 OR AVX10.11 writemask k1.
EVEX.256.66.0F38.W0 8D /r A V/V (AVX512VL AND Permute bytes in ymm3/m256 using byte indexes
VPERMB ymm1 {k1}{z}, ymm2, AVX512_VBMI) in ymm2 and store the result in ymm1 using
ymm3/m256 OR AVX10.11 writemask k1.
EVEX.512.66.0F38.W0 8D /r A V/V AVX512_VBMI Permute bytes in zmm3/m512 using byte indexes
VPERMB zmm1 {k1}{z}, zmm2, OR AVX10.11 in zmm2 and store the result in zmm1 using
zmm3/m512 writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Copies bytes from the second source operand (the third operand) to the destination operand (the first operand)
according to the byte indices in the first source operand (the second operand). Note that this instruction permits a
byte in the source operand to be copied to more than one location in the destination operand.
Only the low 6 (EVEX.512)/5 (EVEX.256)/4 (EVEX.128) bits of each byte index are used to select the location of the
source byte from the second source operand.
The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM reg-
ister or a 512/256/128-bit memory location. The destination operand is a ZMM/YMM/XMM register updated at byte
granularity by the writemask k1.
Operation
VPERMB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
IF VL = 128:
n := 3;
ELSE IF VL = 256:
n := 4;
ELSE IF VL = 512:
n := 5;
FI;
FOR j := 0 TO KL-1:
id := SRC1[j*8 + n : j*8] ; // location of the source byte
IF k1[j] OR *no writemask* THEN
DEST[j*8 + 7: j*8] := SRC2[id*8 +7: id*8];
ELSE IF zeroing-masking THEN
DEST[j*8 + 7: j*8] := 0;
*ELSE
DEST[j*8 + 7: j*8] remains unchanged*
FI
ENDFOR
DEST[MAX_VL-1:VL] := 0
Other Exceptions
See Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
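A minimal C sketch of the byte permute (assuming AVX512_VBMI support; the byte-reversal index pattern and function name are illustrative choices) using the _mm512_permutexvar_epi8 intrinsic:

#include <immintrin.h>
#include <stdint.h>

/* Reverse the 64 bytes of v: index element j selects source byte 63-j.
   Only the low 6 bits of each index are used for a 512-bit operation. */
__m512i reverse_bytes_512(__m512i v)
{
    uint8_t idx[64];
    for (int j = 0; j < 64; j++)
        idx[j] = (uint8_t)(63 - j);
    __m512i indices = _mm512_loadu_si512(idx);
    return _mm512_permutexvar_epi8(indices, v);   /* VPERMB */
}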
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Copies doublewords (or words) from the second source operand (the third operand) to the destination operand
(the first operand) according to the indices in the first source operand (the second operand). Note that this instruc-
tion permits a doubleword (word) in the source operand to be copied to more than one location in the destination
operand.
VEX.256 encoded VPERMD: The first and second operands are YMM registers, the third operand can be a YMM
register or memory location. Bits (MAXVL-1:256) of the corresponding destination register are zeroed.
EVEX encoded VPERMD: The first and second operands are ZMM/YMM registers, the third operand can be a
ZMM/YMM register, a 512/256-bit memory location or a 512/256-bit vector broadcasted from a 32-bit memory
location. The elements in the destination are updated using the writemask k1.
VPERMW: first and second operands are ZMM/YMM/XMM registers, the third operand can be a ZMM/YMM/XMM
register, or a 512/256/128-bit memory location. The destination is updated using the writemask k1.
EVEX.128 encoded versions: Bits (MAXVL-1:128) of the corresponding ZMM register are zeroed.
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPERMD, see Table 2-52, “Type E4NF Class Exception Conditions.”
EVEX-encoded VPERMW, see Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If VEX.L = 0.
If EVEX.L’L = 0 for VPERMD.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Permutes byte values in the second operand (the first source operand) and the third operand (the second source
operand) using the byte indices in the first operand (the destination operand) to select byte elements from the
second or third source operands. The selected byte elements are written to the destination at byte granularity
under the writemask k1.
The first and second operands are ZMM/YMM/XMM registers. The first operand contains input indices to select
elements from the two input tables in the 2nd and 3rd operands. The first operand is also the destination of the
result. The third operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. In each index
byte, the id bit for table selection is bit 6/5/4, and bits [5:0]/[4:0]/[3:0] selects element within each input table.
Note that these instructions permit a byte value in the source operands to be copied to more than one location in
the destination operand. Also, the same tables can be reused in subsequent iterations, but the index elements are
overwritten.
Bits (MAX_VL-1:256/128) of the destination are zeroed for VL=256,128.
Operation
VPERMI2B (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
IF VL = 128:
id := 3;
ELSE IF VL = 256:
id := 4;
ELSE IF VL = 512:
id := 5;
FI;
TMP_DEST[VL-1:0] := DEST[VL-1:0];
FOR j := 0 TO KL-1
off := 8*TMP_DEST[j*8 + id: j*8] ;
IF k1[j] OR *no writemask*:
DEST[j*8 + 7: j*8] := TMP_DEST[j*8+id+1]? SRC2[off+7:off] : SRC1[off+7:off];
ELSE IF *zeroing-masking*
DEST[j*8 + 7: j*8] := 0;
*ELSE
DEST[j*8 + 7: j*8] remains unchanged*
FI;
ENDFOR
DEST[MAX_VL-1:VL] := 0;
Other Exceptions
See Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
VPERMI2W/D/Q/PS/PD—Full Permute From Two Tables Overwriting the Index
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W1 75 /r A V/V (AVX512VL AND Permute word integers from two tables in
VPERMI2W xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 and xmm2 using indexes in xmm1
xmm3/m128 AVX10.11 and store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 75 /r A V/V (AVX512VL AND Permute word integers from two tables in
VPERMI2W ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm3/m256 and ymm2 using indexes in ymm1
ymm3/m256 AVX10.11 and store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 75 /r A V/V AVX512BW Permute word integers from two tables in
VPERMI2W zmm1 {k1}{z}, zmm2, OR AVX10.11 zmm3/m512 and zmm2 using indexes in zmm1
zmm3/m512 and store the result in zmm1 using writemask k1.
EVEX.128.66.0F38.W0 76 /r B V/V (AVX512VL AND Permute double-words from two tables in
VPERMI2D xmm1 {k1}{z}, xmm2, AVX512F) OR xmm3/m128/m32bcst and xmm2 using indexes in
xmm3/m128/m32bcst AVX10.11 xmm1 and store the result in xmm1 using
writemask k1.
EVEX.256.66.0F38.W0 76 /r B V/V (AVX512VL AND Permute double-words from two tables in
VPERMI2D ymm1 {k1}{z}, ymm2, AVX512F) OR ymm3/m256/m32bcst and ymm2 using indexes in
ymm3/m256/m32bcst AVX10.11 ymm1 and store the result in ymm1 using
writemask k1.
EVEX.512.66.0F38.W0 76 /r B V/V AVX512F Permute double-words from two tables in
VPERMI2D zmm1 {k1}{z}, zmm2, OR AVX10.11 zmm3/m512/m32bcst and zmm2 using indices in
zmm3/m512/m32bcst zmm1 and store the result in zmm1 using
writemask k1.
EVEX.128.66.0F38.W1 76 /r B V/V (AVX512VL AND Permute quad-words from two tables in
VPERMI2Q xmm1 {k1}{z}, xmm2, AVX512F) OR xmm3/m128/m64bcst and xmm2 using indexes in
xmm3/m128/m64bcst AVX10.11 xmm1 and store the result in xmm1 using
writemask k1.
EVEX.256.66.0F38.W1 76 /r B V/V (AVX512VL AND Permute quad-words from two tables in
VPERMI2Q ymm1 {k1}{z}, ymm2, AVX512F) OR ymm3/m256/m64bcst and ymm2 using indexes in
ymm3/m256/m64bcst AVX10.11 ymm1 and store the result in ymm1 using
writemask k1.
EVEX.512.66.0F38.W1 76 /r B V/V AVX512F Permute quad-words from two tables in
VPERMI2Q zmm1 {k1}{z}, zmm2, OR AVX10.11 zmm3/m512/m64bcst and zmm2 using indices in
zmm3/m512/m64bcst zmm1 and store the result in zmm1 using
writemask k1.
EVEX.128.66.0F38.W0 77 /r B V/V (AVX512VL AND Permute single-precision floating-point values
VPERMI2PS xmm1 {k1}{z}, xmm2, AVX512F) OR from two tables in xmm3/m128/m32bcst and
xmm3/m128/m32bcst AVX10.11 xmm2 using indexes in xmm1 and store the result
in xmm1 using writemask k1.
EVEX.256.66.0F38.W0 77 /r B V/V (AVX512VL AND Permute single-precision floating-point values
VPERMI2PS ymm1 {k1}{z}, ymm2, AVX512F) OR from two tables in ymm3/m256/m32bcst and
ymm3/m256/m32bcst AVX10.11 ymm2 using indexes in ymm1 and store the result
in ymm1 using writemask k1.
EVEX.512.66.0F38.W0 77 /r B V/V AVX512F Permute single-precision floating-point values
VPERMI2PS zmm1 {k1}{z}, zmm2, OR AVX10.11 from two tables in zmm3/m512/m32bcst and
zmm3/m512/m32bcst zmm2 using indices in zmm1 and store the result
in zmm1 using writemask k1.
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W1 77 /r B V/V (AVX512VL AND Permute double precision floating-point values
VPERMI2PD xmm1 {k1}{z}, xmm2, AVX512F) OR from two tables in xmm3/m128/m64bcst and
xmm3/m128/m64bcst AVX10.11 xmm2 using indexes in xmm1 and store the result
in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 77 /r B V/V (AVX512VL AND Permute double precision floating-point values
VPERMI2PD ymm1 {k1}{z}, ymm2, AVX512F) OR from two tables in ymm3/m256/m64bcst and
ymm3/m256/m64bcst AVX10.11 ymm2 using indexes in ymm1 and store the result
in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 77 /r B V/V AVX512F Permute double precision floating-point values
VPERMI2PD zmm1 {k1}{z}, zmm2, OR AVX10.11 from two tables in zmm3/m512/m64bcst and
zmm3/m512/m64bcst zmm2 using indices in zmm1 and store the result
in zmm1 using writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Permutes 16-bit/32-bit/64-bit values in the second operand (the first source operand) and the third operand (the
second source operand) using indices in the first operand to select elements from the second and third operands.
The selected elements are written to the destination operand (the first operand) according to the writemask k1.
The first and second operands are ZMM/YMM/XMM registers. The first operand contains input indices to select
elements from the two input tables in the 2nd and 3rd operands. The first operand is also the destination of the
result.
D/Q/PS/PD element versions: The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit
memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. Broadcast from the
low 32/64-bit memory location is performed if EVEX.b and the id bit for table selection are set (selecting table_2).
Dword/PS versions: The id bit for table selection is bit 4/3/2, depending on VL=512, 256, 128. Bits
[3:0]/[2:0]/[1:0] of each element in the input index vector select an element within the two source operands. If
the id bit is 0, table_1 (the first source) is selected; otherwise the second source operand is selected.
Qword/PD versions: The id bit for table selection is bit 3/2/1, and bits [2:0]/[1:0] /bit 0 selects element within each
input table.
Word element versions: The second source operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit
memory location. The id bit for table selection is bit 5/4/3, and bits [4:0]/[3:0]/[2:0] selects element within each
input table.
Note that these instructions permit a 16-bit/32-bit/64-bit value in the source operands to be copied to more than
one location in the destination operand. Note also that in this case, the same table can be reused for example for a
second iteration, while the index elements are overwritten.
Bits (MAXVL-1:256/128) of the destination are zeroed for VL=256,128.
Operation
VPERMI2W (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
id := 2
FI;
IF VL = 256
id := 3
FI;
IF VL = 512
id := 4
FI;
TMP_DEST := DEST
FOR j := 0 TO KL-1
i := j * 16
off := 16*TMP_DEST[i+id:i]
IF k1[j] OR *no writemask*
THEN
DEST[i+15:i]=TMP_DEST[i+id+1] ? SRC2[off+15:off]
: SRC1[off+15:off]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VPERMI2W __m512i _mm512_permutex2var_epi16(__m512i a, __m512i idx, __m512i b);
VPERMI2W __m512i _mm512_mask_permutex2var_epi16(__m512i a, __mmask32 k, __m512i idx, __m512i b);
VPERMI2W __m512i _mm512_mask2_permutex2var_epi16(__m512i a, __m512i idx, __mmask32 k, __m512i b);
VPERMI2W __m512i _mm512_maskz_permutex2var_epi16(__mmask32 k, __m512i a, __m512i idx, __m512i b);
VPERMI2W __m256i _mm256_permutex2var_epi16(__m256i a, __m256i idx, __m256i b);
VPERMI2W __m256i _mm256_mask_permutex2var_epi16(__m256i a, __mmask16 k, __m256i idx, __m256i b);
VPERMI2W __m256i _mm256_mask2_permutex2var_epi16(__m256i a, __m256i idx, __mmask16 k, __m256i b);
VPERMI2W __m256i _mm256_maskz_permutex2var_epi16(__mmask16 k, __m256i a, __m256i idx, __m256i b);
VPERMI2W __m128i _mm_permutex2var_epi16(__m128i a, __m128i idx, __m128i b);
VPERMI2W __m128i _mm_mask_permutex2var_epi16(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMI2W __m128i _mm_mask2_permutex2var_epi16(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMI2W __m128i _mm_maskz_permutex2var_epi16(__mmask8 k, __m128i a, __m128i idx, __m128i b);
Other Exceptions
VPERMI2D/Q/PS/PD: See Table 2-52, “Type E4NF Class Exception Conditions.”
VPERMI2W: See Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
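Building on the first intrinsic form listed above, the following sketch (assuming AVX512BW support; the interleave pattern and names are illustrative) shows how the table-select bit of each index, bit 5 for 512-bit operation, chooses between the two word tables a and b.

#include <immintrin.h>
#include <stdint.h>

/* Build {a.word[0], b.word[0], a.word[1], b.word[1], ...}: indices 0..31
   select a word from a, indices 32..63 (table-select bit 5 set) select a
   word from b. */
__m512i interleave_low_words(__m512i a, __m512i b)
{
    uint16_t idx[32];
    for (int j = 0; j < 32; j++)
        idx[j] = (uint16_t)((j & 1) ? 32 + j / 2 : j / 2);
    __m512i indices = _mm512_loadu_si512(idx);
    /* Compiles to VPERMI2W or VPERMT2W depending on register allocation. */
    return _mm512_permutex2var_epi16(a, indices, b);
}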
VPERMI2W/D/Q/PS/PD—Full Permute From Two Tables Overwriting the Index Vol. 2C 5-493
VPERMILPD—Permute In-Lane of Pairs of Double Precision Floating-Point Values
Opcode/ Op / En 64/32 CPUID Feature Description
Instruction bit Mode Flag
Support
VEX.128.66.0F38.W0 0D /r A V/V AVX Permute double precision floating-point values
VPERMILPD xmm1, xmm2, in xmm2 using controls from xmm3/m128 and
xmm3/m128 store result in xmm1.
VEX.256.66.0F38.W0 0D /r A V/V AVX Permute double precision floating-point values
VPERMILPD ymm1, ymm2, in ymm2 using controls from ymm3/m256 and
ymm3/m256 store result in ymm1.
EVEX.128.66.0F38.W1 0D /r C V/V (AVX512VL AND Permute double precision floating-point values
VPERMILPD xmm1 {k1}{z}, xmm2, AVX512F) OR in xmm2 using control from
xmm3/m128/m64bcst AVX10.11 xmm3/m128/m64bcst and store the result in
xmm1 using writemask k1.
EVEX.256.66.0F38.W1 0D /r C V/V (AVX512VL AND Permute double precision floating-point values
VPERMILPD ymm1 {k1}{z}, ymm2, AVX512F) OR in ymm2 using control from
ymm3/m256/m64bcst AVX10.11 ymm3/m256/m64bcst and store the result in
ymm1 using writemask k1.
EVEX.512.66.0F38.W1 0D /r C V/V AVX512F Permute double precision floating-point values
VPERMILPD zmm1 {k1}{z}, zmm2, OR AVX10.11 in zmm2 using control from
zmm3/m512/m64bcst zmm3/m512/m64bcst and store the result in
zmm1 using writemask k1.
VEX.128.66.0F3A.W0 05 /r ib B V/V AVX Permute double precision floating-point values
VPERMILPD xmm1, xmm2/m128, in xmm2/m128 using controls from imm8.
imm8
VEX.256.66.0F3A.W0 05 /r ib B V/V AVX Permute double precision floating-point values
VPERMILPD ymm1, ymm2/m256, in ymm2/m256 using controls from imm8.
imm8
EVEX.128.66.0F3A.W1 05 /r ib D V/V (AVX512VL AND Permute double precision floating-point values
VPERMILPD xmm1 {k1}{z}, AVX512F) OR in xmm2/m128/m64bcst using controls from
xmm2/m128/m64bcst, imm8 AVX10.11 imm8 and store the result in xmm1 using
writemask k1.
EVEX.256.66.0F3A.W1 05 /r ib D V/V (AVX512VL AND Permute double precision floating-point values
VPERMILPD ymm1 {k1}{z}, AVX512F) OR in ymm2/m256/m64bcst using controls from
ymm2/m256/m64bcst, imm8 AVX10.11 imm8 and store the result in ymm1 using
writemask k1.
EVEX.512.66.0F3A.W1 05 /r ib D V/V AVX512F Permute double precision floating-point values
VPERMILPD zmm1 {k1}{z}, OR AVX10.11 in zmm2/m512/m64bcst using controls from
zmm2/m512/m64bcst, imm8 imm8 and store the result in zmm1 using
writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
(Variable control version)
Permute pairs of double precision floating-point values in the first source operand (second operand), each using a
1-bit control field residing in the corresponding quadword element of the second source operand (third operand).
Permuted results are stored in the destination operand (first operand).
The control bits are located at bit 1 of each quadword element (see Figure 5-24). Each control determines which of
the two source elements in an input pair is selected for the destination element. Each pair of source elements must lie in
the same 128-bit region as the destination.
EVEX version: The second source operand (third operand) is a ZMM/YMM/XMM register, a 512/256/128-bit
memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. Permuted results are
written to the destination under the writemask.
VEX.256 encoded version: Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.
[Figure 5-24 and the shuffle-control diagram are not reproduced here: SRC1 holds the data pairs X3..X0; the variable shuffle control uses bit 1 of each quadword control element (bits 1, 65, 129, and 193 in the 256-bit form), and all other control bits are ignored.]
Immediate control version: Permute pairs of double precision floating-point values in the first source operand
(second operand), each pair using a 1-bit control field in the imm8 byte. Each element in the destination operand
(first operand) uses a separate control bit of the imm8 byte.
Operation
VPERMILPD (EVEX immediate versions)
(KL, VL) = (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1) AND (SRC1 *is memory*)
THEN TMP_SRC1[i+63:i] := SRC1[63:0];
ELSE TMP_SRC1[i+63:i] := SRC1[i+63:i];
FI;
ENDFOR;
IF (imm8[0] = 0) THEN TMP_DEST[63:0] := SRC1[63:0]; FI;
IF (imm8[0] = 1) THEN TMP_DEST[63:0] := TMP_SRC1[127:64]; FI;
IF (imm8[1] = 0) THEN TMP_DEST[127:64] := TMP_SRC1[63:0]; FI;
IF (imm8[1] = 1) THEN TMP_DEST[127:64] := TMP_SRC1[127:64]; FI;
IF VL >= 256
IF (imm8[2] = 0) THEN TMP_DEST[191:128] := TMP_SRC1[191:128]; FI;
IF (imm8[2] = 1) THEN TMP_DEST[191:128] := TMP_SRC1[255:192]; FI;
IF (imm8[3] = 0) THEN TMP_DEST[255:192] := TMP_SRC1[191:128]; FI;
IF (imm8[3] = 1) THEN TMP_DEST[255:192] := TMP_SRC1[255:192]; FI;
FI;
IF VL >= 512
IF (imm8[4] = 0) THEN TMP_DEST[319:256] := TMP_SRC1[319:256]; FI;
IF (imm8[4] = 1) THEN TMP_DEST[319:256] := TMP_SRC1[383:320]; FI;
IF (imm8[5] = 0) THEN TMP_DEST[383:320] := TMP_SRC1[319:256]; FI;
IF (imm8[5] = 1) THEN TMP_DEST[383:320] := TMP_SRC1[383:320]; FI;
IF (imm8[6] = 0) THEN TMP_DEST[447:384] := TMP_SRC1[447:384]; FI;
IF (imm8[6] = 1) THEN TMP_DEST[447:384] := TMP_SRC1[511:448]; FI;
IF (imm8[7] = 0) THEN TMP_DEST[511:448] := TMP_SRC1[447:384]; FI;
IF (imm8[7] = 1) THEN TMP_DEST[511:448] := TMP_SRC1[511:448]; FI;
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
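As an illustrative intrinsic counterpart of the immediate form above (assuming AVX support; the swap pattern and function name are examples), each imm8 bit selects the low or high element of its 128-bit lane:

#include <immintrin.h>

/* Swap the two doubles inside each 128-bit lane: imm8 = 0b0101 makes
   destination elements 0 and 2 take the high source element of their lane
   and elements 1 and 3 take the low one (VPERMILPD, imm8 form). */
__m256d swap_pairs(__m256d v)
{
    return _mm256_permute_pd(v, 0x5);
}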
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
[Shuffle-control diagram not reproduced here: the source data elements X7..X0 are each controlled by bits [1:0] of the corresponding doubleword control element (bits [1:0], [33:32], ..., [225:224]); all other control bits are ignored.]
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
Additionally:
#UD If VEX.W = 1.
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If either (E)VEX.vvvv != 1111B and with imm8.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
The imm8 version: Copies quadword elements of double precision floating-point values from the source operand
(the second operand) to the destination operand (the first operand) according to the indices specified by the imme-
diate operand (the third operand). Each two-bit value in the immediate byte selects a qword element in the source
operand.
VEX version: The source operand can be a YMM register or a memory location. Bits (MAXVL-1:256) of the corre-
sponding destination register are zeroed.
In the EVEX.512 encoded version, the elements in the destination are updated using the writemask k1, and the imm8
bits are reused as control bits for the upper 256-bit half when the control comes from the immediate. The
source operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 64-bit
memory location.
The imm8 versions: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will #UD.
The vector control version: Copies quadword elements of double precision floating-point values from the second
source operand (the third operand) to the destination operand (the first operand) according to the indices in the
first source operand (the second operand). The first 3 bits of each 64-bit element in the index operand select which
quadword in the second source operand to copy. The first and second operands are ZMM registers, the third
operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 64-bit memory
location. The elements in the destination are updated using the writemask k1.
Operation
VPERMPD (EVEX - imm8 control forms)
(KL, VL) = (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN TMP_SRC[i+63:i] := SRC[63:0];
ELSE TMP_SRC[i+63:i] := SRC[i+63:i];
FI;
ENDFOR;
IF VL = 256
TMP_DEST[63:0] := (TMP_SRC2[255:0] >> (SRC1[1:0] * 64))[63:0];
TMP_DEST[127:64] := (TMP_SRC2[255:0] >> (SRC1[65:64] * 64))[63:0];
TMP_DEST[191:128] := (TMP_SRC2[255:0] >> (SRC1[129:128] * 64))[63:0];
TMP_DEST[255:192] := (TMP_SRC2[255:0] >> (SRC1[193:192] * 64))[63:0];
FI;
IF VL = 512
TMP_DEST[63:0] := (TMP_SRC2[511:0] >> (SRC1[2:0] * 64))[63:0];
TMP_DEST[127:64] := (TMP_SRC2[511:0] >> (SRC1[66:64] * 64))[63:0];
TMP_DEST[191:128] := (TMP_SRC2[511:0] >> (SRC1[130:128] * 64))[63:0];
TMP_DEST[255:192] := (TMP_SRC2[511:0] >> (SRC1[194:192] * 64))[63:0];
TMP_DEST[319:256] := (TMP_SRC2[511:0] >> (SRC1[258:256] * 64))[63:0];
TMP_DEST[383:320] := (TMP_SRC2[511:0] >> (SRC1[322:320] * 64))[63:0];
TMP_DEST[447:384] := (TMP_SRC2[511:0] >> (SRC1[386:384] * 64))[63:0];
TMP_DEST[511:448] := (TMP_SRC2[511:0] >> (SRC1[450:448] * 64))[63:0];
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0 ;zeroing-masking
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions”; additionally:
#UD If VEX.L = 0.
If VEX.vvvv != 1111B.
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions”; additionally:
#UD If encoded with EVEX.128.
If EVEX.vvvv != 1111B and with imm8.
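An illustrative sketch of the imm8 form (assuming AVX-512F support; the rotation pattern is an example): because the control byte is reused for the upper 256-bit half, the same four-element reordering is applied within each half.

#include <immintrin.h>

/* Rotate the four doubles within each 256-bit half of a 512-bit vector:
   imm8 = 0x39 (0b00111001) selects source elements 1,2,3,0, and the same
   imm8 is reused for the upper 256-bit half (VPERMPD, EVEX.512 imm8 form). */
__m512d rotate_qwords_per_half(__m512d v)
{
    return _mm512_permutex_pd(v, 0x39);
}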
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Copies doubleword elements of single precision floating-point values from the second source operand (the third
operand) to the destination operand (the first operand) according to the indices in the first source operand (the
second operand). Note that this instruction permits a doubleword in the source operand to be copied to more than
one location in the destination operand.
VEX.256 versions: The first and second operands are YMM registers, the third operand can be a YMM register or
memory location. Bits (MAXVL-1:256) of the corresponding destination register are zeroed.
EVEX encoded version: The first and second operands are ZMM registers, the third operand can be a ZMM register,
a 512-bit memory location or a 512-bit vector broadcasted from a 32-bit memory location. The elements in the
destination are updated using the writemask k1.
An attempt to execute VPERMPS encoded with VEX.L = 0 will cause an #UD exception.
Operation
VPERMPS (EVEX forms)
(KL, VL) = (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN TMP_SRC2[i+31:i] := SRC2[31:0];
ELSE TMP_SRC2[i+31:i] := SRC2[i+31:i];
FI;
ENDFOR;
IF VL = 256
TMP_DEST[31:0] := (TMP_SRC2[255:0] >> (SRC1[2:0] * 32))[31:0];
TMP_DEST[63:32] := (TMP_SRC2[255:0] >> (SRC1[34:32] * 32))[31:0];
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
Additionally:
#UD If VEX.L = 0.
EVEX-encoded instruction, see Table 2-52, “Type E4NF Class Exception Conditions.”
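A minimal sketch of the variable-index form (assuming AVX2 for the 256-bit VEX encoding; the broadcast pattern and name are illustrative): each index dword selects a source element by its low 3 bits.

#include <immintrin.h>

/* Broadcast element 3 of an 8-float vector to every position: each index
   dword selects a source element by its low 3 bits (VPERMPS, VEX.256 form). */
__m256 broadcast_lane3(__m256 v)
{
    __m256i idx = _mm256_set1_epi32(3);
    return _mm256_permutevar8x32_ps(v, idx);
}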
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
The imm8 version: Copies quadwords from the source operand (the second operand) to the destination operand
(the first operand) according to the indices specified by the immediate operand (the third operand). Each two-bit
value in the immediate byte selects a qword element in the source operand.
VEX version: The source operand can be a YMM register or a memory location. Bits (MAXVL-1:256) of the corre-
sponding destination register are zeroed.
In the EVEX.512 encoded version, the elements in the destination are updated using the writemask k1, and the imm8
bits are reused as control bits for the upper 256-bit half when the control comes from the immediate. The
source operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 64-bit
memory location.
Immediate control versions: VEX.vvvv and EVEX.vvvv are reserved and must be 1111b otherwise instructions will
#UD.
The vector control version: Copies quadwords from the second source operand (the third operand) to the destina-
tion operand (the first operand) according to the indices in the first source operand (the second operand). The first
3 bits of each 64-bit element in the index operand select which quadword in the second source operand to copy.
The first and second operands are ZMM registers, the third operand can be a ZMM register, a 512-bit memory loca-
tion or a 512-bit vector broadcasted from a 64-bit memory location. The elements in the destination are updated
using the writemask k1.
Note that this instruction permits a qword in the source operand to be copied to multiple locations in the destination
operand.
Operation
VPERMQ (EVEX - imm8 control forms)
(KL, VL) = (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN TMP_SRC[i+63:i] := SRC[63:0];
ELSE TMP_SRC[i+63:i] := SRC[i+63:i];
FI;
ENDFOR;
TMP_DEST[63:0] := (TMP_SRC[255:0] >> (IMM8[1:0] * 64))[63:0];
TMP_DEST[127:64] := (TMP_SRC[255:0] >> (IMM8[3:2] * 64))[63:0];
TMP_DEST[191:128] := (TMP_SRC[255:0] >> (IMM8[5:4] * 64))[63:0];
TMP_DEST[255:192] := (TMP_SRC[255:0] >> (IMM8[7:6] * 64))[63:0];
IF VL >= 512
TMP_DEST[319:256] := (TMP_SRC[511:256] >> (IMM8[1:0] * 64))[63:0];
TMP_DEST[383:320] := (TMP_SRC[511:256] >> (IMM8[3:2] * 64))[63:0];
TMP_DEST[447:384] := (TMP_SRC[511:256] >> (IMM8[5:4] * 64))[63:0];
TMP_DEST[511:448] := (TMP_SRC[511:256] >> (IMM8[7:6] * 64))[63:0];
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := TMP_DEST[i+63:i]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0 ;zeroing-masking
FI;
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
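For the imm8 form just shown, an illustrative intrinsic use (assuming AVX2 support; the reversal pattern is an example) that reverses the four quadwords of a 256-bit vector:

#include <immintrin.h>

/* Reverse quadword order: imm8 = 0x1B (0b00011011) selects source elements
   3,2,1,0 for destination elements 0,1,2,3 (VPERMQ, imm8 form). */
__m256i reverse_qwords(__m256i v)
{
    return _mm256_permute4x64_epi64(v, 0x1B);
}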
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Permutes byte values from two tables, comprising the first operand (also the destination operand) and the third
operand (the second source operand). The second operand (the first source operand) provides byte indices to
select byte results from the two tables. The selected byte elements are written to the destination at byte granu-
larity under the writemask k1.
The first and second operands are ZMM/YMM/XMM registers. The second operand contains input indices to select
elements from the two input tables in the 1st and 3rd operands. The first operand is also the destination of the
result. The second source operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit memory location. In
each index byte, the id bit for table selection is bit 6/5/4, and bits [5:0]/[4:0]/[3:0] selects element within each
input table.
Note that these instructions permit a byte value in the source operands to be copied to more than one location in
the destination operand. Also, the second table and the indices can be reused in subsequent iterations, but the first
table is overwritten.
Bits (MAX_VL-1:256/128) of the destination are zeroed for VL=256,128.
Operation
VPERMT2B (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
IF VL = 128:
id := 3;
ELSE IF VL = 256:
id := 4;
ELSE IF VL = 512:
id := 5;
FI;
TMP_DEST[VL-1:0] := DEST[VL-1:0];
FOR j := 0 TO KL-1
off := 8*SRC1[j*8 + id: j*8] ;
IF k1[j] OR *no writemask*:
DEST[j*8 + 7: j*8] := SRC1[j*8+id+1]? SRC2[off+7:off] : TMP_DEST[off+7:off];
ELSE IF *zeroing-masking*
DEST[j*8 + 7: j*8] := 0;
*ELSE
DEST[j*8 + 7: j*8] remains unchanged*
FI;
ENDFOR
DEST[MAX_VL-1:VL] := 0;
Other Exceptions
See Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
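An illustrative application of the two-table byte permute (assuming AVX512_VBMI support; the table layout and names are assumptions): the two 64-byte sources act as a single 128-entry lookup table, with index bit 6 selecting the table.

#include <immintrin.h>
#include <stdint.h>

/* Look up 64 bytes at once in a 128-entry table split across two vectors
   (entries 0..63 and 64..127). Index bit 6 selects the table; bits [5:0]
   select the byte within it. */
__m512i lut128_lookup(__m512i indices, const uint8_t table[128])
{
    __m512i lo = _mm512_loadu_si512(table);        /* entries 0..63   */
    __m512i hi = _mm512_loadu_si512(table + 64);   /* entries 64..127 */
    /* Compiles to VPERMT2B or VPERMI2B depending on which register the
       compiler chooses to overwrite. */
    return _mm512_permutex2var_epi8(lo, indices, hi);
}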
VPERMT2W/D/Q/PS/PD—Full Permute From Two Tables Overwriting One Table
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W1 7D /r A V/V (AVX512VL AND Permute word integers from two tables in
VPERMT2W xmm1 {k1}{z}, xmm2, AVX512BW) OR xmm3/m128 and xmm1 using indexes in xmm2 and
xmm3/m128 AVX10.11 store the result in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 7D /r A V/V (AVX512VL AND Permute word integers from two tables in
VPERMT2W ymm1 {k1}{z}, ymm2, AVX512BW) OR ymm3/m256 and ymm1 using indexes in ymm2 and
ymm3/m256 AVX10.11 store the result in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 7D /r A V/V AVX512BW Permute word integers from two tables in
VPERMT2W zmm1 {k1}{z}, zmm2, OR AVX10.11 zmm3/m512 and zmm1 using indexes in zmm2 and
zmm3/m512 store the result in zmm1 using writemask k1.
EVEX.128.66.0F38.W0 7E /r B V/V (AVX512VL AND Permute double-words from two tables in
VPERMT2D xmm1 {k1}{z}, xmm2, AVX512F) OR xmm3/m128/m32bcst and xmm1 using indexes in
xmm3/m128/m32bcst AVX10.11 xmm2 and store the result in xmm1 using
writemask k1.
EVEX.256.66.0F38.W0 7E /r B V/V (AVX512VL AND Permute double-words from two tables in
VPERMT2D ymm1 {k1}{z}, ymm2, AVX512F) OR ymm3/m256/m32bcst and ymm1 using indexes in
ymm3/m256/m32bcst AVX10.11 ymm2 and store the result in ymm1 using
writemask k1.
EVEX.512.66.0F38.W0 7E /r B V/V AVX512F Permute double-words from two tables in
VPERMT2D zmm1 {k1}{z}, zmm2, OR AVX10.11 zmm3/m512/m32bcst and zmm1 using indices in
zmm3/m512/m32bcst zmm2 and store the result in zmm1 using
writemask k1.
EVEX.128.66.0F38.W1 7E /r B V/V (AVX512VL AND Permute quad-words from two tables in
VPERMT2Q xmm1 {k1}{z}, xmm2, AVX512F) OR xmm3/m128/m64bcst and xmm1 using indexes in
xmm3/m128/m64bcst AVX10.11 xmm2 and store the result in xmm1 using
writemask k1.
EVEX.256.66.0F38.W1 7E /r B V/V (AVX512VL AND Permute quad-words from two tables in
VPERMT2Q ymm1 {k1}{z}, ymm2, AVX512F) OR ymm3/m256/m64bcst and ymm1 using indexes in
ymm3/m256/m64bcst AVX10.11 ymm2 and store the result in ymm1 using
writemask k1.
EVEX.512.66.0F38.W1 7E /r B V/V AVX512F Permute quad-words from two tables in
VPERMT2Q zmm1 {k1}{z}, zmm2, OR AVX10.11 zmm3/m512/m64bcst and zmm1 using indices in
zmm3/m512/m64bcst zmm2 and store the result in zmm1 using
writemask k1.
EVEX.128.66.0F38.W0 7F /r B V/V (AVX512VL AND Permute single-precision floating-point values from
VPERMT2PS xmm1 {k1}{z}, AVX512F) OR two tables in xmm3/m128/m32bcst and xmm1
xmm2, xmm3/m128/m32bcst AVX10.11 using indexes in xmm2 and store the result in xmm1
using writemask k1.
EVEX.256.66.0F38.W0 7F /r B V/V (AVX512VL AND Permute single-precision floating-point values from
VPERMT2PS ymm1 {k1}{z}, AVX512F) OR two tables in ymm3/m256/m32bcst and ymm1
ymm2, ymm3/m256/m32bcst AVX10.11 using indexes in ymm2 and store the result in ymm1
using writemask k1.
EVEX.512.66.0F38.W0 7F /r B V/V AVX512F Permute single-precision floating-point values from
VPERMT2PS zmm1 {k1}{z}, OR AVX10.11 two tables in zmm3/m512/m32bcst and zmm1
zmm2, zmm3/m512/m32bcst using indices in zmm2 and store the result in zmm1
using writemask k1.
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W1 7F /r B V/V (AVX512VL AND Permute double precision floating-point values from
VPERMT2PD xmm1 {k1}{z}, AVX512F) OR two tables in xmm3/m128/m64bcst and xmm1
xmm2, xmm3/m128/m64bcst AVX10.11 using indexes in xmm2 and store the result in xmm1
using writemask k1.
EVEX.256.66.0F38.W1 7F /r B V/V (AVX512VL AND Permute double precision floating-point values from
VPERMT2PD ymm1 {k1}{z}, AVX512F) OR two tables in ymm3/m256/m64bcst and ymm1
ymm2, ymm3/m256/m64bcst AVX10.11 using indexes in ymm2 and store the result in ymm1
using writemask k1.
EVEX.512.66.0F38.W1 7F /r B V/V AVX512F Permute double precision floating-point values from
VPERMT2PD zmm1 {k1}{z}, OR AVX10.11 two tables in zmm3/m512/m64bcst and zmm1
zmm2, zmm3/m512/m64bcst using indices in zmm2 and store the result in zmm1
using writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Permutes 16-bit/32-bit/64-bit values in the first operand and the third operand (the second source operand) using
indices in the second operand (the first source operand) to select elements from the first and third operands. The
selected elements are written to the destination operand (the first operand) according to the writemask k1.
The first and second operands are ZMM/YMM/XMM registers. The second operand contains input indices to select
elements from the two input tables in the 1st and 3rd operands. The first operand is also the destination of the
result.
D/Q/PS/PD element versions: The second source operand can be a ZMM/YMM/XMM register, a 512/256/128-bit
memory location or a 512/256/128-bit vector broadcasted from a 32/64-bit memory location. Broadcast from the
low 32/64-bit memory location is performed if EVEX.b and the id bit for table selection are set (selecting table_2).
Dword/PS versions: The id bit for table selection is bit 4/3/2, depending on VL=512, 256, 128. Bits
[3:0]/[2:0]/[1:0] of each element in the input index vector select an element within the two source operands. If
the id bit is 0, table_1 (the first source) is selected; otherwise the second source operand is selected.
Qword/PD versions: The id bit for table selection is bit 3/2/1, and bits [2:0]/[1:0] /bit 0 selects element within each
input table.
Word element versions: The second source operand can be a ZMM/YMM/XMM register, or a 512/256/128-bit
memory location. The id bit for table selection is bit 5/4/3, and bits [4:0]/[3:0]/[2:0] selects element within each
input table.
Note that these instructions permit a 16-bit/32-bit/64-bit value in the source operands to be copied to more than
one location in the destination operand. Note also that in this case, the same index can be reused for example for
a second iteration, while the table elements being permuted are overwritten.
Bits (MAXVL-1:256/128) of the destination are zeroed for VL=256,128.
Operation
VPERMT2W (EVEX encoded versions)
(KL, VL) = (8, 128), (16, 256), (32, 512)
IF VL = 128
id := 2
FI;
IF VL = 256
id := 3
FI;
IF VL = 512
id := 4
FI;
TMP_DEST := DEST
FOR j := 0 TO KL-1
i := j * 16
off := 16*SRC1[i+id:i]
IF k1[j] OR *no writemask*
THEN
DEST[i+15:i] := SRC1[i+id+1] ? SRC2[off+15:off] : TMP_DEST[off+15:off]
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
VPERMT2D __m256i _mm256_mask2_permutex2var_epi32(__m256i a, __m256i idx, __mmask8 k, __m256i b);
VPERMT2D __m256i _mm256_maskz_permutex2var_epi32(__mmask8 k, __m256i a, __m256i idx, __m256i b);
VPERMT2D __m128i _mm_permutex2var_epi32(__m128i a, __m128i idx, __m128i b);
VPERMT2D __m128i _mm_mask_permutex2var_epi32(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMT2D __m128i _mm_mask2_permutex2var_epi32(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMT2D __m128i _mm_maskz_permutex2var_epi32(__mmask8 k, __m128i a, __m128i idx, __m128i b);
VPERMT2PD __m512d _mm512_permutex2var_pd(__m512d a, __m512i idx, __m512d b);
VPERMT2PD __m512d _mm512_mask_permutex2var_pd(__m512d a, __mmask8 k, __m512i idx, __m512d b);
VPERMT2PD __m512d _mm512_mask2_permutex2var_pd(__m512d a, __m512i idx, __mmask8 k, __m512d b);
VPERMT2PD __m512d _mm512_maskz_permutex2var_pd(__mmask8 k, __m512d a, __m512i idx, __m512d b);
VPERMT2PD __m256d _mm256_permutex2var_pd(__m256d a, __m256i idx, __m256d b);
VPERMT2PD __m256d _mm256_mask_permutex2var_pd(__m256d a, __mmask8 k, __m256i idx, __m256d b);
VPERMT2PD __m256d _mm256_mask2_permutex2var_pd(__m256d a, __m256i idx, __mmask8 k, __m256d b);
VPERMT2PD __m256d _mm256_maskz_permutex2var_pd(__mmask8 k, __m256d a, __m256i idx, __m256d b);
VPERMT2PD __m128d _mm_permutex2var_pd(__m128d a, __m128i idx, __m128d b);
VPERMT2PD __m128d _mm_mask_permutex2var_pd(__m128d a, __mmask8 k, __m128i idx, __m128d b);
VPERMT2PD __m128d _mm_mask2_permutex2var_pd(__m128d a, __m128i idx, __mmask8 k, __m128d b);
VPERMT2PD __m128d _mm_maskz_permutex2var_pd(__mmask8 k, __m128d a, __m128i idx, __m128d b);
VPERMT2PS __m512 _mm512_permutex2var_ps(__m512 a, __m512i idx, __m512 b);
VPERMT2PS __m512 _mm512_mask_permutex2var_ps(__m512 a, __mmask16 k, __m512i idx, __m512 b);
VPERMT2PS __m512 _mm512_mask2_permutex2var_ps(__m512 a, __m512i idx, __mmask16 k, __m512 b);
VPERMT2PS __m512 _mm512_maskz_permutex2var_ps(__mmask16 k, __m512 a, __m512i idx, __m512 b);
VPERMT2PS __m256 _mm256_permutex2var_ps(__m256 a, __m256i idx, __m256 b);
VPERMT2PS __m256 _mm256_mask_permutex2var_ps(__m256 a, __mmask8 k, __m256i idx, __m256 b);
VPERMT2PS __m256 _mm256_mask2_permutex2var_ps(__m256 a, __m256i idx, __mmask8 k, __m256 b);
VPERMT2PS __m256 _mm256_maskz_permutex2var_ps(__mmask8 k, __m256 a, __m256i idx, __m256 b);
VPERMT2PS __m128 _mm_permutex2var_ps(__m128 a, __m128i idx, __m128 b);
VPERMT2PS __m128 _mm_mask_permutex2var_ps(__m128 a, __mmask8 k, __m128i idx, __m128 b);
VPERMT2PS __m128 _mm_mask2_permutex2var_ps(__m128 a, __m128i idx, __mmask8 k, __m128 b);
VPERMT2PS __m128 _mm_maskz_permutex2var_ps(__mmask8 k, __m128 a, __m128i idx, __m128 b);
VPERMT2Q __m512i _mm512_permutex2var_epi64(__m512i a, __m512i idx, __m512i b);
VPERMT2Q __m512i _mm512_mask_permutex2var_epi64(__m512i a, __mmask8 k, __m512i idx, __m512i b);
VPERMT2Q __m512i _mm512_mask2_permutex2var_epi64(__m512i a, __m512i idx, __mmask8 k, __m512i b);
VPERMT2Q __m512i _mm512_maskz_permutex2var_epi64(__mmask8 k, __m512i a, __m512i idx, __m512i b);
VPERMT2Q __m256i _mm256_permutex2var_epi64(__m256i a, __m256i idx, __m256i b);
VPERMT2Q __m256i _mm256_mask_permutex2var_epi64(__m256i a, __mmask8 k, __m256i idx, __m256i b);
VPERMT2Q __m256i _mm256_mask2_permutex2var_epi64(__m256i a, __m256i idx, __mmask8 k, __m256i b);
VPERMT2Q __m256i _mm256_maskz_permutex2var_epi64(__mmask8 k, __m256i a, __m256i idx, __m256i b);
VPERMT2Q __m128i _mm_permutex2var_epi64(__m128i a, __m128i idx, __m128i b);
VPERMT2Q __m128i _mm_mask_permutex2var_epi64(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMT2Q __m128i _mm_mask2_permutex2var_epi64(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMT2Q __m128i _mm_maskz_permutex2var_epi64(__mmask8 k, __m128i a, __m128i idx, __m128i b);
VPERMT2W __m512i _mm512_permutex2var_epi16(__m512i a, __m512i idx, __m512i b);
VPERMT2W __m512i _mm512_mask_permutex2var_epi16(__m512i a, __mmask32 k, __m512i idx, __m512i b);
VPERMT2W __m512i _mm512_mask2_permutex2var_epi16(__m512i a, __m512i idx, __mmask32 k, __m512i b);
VPERMT2W __m512i _mm512_maskz_permutex2var_epi16(__mmask32 k, __m512i a, __m512i idx, __m512i b);
VPERMT2W __m256i _mm256_permutex2var_epi16(__m256i a, __m256i idx, __m256i b);
VPERMT2W __m256i _mm256_mask_permutex2var_epi16(__m256i a, __mmask16 k, __m256i idx, __m256i b);
VPERMT2W __m256i _mm256_mask2_permutex2var_epi16(__m256i a, __m256i idx, __mmask16 k, __m256i b);
VPERMT2W __m256i _mm256_maskz_permutex2var_epi16(__mmask16 k, __m256i a, __m256i idx, __m256i b);
VPERMT2W __m128i _mm_permutex2var_epi16(__m128i a, __m128i idx, __m128i b);
VPERMT2W __m128i _mm_mask_permutex2var_epi16(__m128i a, __mmask8 k, __m128i idx, __m128i b);
VPERMT2W __m128i _mm_mask2_permutex2var_epi16(__m128i a, __m128i idx, __mmask8 k, __m128i b);
VPERMT2W __m128i _mm_maskz_permutex2var_epi16(__mmask8 k, __m128i a, __m128i idx, __m128i b);
Other Exceptions
VPERMT2D/Q/PS/PD: See Table 2-52, “Type E4NF Class Exception Conditions.”
VPERMT2W: See Exceptions Type E4NF.nb in Table 2-52, “Type E4NF Class Exception Conditions.”
VPEXPANDB/VPEXPANDW—Expand Byte/Word Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W0 62 /r A V/V (AVX512_VBMI2 Expands up to 128 bits of packed byte values
VPEXPANDB xmm1{k1}{z}, m128 AND AVX512VL) from m128 to xmm1 with writemask k1.
OR AVX10.11
EVEX.128.66.0F38.W0 62 /r B V/V (AVX512_VBMI2 Expands up to 128 bits of packed byte values
VPEXPANDB xmm1{k1}{z}, xmm2 AND AVX512VL) from xmm2 to xmm1 with writemask k1.
OR AVX10.11
EVEX.256.66.0F38.W0 62 /r A V/V (AVX512_VBMI2 Expands up to 256 bits of packed byte values
VPEXPANDB ymm1{k1}{z}, m256 AND AVX512VL) from m256 to ymm1 with writemask k1.
OR AVX10.11
EVEX.256.66.0F38.W0 62 /r B V/V (AVX512_VBMI2 Expands up to 256 bits of packed byte values
VPEXPANDB ymm1{k1}{z}, ymm2 AND AVX512VL) from ymm2 to ymm1 with writemask k1.
OR AVX10.11
EVEX.512.66.0F38.W0 62 /r A V/V AVX512_VBMI2 Expands up to 512 bits of packed byte values
VPEXPANDB zmm1{k1}{z}, m512 OR AVX10.11 from m512 to zmm1 with writemask k1.
EVEX.512.66.0F38.W0 62 /r B V/V AVX512_VBMI2 Expands up to 512 bits of packed byte values
VPEXPANDB zmm1{k1}{z}, zmm2 OR AVX10.11 from zmm2 to zmm1 with writemask k1.
EVEX.128.66.0F38.W1 62 /r A V/V (AVX512_VBMI2 Expands up to 128 bits of packed word values
VPEXPANDW xmm1{k1}{z}, m128 AND AVX512VL) from m128 to xmm1 with writemask k1.
OR AVX10.11
EVEX.128.66.0F38.W1 62 /r B V/V (AVX512_VBMI2 Expands up to 128 bits of packed word values
VPEXPANDW xmm1{k1}{z}, xmm2 AND AVX512VL) from xmm2 to xmm1 with writemask k1.
OR AVX10.11
EVEX.256.66.0F38.W1 62 /r A V/V (AVX512_VBMI2 Expands up to 256 bits of packed word values
VPEXPANDW ymm1{k1}{z}, m256 AND AVX512VL) from m256 to ymm1 with writemask k1.
OR AVX10.11
EVEX.256.66.0F38.W1 62 /r B V/V (AVX512_VBMI2 Expands up to 256 bits of packed word values
VPEXPANDW ymm1{k1}{z}, ymm2 AND AVX512VL) from ymm2 to ymm1 with writemask k1.
OR AVX10.11
EVEX.512.66.0F38.W1 62 /r A V/V AVX512_VBMI2 Expands up to 512 bits of packed word values
VPEXPANDW zmm1{k1}{z}, m512 OR AVX10.11 from m512 to zmm1 with writemask k1.
EVEX.512.66.0F38.W1 62 /r B V/V AVX512_VBMI2 Expands up to 512 bits of packed word
VPEXPANDW zmm1{k1}{z}, zmm2 OR AVX10.11 values from zmm2 to zmm1 with writemask
k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Operation
VPEXPANDB
(KL, VL) = (16, 128), (32, 256), (64, 512)
k := 0
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.byte[j] := SRC.byte[k];
k := k + 1
ELSE:
IF *merging-masking*:
*DEST.byte[j] remains unchanged*
ELSE: ; zeroing-masking
DEST.byte[j] := 0
DEST[MAX_VL-1:VL] := 0
VPEXPANDW
(KL, VL) = (8,128), (16,256), (32, 512)
k := 0
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.word[j] := SRC.word[k];
k := k + 1
ELSE:
IF *merging-masking*:
*DEST.word[j] remains unchanged*
ELSE: ; zeroing-masking
DEST.word[j] := 0
DEST[MAX_VL-1:VL] := 0
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Expand (load) up to 16 contiguous doubleword integer values of the input vector in the source operand (the second
operand) to sparse elements in the destination operand (the first operand), selected by the writemask k1. The
destination operand is a ZMM register, the source operand can be a ZMM register or memory location.
The input vector starts from the lowest element in the source operand. The opmask register k1 selects the desti-
nation elements (a partial vector or sparse elements if less than 16 elements) to be replaced by the ascending
elements in the input vector. Destination elements not selected by the writemask k1 are either unmodified or
zeroed, depending on EVEX.z.
Note: EVEX.vvvv is reserved and must be 1111b; otherwise, instructions will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
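A minimal C sketch of the expand semantics, assuming a compiler with AVX-512F support and using the _mm512_maskz_expand_epi32 intrinsic; the mask and source values are illustrative only:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Source vector: contiguous values 0..15 in ascending element order. */
    __m512i src = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                   7, 6, 5, 4, 3, 2, 1, 0);
    /* The writemask selects destination elements 1, 3, 4, and 10.  The lowest
       source elements (0, 1, 2, 3) are expanded into those positions in
       ascending order; the remaining destination elements are zeroed. */
    __mmask16 k = (1u << 1) | (1u << 3) | (1u << 4) | (1u << 10);
    __m512i dst = _mm512_maskz_expand_epi32(k, src);
    int out[16];
    _mm512_storeu_si512(out, dst);
    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]);   /* 0 0 0 1 2 0 0 0 0 0 3 0 0 0 0 0 */
    printf("\n");
    return 0;
}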
Operation
VPEXPANDD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
k := 0
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
DEST[i+31:i] := SRC[k+31:k];
k := k + 32
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VPEXPANDQ—Load Sparse Packed Quadword Integer Values From Dense Memory/Register
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W1 89 /r A V/V (AVX512VL AND Expand packed quad-word integer values
VPEXPANDQ xmm1 {k1}{z}, xmm2/m128 AVX512F) OR from xmm2/m128 to xmm1 using
AVX10.11 writemask k1.
EVEX.256.66.0F38.W1 89 /r A V/V (AVX512VL AND Expand packed quad-word integer values
VPEXPANDQ ymm1 {k1}{z}, ymm2/m256 AVX512F) OR from ymm2/m256 to ymm1 using
AVX10.11 writemask k1.
EVEX.512.66.0F38.W1 89 /r A V/V AVX512F Expand packed quad-word integer values
VPEXPANDQ zmm1 {k1}{z}, zmm2/m512 OR AVX10.11 from zmm2/m512 to zmm1 using writemask
k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Expand (load) up to 8 quadword integer values from the source operand (the second operand) to sparse elements
in the destination operand (the first operand), selected by the writemask k1. The destination operand is a ZMM
register, the source operand can be a ZMM register or memory location.
The input vector starts from the lowest element in the source operand. The opmask register k1 selects the desti-
nation elements (a partial vector or sparse elements if less than 8 elements) to be replaced by the ascending
elements in the input vector. Destination elements not selected by the writemask k1 are either unmodified or
zeroed, depending on EVEX.z.
Note: EVEX.vvvv is reserved and must be 1111b; otherwise, instructions will #UD.
Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element
instead of the size of the full vector.
Operation
VPEXPANDQ (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
k := 0
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN
DEST[i+63:i] := SRC[k+63:k];
k := k + 64
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instruction, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
VPGATHERDD/VPGATHERDQ—Gather Packed Dword, Packed Qword With Signed Dword Indices
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W0 90 /vsib A V/V (AVX512VL AND Using signed dword indices, gather dword values
VPGATHERDD xmm1 {k1}, vm32x AVX512F) OR from memory using writemask k1 for merging-
AVX10.11 masking.
EVEX.256.66.0F38.W0 90 /vsib A V/V (AVX512VL AND Using signed dword indices, gather dword values
VPGATHERDD ymm1 {k1}, vm32y AVX512F) OR from memory using writemask k1 for merging-
AVX10.11 masking.
EVEX.512.66.0F38.W0 90 /vsib A V/V AVX512F Using signed dword indices, gather dword values
VPGATHERDD zmm1 {k1}, vm32z OR AVX10.11 from memory using writemask k1 for merging-
masking.
EVEX.128.66.0F38.W1 90 /vsib A V/V (AVX512VL AND Using signed dword indices, gather quadword values
VPGATHERDQ xmm1 {k1}, vm32x AVX512F) OR from memory using writemask k1 for merging-
AVX10.11 masking.
EVEX.256.66.0F38.W1 90 /vsib A V/V (AVX512VL AND Using signed dword indices, gather quadword values
VPGATHERDQ ymm1 {k1}, vm32x AVX512F) OR from memory using writemask k1 for merging-
AVX10.11 masking.
EVEX.512.66.0F38.W1 90 /vsib A V/V AVX512F Using signed dword indices, gather quadword values
VPGATHERDQ zmm1 {k1}, vm32y OR AVX10.11 from memory using writemask k1 for merging-
masking.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
A set of 16 or 8 doubleword/quadword memory locations pointed to by base address BASE_ADDR and index vector
VINDEX with scale SCALE are gathered. The result is written into vector zmm1. The elements are specified via the
VSIB (i.e., the index register is a zmm, holding packed indices). Elements will only be loaded if their corresponding
mask bit is one. If an element’s mask bit is not set, the corresponding element of the destination register (zmm1)
is left unchanged. The entire mask register will be set to zero by this instruction unless it triggers an exception.
This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception
is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination
register and the mask register (k1) are partially updated; those elements that have been gathered are placed into
the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already
gathered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruc-
tion breakpoint is not re-triggered when the instruction is continued.
If the data element size is less than the index element size, the higher part of the destination register and the mask
register do not correspond to any elements being gathered. This instruction sets those higher parts to zero. It may
update these unused elements in one or both of those registers even if the instruction triggers an exception, and
even if the instruction triggers the exception before gathering any elements.
Note that:
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel-
64 memory-ordering model.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all
elements closer to the LSB of the destination zmm will be completed (and non-faulting). Individual elements
closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered
in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to
the left of a faulting one may be gathered before the fault is delivered. A given implementation of this
instruction is repeatable - given the same input values and architectural state, the same set of elements to the
left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
• These instructions do not accept zeroing-masking since the 0 values in k1 are used to determine completion.
Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if
ModRM.rm is different from 100b.
This instruction has the same disp8*N and alignment rules as for scalar instructions (Tuple 1).
The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX. The instruction
will #UD fault if the k0 mask register is specified.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit
mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are
ignored.
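A minimal C sketch of a dword gather with merging-masking, assuming AVX-512F compiler support; the table, indices, and mask below are illustrative only:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    int table[32];
    for (int i = 0; i < 32; i++)
        table[i] = 100 + i;
    /* Signed dword indices into 'table'; scale 4 converts an element index
       into a byte offset. */
    __m512i vindex = _mm512_set_epi32(31, 29, 27, 25, 23, 21, 19, 17,
                                      15, 13, 11, 9, 7, 5, 3, 1);
    /* Only elements whose mask bit is 1 are loaded; the rest keep the value
       from merge_src (the merging behavior described above). */
    __m512i merge_src = _mm512_set1_epi32(-1);
    __mmask16 k = 0x00FF;            /* gather only the low eight elements */
    __m512i g = _mm512_mask_i32gather_epi32(merge_src, k, vindex, table, 4);
    int out[16];
    _mm512_storeu_si512(out, g);
    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]);       /* 101 103 ... 115, then eight -1 values */
    printf("\n");
    return 0;
}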
Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a ZMM register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1 or 4 byte displacement
DEST[MAXVL-1:VL] := 0
Other Exceptions
See Table 2-63, “Type E12 Class Exception Conditions.”
VPGATHERQD/VPGATHERQQ—Gather Packed Dword, Packed Qword with Signed Qword Indices
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W0 91 /vsib A V/V (AVX512VL AND Using signed qword indices, gather dword values
VPGATHERQD xmm1 {k1}, vm64x AVX512F) OR from memory using writemask k1 for merging-
AVX10.11 masking.
EVEX.256.66.0F38.W0 91 /vsib A V/V (AVX512VL AND Using signed qword indices, gather dword values
VPGATHERQD xmm1 {k1}, vm64y AVX512F) OR from memory using writemask k1 for merging-
AVX10.11 masking.
EVEX.512.66.0F38.W0 91 /vsib A V/V AVX512F Using signed qword indices, gather dword values
VPGATHERQD ymm1 {k1}, vm64z OR AVX10.11 from memory using writemask k1 for merging-
masking.
EVEX.128.66.0F38.W1 91 /vsib A V/V (AVX512VL AND Using signed qword indices, gather quadword
VPGATHERQQ xmm1 {k1}, vm64x AVX512F) OR values from memory using writemask k1 for
AVX10.11 merging-masking.
EVEX.256.66.0F38.W1 91 /vsib A V/V (AVX512VL AND Using signed qword indices, gather quadword
VPGATHERQQ ymm1 {k1}, vm64y AVX512F) OR values from memory using writemask k1 for
AVX10.11 merging-masking.
EVEX.512.66.0F38.W1 91 /vsib A V/V AVX512F Using signed qword indices, gather quadword
VPGATHERQQ zmm1 {k1}, vm64z OR AVX10.11 values from memory using writemask k1 for
merging-masking.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
A set of 8 doubleword/quadword memory locations pointed to by base address BASE_ADDR and index vector
VINDEX with scale SCALE are gathered. The result is written into a vector register. The elements are specified via
the VSIB (i.e., the index register is a vector register, holding packed indices). Elements will only be loaded if their
corresponding mask bit is one. If an element’s mask bit is not set, the corresponding element of the destination
register is left unchanged. The entire mask register will be set to zero by this instruction unless it triggers an excep-
tion.
This instruction can be suspended by an exception if at least one element is already gathered (i.e., if the exception
is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination
register and the mask register (k1) are partially updated; those elements that have been gathered are placed into
the destination register and have their mask bits set to zero. If any traps or interrupts are pending from already
gathered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruc-
tion breakpoint is not re-triggered when the instruction is continued.
If the data element size is less than the index element size, the higher part of the destination register and the mask
register do not correspond to any elements being gathered. This instruction sets those higher parts to zero. It may
update these unused elements in one or both of those registers even if the instruction triggers an exception, and
even if the instruction triggers the exception before gathering any elements.
Note that:
• The values may be read from memory in any order. Memory ordering with other instructions follows the Intel-
64 memory-ordering model.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all
elements closer to the LSB of the destination zmm will be completed (and non-faulting). Individual elements
closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered
in the conventional order.
• Elements may be gathered in any order, but faults must be delivered in a right-to-left order; thus, elements to
the left of a faulting one may be gathered before the fault is delivered. A given implementation of this
instruction is repeatable - given the same input values and architectural state, the same set of elements to the
left of the faulting one will be gathered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
• These instructions do not accept zeroing-masking since the 0 values in k1 are used to determine completion.
Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if
ModRM.rm is different from 100b.
This instruction has the same disp8*N and alignment rules as for scalar instructions (Tuple 1).
The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX. The instruction
will #UD fault if the k0 mask register is specified.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit
mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are
ignored.
Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a ZMM register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1 or 4 byte displacement
ENDFOR
k1[MAX_KL-1:KL] := 0
DEST[MAXVL-1:VL] := 0
Other Exceptions
See Table 2-63, “Type E12 Class Exception Conditions.”
VPLZCNTD/Q—Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W0 44 /r A V/V (AVX512VL AND Count the number of leading zero bits in each dword
VPLZCNTD xmm1 {k1}{z}, AVX512CD) OR element of xmm2/m128/m32bcst using writemask k1.
xmm2/m128/m32bcst AVX10.11
EVEX.256.66.0F38.W0 44 /r A V/V (AVX512VL AND Count the number of leading zero bits in each dword
VPLZCNTD ymm1 {k1}{z}, AVX512CD) OR element of ymm2/m256/m32bcst using writemask k1.
ymm2/m256/m32bcst AVX10.11
EVEX.512.66.0F38.W0 44 /r A V/V AVX512CD Count the number of leading zero bits in each dword
VPLZCNTD zmm1 {k1}{z}, OR AVX10.11 element of zmm2/m512/m32bcst using writemask k1.
zmm2/m512/m32bcst
EVEX.128.66.0F38.W1 44 /r A V/V (AVX512VL AND Count the number of leading zero bits in each qword
VPLZCNTQ xmm1 {k1}{z}, AVX512CD) OR element of xmm2/m128/m64bcst using writemask k1.
xmm2/m128/m64bcst AVX10.11
EVEX.256.66.0F38.W1 44 /r A V/V (AVX512VL AND Count the number of leading zero bits in each qword
VPLZCNTQ ymm1 {k1}{z}, AVX512CD) OR element of ymm2/m256/m64bcst using writemask k1.
ymm2/m256/m64bcst AVX10.11
EVEX.512.66.0F38.W1 44 /r A V/V AVX512CD Count the number of leading zero bits in each qword
VPLZCNTQ zmm1 {k1}{z}, OR AVX10.11 element of zmm2/m512/m64bcst using writemask k1.
zmm2/m512/m64bcst
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Counts the number of leading zero bits (starting from the most significant bit) in each dword or qword element of
the source operand (the second operand) and stores the results in the destination register (the first operand)
according to the writemask. If an element is zero, the result for that element is the operand size of the element.
EVEX.512 encoded version: The source operand is a ZMM register, a 512-bit memory location, or a 512-bit vector
broadcasted from a 32/64-bit memory location. The destination operand is a ZMM register, conditionally updated
using writemask k1.
EVEX.256 encoded version: The source operand is a YMM register, a 256-bit memory location, or a 256-bit vector
broadcasted from a 32/64-bit memory location. The destination operand is a YMM register, conditionally updated
using writemask k1.
EVEX.128 encoded version: The source operand is an XMM register, a 128-bit memory location, or a 128-bit vector
broadcasted from a 32/64-bit memory location. The destination operand is an XMM register, conditionally updated
using writemask k1.
EVEX.vvvv is reserved and must be 1111b; otherwise, instructions will #UD.
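As an illustration, assuming AVX-512CD compiler support, the per-element count can be obtained with the _mm512_lzcnt_epi32 intrinsic; the input values are illustrative only:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* lzcnt(0) = 32 (the element width), lzcnt(1) = 31, lzcnt(2..3) = 30, ... */
    __m512i src = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                   7, 6, 5, 4, 3, 2, 1, 0);
    __m512i cnt = _mm512_lzcnt_epi32(src);
    int out[16];
    _mm512_storeu_si512(out, cnt);
    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]);   /* 32 31 30 30 29 29 29 29 28 ... 28 */
    printf("\n");
    return 0;
}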
Operation
VPLZCNTD
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j*32
IF MaskBit(j) OR *no writemask*
THEN
temp := 32
DEST[i+31:i] := 0
WHILE (temp > 0) AND (SRC[i+temp-1] = 0)
DO
temp := temp – 1
DEST[i+31:i] := DEST[i+31:i] + 1
OD
ELSE
IF *merging-masking*
THEN *DEST[i+31:i] remains unchanged*
ELSE DEST[i+31:i] := 0
FI
FI
ENDFOR
DEST[MAXVL-1:VL] := 0
VPLZCNTQ
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j*64
IF MaskBit(j) OR *no writemask*
THEN
temp := 64
DEST[i+63:i] := 0
WHILE (temp > 0) AND (SRC[i+temp-1] = 0)
DO
temp := temp – 1
DEST[i+63:i] := DEST[i+63:i] + 1
OD
ELSE
IF *merging-masking*
THEN *DEST[i+63:i] remains unchanged*
ELSE DEST[i+63:i] := 0
FI
FI
ENDFOR
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
VPMADD52HUQ—Packed Multiply of Unsigned 52-Bit Unsigned Integers and Add High 52-Bit
Products to 64-Bit Accumulators
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
VEX.128.66.0F38.W1 B5 /r A V/V AVX-IFMA Multiply unsigned 52-bit integers in xmm2 and
VPMADD52HUQ xmm1, xmm2, xmm3/m128 and add the high 52 bits of the
xmm3/m128 104-bit product to the qword unsigned integers
in xmm1.
VEX.256.66.0F38.W1 B5 /r A V/V AVX-IFMA Multiply unsigned 52-bit integers in ymm2 and
VPMADD52HUQ ymm1, ymm2, ymm3/m256 and add the high 52 bits of the
ymm3/m256 104-bit product to the qword unsigned integers
in ymm1.
EVEX.128.66.0F38.W1 B5 /r B V/V (AVX512_IFMA Multiply unsigned 52-bit integers in xmm2 and
VPMADD52HUQ xmm1 {k1}{z}, xmm2, AND AVX512VL) xmm3/m128 and add the high 52 bits of the
xmm3/m128/m64bcst OR AVX10.11 104-bit product to the qword unsigned integers
in xmm1 using writemask k1.
EVEX.256.66.0F38.W1 B5 /r B V/V (AVX512_IFMA Multiply unsigned 52-bit integers in ymm2 and
VPMADD52HUQ ymm1 {k1}{z}, ymm2, AND AVX512VL) ymm3/m256 and add the high 52 bits of the
ymm3/m256/m64bcst OR AVX10.11 104-bit product to the qword unsigned integers
in ymm1 using writemask k1.
EVEX.512.66.0F38.W1 B5 /r B V/V AVX512_IFMA Multiply unsigned 52-bit integers in zmm2 and
VPMADD52HUQ zmm1 {k1}{z}, zmm2, OR AVX10.11 zmm3/m512 and add the high 52 bits of the
zmm3/m512/m64bcst 104-bit product to the qword unsigned integers
in zmm1 using writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Multiplies packed unsigned 52-bit integers in each qword element of the first source operand (the second oper-
and) with the packed unsigned 52-bit integers in the corresponding elements of the second source operand (the
third operand) to form packed 104-bit intermediate results. The high 52-bit unsigned integer of each 104-bit
product is added to the corresponding qword unsigned integer of the destination operand (the first operand)
under the writemask k1.
The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM reg-
ister, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory loca-
tion. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1 at 64-bit
granularity.
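A minimal C sketch of the high-half multiply-add, assuming AVX-512 IFMA compiler support and using the _mm512_madd52hi_epu64 intrinsic (the first argument is the accumulator); the values are illustrative only:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* (1 << 26) * (1 << 26) = 1 << 52, so bits [103:52] of the 104-bit
       product are exactly 1; adding that to the accumulator 10 gives 11. */
    __m512i b   = _mm512_set1_epi64(1LL << 26);
    __m512i c   = _mm512_set1_epi64(1LL << 26);
    __m512i acc = _mm512_set1_epi64(10);
    __m512i r   = _mm512_madd52hi_epu64(acc, b, c);
    long long out[8];
    _mm512_storeu_si512(out, r);
    printf("%lld\n", out[0]);   /* 11 */
    return 0;
}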
Operation
VPMADD52HUQ srcdest, src1, src2 (VEX version)
VL = (128, 256)
KL = VL/64
FOR i in 0 .. KL-1:
temp128 := zeroextend64(src1.qword[i][51:0]) * zeroextend64(src2.qword[i][51:0])
srcdest.qword[i] := srcdest.qword[i] + zeroextend64(temp128[103:52])
srcdest[MAXVL-1:VL] := 0
Flags Affected
None.
SIMD Floating-Point Exceptions
None.
Other Exceptions
VEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-51, “Type E4 Class Exception Conditions.”
VPMADD52LUQ—Packed Multiply of Unsigned 52-Bit Integers and Add the Low 52-Bit Products
to Qword Accumulators
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En Bit Mode Flag
Support
VEX.128.66.0F38.W1 B4 /r A V/V AVX-IFMA Multiply unsigned 52-bit integers in xmm2 and
VPMADD52LUQ xmm1, xmm2, xmm3/m128 and add the low 52 bits of the 104-bit
xmm3/m128 product to the qword unsigned integers in xmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Multiplies packed unsigned 52-bit integers in each qword element of the first source operand (the second oper-
and) with the packed unsigned 52-bit integers in the corresponding elements of the second source operand (the
third operand) to form packed 104-bit intermediate results. The low 52-bit unsigned integer of each 104-bit
product is added to the corresponding qword unsigned integer of the destination operand (the first operand)
under the writemask k1.
The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM reg-
ister, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory loca-
tion. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1 at 64-bit
granularity.
Operation
VPMADD52LUQ srcdest, src1, src2 (VEX version)
VL = (128, 256)
KL = VL/64
FOR i in 0 .. KL-1:
temp128 := zeroextend64(src1.qword[i][51:0]) * zeroextend64(src2.qword[i][51:0])
srcdest.qword[i] := srcdest.qword[i] + zeroextend64(temp128[51:0])
srcdest[MAXVL-1:VL] := 0
Flags Affected
None.
SIMD Floating-Point Exceptions
None.
Other Exceptions
VEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-51, “Type E4 Class Exception Conditions.”
VPMOVB2M/VPMOVW2M/VPMOVD2M/VPMOVQ2M—Convert a Vector Register to a Mask
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.F3.0F38.W0 29 /r RM V/V (AVX512VL AND Sets each bit in k1 to 1 or 0 based on the value of the
VPMOVB2M k1, xmm1 AVX512BW) OR most significant bit of the corresponding byte in XMM1.
AVX10.11
EVEX.256.F3.0F38.W0 29 /r RM V/V (AVX512VL AND Sets each bit in k1 to 1 or 0 based on the value of the
VPMOVB2M k1, ymm1 AVX512BW) OR most significant bit of the corresponding byte in YMM1.
AVX10.11
EVEX.512.F3.0F38.W0 29 /r RM V/V AVX512BW Sets each bit in k1 to 1 or 0 based on the value of the
VPMOVB2M k1, zmm1 OR AVX10.11 most significant bit of the corresponding byte in ZMM1.
EVEX.128.F3.0F38.W1 29 /r RM V/V (AVX512VL AND Sets each bit in k1 to 1 or 0 based on the value of the
VPMOVW2M k1, xmm1 AVX512BW) OR most significant bit of the corresponding word in XMM1.
AVX10.11
EVEX.256.F3.0F38.W1 29 /r RM V/V (AVX512VL AND Sets each bit in k1 to 1 or 0 based on the value of the
VPMOVW2M k1, ymm1 AVX512BW) OR most significant bit of the corresponding word in YMM1.
AVX10.11
EVEX.512.F3.0F38.W1 29 /r RM V/V AVX512BW Sets each bit in k1 to 1 or 0 based on the value of the
VPMOVW2M k1, zmm1 OR AVX10.11 most significant bit of the corresponding word in ZMM1.
EVEX.128.F3.0F38.W0 39 /r RM V/V (AVX512VL AND Sets each bit in k1 to 1 or 0 based on the value of the
VPMOVD2M k1, xmm1 AVX512DQ) OR most significant bit of the corresponding doubleword in
AVX10.11 XMM1.
EVEX.256.F3.0F38.W0 39 /r RM V/V (AVX512VL AND Sets each bit in k1 to 1 or 0 based on the value of the
VPMOVD2M k1, ymm1 AVX512DQ) OR most significant bit of the corresponding doubleword in
AVX10.11 YMM1.
EVEX.512.F3.0F38.W0 39 /r RM V/V AVX512DQ Sets each bit in k1 to 1 or 0 based on the value of the
VPMOVD2M k1, zmm1 OR AVX10.11 most significant bit of the corresponding doubleword in
ZMM1.
EVEX.128.F3.0F38.W1 39 /r RM V/V (AVX512VL AND Sets each bit in k1 to 1 or 0 based on the value of the
VPMOVQ2M k1, xmm1 AVX512DQ) OR most significant bit of the corresponding quadword in
AVX10.11 XMM1.
EVEX.256.F3.0F38.W1 39 /r RM V/V (AVX512VL AND Sets each bit in k1 to 1 or 0 based on the value of the
VPMOVQ2M k1, ymm1 AVX512DQ) OR most significant bit of the corresponding quadword in
AVX10.11 YMM1.
EVEX.512.F3.0F38.W1 39 /r RM V/V AVX512DQ Sets each bit in k1 to 1 or 0 based on the value of the
VPMOVQ2M k1, zmm1 OR AVX10.11 most significant bit of the corresponding quadword in
ZMM1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts a vector register to a mask register. Each bit in the destination mask register is set to 1 or 0 depending on
the value of the most significant bit of the corresponding element in the source register.
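A minimal C sketch for the byte form, assuming AVX-512BW compiler support and using the _mm512_movepi8_mask intrinsic; the byte pattern is illustrative only:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Set the most significant bit in every fourth byte. */
    char bytes[64];
    for (int i = 0; i < 64; i++)
        bytes[i] = (i % 4 == 0) ? (char)0x80 : 0x01;
    __m512i v = _mm512_loadu_si512(bytes);
    __mmask64 m = _mm512_movepi8_mask(v);            /* mask bit j = MSB of byte j */
    printf("0x%016llx\n", (unsigned long long)m);    /* 0x1111111111111111 */
    return 0;
}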
Operation
VPMOVB2M (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF SRC[i+7]
THEN DEST[j] := 1
ELSE DEST[j] := 0
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0
Other Exceptions
EVEX-encoded instruction, see Table 2-57, “Type E7NM Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Operation
VPMOVDB instruction (EVEX encoded versions) when dest is a register
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TruncateDoubleWordToByte (SRC[m+31:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/4] := 0;
Other Exceptions
EVEX-encoded instruction, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Operation
VPMOVDW instruction (EVEX encoded versions) when dest is a register
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TruncateDoubleWordToWord (SRC[m+31:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0;
Other Exceptions
EVEX-encoded instruction, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Converts a mask register to a vector register. Each element in the destination register is set to all 1’s or all 0’s
depending on the value of the corresponding bit in the source mask register.
The source operand is a mask register. The destination operand is a ZMM/YMM/XMM register.
EVEX.vvvv is reserved and must be 1111b; otherwise, instructions will #UD.
Other Exceptions
EVEX-encoded instruction, see Table 2-57, “Type E7NM Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Operation
VPMOVQB instruction (EVEX encoded versions) when dest is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TruncateQuadWordToByte (SRC[m+63:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/8] := 0;
Other Exceptions
EVEX-encoded instruction, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Operation
VPMOVQD instruction (EVEX encoded version) reg-reg form
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 32
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := TruncateQuadWordToDWord (SRC[m+63:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0;
Other Exceptions
EVEX-encoded instruction, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Operation
VPMOVQW instruction (EVEX encoded versions) when dest is a register
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 16
m := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := TruncateQuadWordToWord (SRC[m+63:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/4] := 0;
Other Exceptions
EVEX-encoded instruction, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
VPMOVWB down converts 16-bit integers into packed bytes using truncation. VPMOVSWB converts signed 16-bit
integers into packed signed bytes using signed saturation. VPMOVUSWB converts unsigned word values into
unsigned byte values using unsigned saturation.
The source operand is a ZMM/YMM/XMM register. The destination operand is a YMM/XMM/XMM register or a
256/128/64-bit memory location.
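A minimal C sketch contrasting truncation and signed saturation, assuming AVX-512BW compiler support and using the _mm512_cvtepi16_epi8 and _mm512_cvtsepi16_epi8 intrinsics; the input value is illustrative only:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* 0x1234 truncates to 0x34 and saturates (signed) to 0x7F. */
    __m512i words = _mm512_set1_epi16(0x1234);
    __m256i trunc = _mm512_cvtepi16_epi8(words);    /* VPMOVWB: truncation */
    __m256i sat   = _mm512_cvtsepi16_epi8(words);   /* VPMOVSWB: signed saturation */
    unsigned char t[32], s[32];
    _mm256_storeu_si256((__m256i *)t, trunc);
    _mm256_storeu_si256((__m256i *)s, sat);
    printf("truncated: 0x%02x  saturated: 0x%02x\n", t[0], s[0]);  /* 0x34, 0x7f */
    return 0;
}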
Operation
VPMOVWB instruction (EVEX encoded versions) when dest is a register
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 8
m := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+7:i] := TruncateWordToByte (SRC[m+15:m])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+7:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+7:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL/2] := 0;
Other Exceptions
EVEX-encoded instruction, see Table 2-55, “Type E6 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction selects eight unaligned bytes from each input qword element of the second source operand (the
third operand) and writes eight assembled bytes for each qword element in the destination operand (the first
operand). Each byte result is selected using a byte-granular shift control within the corresponding qword element
of the first source operand (the second operand). Each byte result in the destination operand is updated under the
writemask k1.
Only the low 6 bits of each control byte are used to select an 8-bit slot to extract the output byte from the qword
data in the second source operand. The starting bit of the 8-bit slot can be unaligned relative to any byte boundary
and is extracted from the input qword source at the location specified by the low 6 bits of the control byte. If the 8-
bit slot would exceed the qword boundary, the out-of-bound portion of the 8-bit slot is wrapped back to start from
bit 0 of the input qword element.
The first source operand is a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM reg-
ister, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory loca-
tion. The destination operand is a ZMM/YMM/XMM register.
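A scalar C model of the per-qword byte selection described above, written for clarity rather than as the architectural definition; the control and data values are illustrative only:

#include <stdint.h>
#include <stdio.h>

/* For each of the eight output bytes, the low 6 bits of the corresponding
   control byte give the starting bit position of an 8-bit field inside the
   data qword; the field wraps around from bit 63 back to bit 0. */
static uint64_t multishift_qword(uint64_t ctrl, uint64_t data)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        unsigned shift = (unsigned)(ctrl >> (8 * i)) & 0x3F;
        uint64_t rotated = (data >> shift) | (data << ((64 - shift) & 63));
        result |= (rotated & 0xFF) << (8 * i);
    }
    return result;
}

int main(void) {
    uint64_t data = 0x0123456789ABCDEFULL;
    uint64_t ctrl = 0x0000000000000004ULL;   /* byte 0 starts at bit 4 */
    printf("0x%016llx\n", (unsigned long long)multishift_qword(ctrl, data));
    /* Byte 0 of the result is 0xDE (bits 11:4 of the data qword); bytes 1-7
       start at bit 0 and therefore hold 0xEF. */
    return 0;
}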
Other Exceptions
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction counts the number of bits set to one in each byte, word, dword or qword element of its source (e.g.,
zmm2 or memory) and places the results in the destination register (zmm1). This instruction supports memory
fault suppression.
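A minimal C sketch for the dword form, assuming compiler support for the AVX512_VPOPCNTDQ intrinsic _mm512_popcnt_epi32; the input values are illustrative only:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Population counts: 0 -> 0, 7 -> 3, 0xFF -> 8, -1 (all ones) -> 32. */
    __m512i src = _mm512_set_epi32(-1, -1, -1, -1, 0xFF, 0xFF, 0xFF, 0xFF,
                                   7, 7, 7, 7, 0, 0, 0, 0);
    __m512i cnt = _mm512_popcnt_epi32(src);
    int out[16];
    _mm512_storeu_si512(out, cnt);
    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]);   /* 0 0 0 0 3 3 3 3 8 8 8 8 32 32 32 32 */
    printf("\n");
    return 0;
}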
Operation
VPOPCNTB
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1:
IF MaskBit(j) OR *no writemask*:
DEST.byte[j] := POPCNT(SRC.byte[j])
ELSE IF *merging-masking*:
*DEST.byte[j] remains unchanged*
ELSE:
DEST.byte[j] := 0
DEST[MAX_VL-1:VL] := 0
VPOPCNTW
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1:
IF MaskBit(j) OR *no writemask*:
DEST.word[j] := POPCNT(SRC.word[j])
ELSE IF *merging-masking*:
*DEST.word[j] remains unchanged*
ELSE:
DEST.word[j] := 0
DEST[MAX_VL-1:VL] := 0
VPOPCNTD
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1:
IF MaskBit(j) OR *no writemask*:
IF SRC is broadcast memop:
t := SRC.dword[0]
ELSE:
t := SRC.dword[j]
DEST.dword[j] := POPCNT(t)
ELSE IF *merging-masking*:
*DEST.dword[j] remains unchanged*
ELSE:
DEST.dword[j] := 0
DEST[MAX_VL-1:VL] := 0
Other Exceptions
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Operation
LEFT_ROTATE_DWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC modulo 32;
DEST[31:0] := (SRC << COUNT) | (SRC >> (32 - COUNT));
LEFT_ROTATE_QWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC modulo 64;
DEST[63:0] := (SRC << COUNT) | (SRC >> (64 - COUNT));
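A C rendering of the LEFT_ROTATE helpers above, offered as a reference sketch only; the count is reduced modulo the element width first, and the right-shift amount is masked so that a zero count never shifts by the full width:

#include <stdint.h>
#include <stdio.h>

static uint32_t left_rotate_dword(uint32_t x, unsigned count)
{
    count &= 31;                                    /* COUNT_SRC modulo 32 */
    return (x << count) | (x >> ((32 - count) & 31));
}

static uint64_t left_rotate_qword(uint64_t x, unsigned count)
{
    count &= 63;                                    /* COUNT_SRC modulo 64 */
    return (x << count) | (x >> ((64 - count) & 63));
}

int main(void) {
    printf("0x%08x\n", left_rotate_dword(0x80000001u, 1));               /* 0x00000003 */
    printf("0x%016llx\n", (unsigned long long)left_rotate_qword(1, 63)); /* 0x8000000000000000 */
    return 0;
}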
Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Operation
RIGHT_ROTATE_DWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC modulo 32;
DEST[31:0] := (SRC >> COUNT) | (SRC << (32 - COUNT));
RIGHT_ROTATE_QWORDS(SRC, COUNT_SRC)
COUNT := COUNT_SRC modulo 64;
DEST[63:0] := (SRC >> COUNT) | (SRC << (64 - COUNT));
Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
VPSCATTERDD/VPSCATTERDQ/VPSCATTERQD/VPSCATTERQQ—Scatter Packed Dword, Packed Qword with Signed Dword, Signed Qword Indices
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W0 A0 /vsib A V/V (AVX512VL AND Using signed dword indices, scatter dword values
VPSCATTERDD vm32x {k1}, xmm1 AVX512F) OR to memory using writemask k1.
AVX10.11
EVEX.256.66.0F38.W0 A0 /vsib A V/V (AVX512VL AND Using signed dword indices, scatter dword values
VPSCATTERDD vm32y {k1}, ymm1 AVX512F) OR to memory using writemask k1.
AVX10.11
EVEX.512.66.0F38.W0 A0 /vsib A V/V AVX512F Using signed dword indices, scatter dword values
VPSCATTERDD vm32z {k1}, zmm1 OR AVX10.11 to memory using writemask k1.
EVEX.128.66.0F38.W1 A0 /vsib A V/V (AVX512VL AND Using signed dword indices, scatter qword values
VPSCATTERDQ vm32x {k1}, xmm1 AVX512F) OR to memory using writemask k1.
AVX10.11
EVEX.256.66.0F38.W1 A0 /vsib A V/V (AVX512VL AND Using signed dword indices, scatter qword values
VPSCATTERDQ vm32x {k1}, ymm1 AVX512F) OR to memory using writemask k1.
AVX10.11
EVEX.512.66.0F38.W1 A0 /vsib A V/V AVX512F Using signed dword indices, scatter qword values
VPSCATTERDQ vm32y {k1}, zmm1 OR AVX10.11 to memory using writemask k1.
EVEX.128.66.0F38.W0 A1 /vsib A V/V (AVX512VL AND Using signed qword indices, scatter dword values
VPSCATTERQD vm64x {k1}, xmm1 AVX512F) OR to memory using writemask k1.
AVX10.11
EVEX.256.66.0F38.W0 A1 /vsib A V/V (AVX512VL AND Using signed qword indices, scatter dword values
VPSCATTERQD vm64y {k1}, xmm1 AVX512F) OR to memory using writemask k1.
AVX10.11
EVEX.512.66.0F38.W0 A1 /vsib A V/V AVX512F Using signed qword indices, scatter dword values
VPSCATTERQD vm64z {k1}, ymm1 OR AVX10.11 to memory using writemask k1.
EVEX.128.66.0F38.W1 A1 /vsib A V/V (AVX512VL AND Using signed qword indices, scatter qword values
VPSCATTERQQ vm64x {k1}, xmm1 AVX512F) OR to memory using writemask k1.
AVX10.11
EVEX.256.66.0F38.W1 A1 /vsib A V/V (AVX512VL AND Using signed qword indices, scatter qword values
VPSCATTERQQ vm64y {k1}, ymm1 AVX512F) OR to memory using writemask k1.
AVX10.11
EVEX.512.66.0F38.W1 A1 /vsib A V/V AVX512F Using signed qword indices, scatter qword values
VPSCATTERQQ vm64z {k1}, zmm1 OR AVX10.11 to memory using writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Stores up to 16 elements (8 elements for qword indices) in doubleword vector or 8 elements in quadword vector to
the memory locations pointed by base address BASE_ADDR and index vector VINDEX, with scale SCALE. The
elements are specified via the VSIB (i.e., the index register is a vector register, holding packed indices). Elements
will only be stored if their corresponding mask bit is one. The entire mask register will be set to zero by this instruc-
tion unless it triggers an exception.
This instruction can be suspended by an exception if at least one element is already scattered (i.e., if the exception
is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination
register and the mask register are partially updated. If any traps or interrupts are pending from already scattered
elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruction
breakpoint is not re-triggered when the instruction is continued.
Note that:
• Only writes to overlapping vector indices are guaranteed to be ordered with respect to each other (from LSB to
MSB of the source registers). Note that this also includes partially overlapping vector indices. Writes that are not
overlapped may happen in any order. Memory ordering with other instructions follows the Intel-64 memory
ordering model. Note that this does not account for non-overlapping indices that map into the same physical
address locations.
• If two or more destination indices completely overlap, the “earlier” write(s) may be skipped.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all
elements closer to the LSB of the destination ZMM will be completed (and non-faulting). Individual elements
closer to the MSB may or may not be completed. If a given element triggers multiple faults, they are delivered
in the conventional order.
• Elements may be scattered in any order, but faults must be delivered in a right-to-left order; thus, elements to
the left of a faulting one may be scattered before the fault is delivered. A given implementation of this
instruction is repeatable - given the same input values and architectural state, the same set of elements to the
left of the faulting one will be scattered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
• If this instruction overwrites itself and then takes a fault, only a subset of elements may be completed before
the fault is delivered (as described above). If the fault handler completes and attempts to re-execute this
instruction, the new instruction will be executed, and the scatter will not complete.
Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if
ModRM.rm is different from 100b.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit
mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are
ignored.
The instruction will #UD fault if the k0 mask register is specified.
The instruction will #UD fault if EVEX.Z = 1.
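A minimal C sketch of a masked dword scatter, assuming AVX-512F compiler support and using the _mm512_mask_i32scatter_epi32 intrinsic (the i32scatter forms follow the same pattern as the i64scatter prototypes listed below); the table, indices, and values are illustrative only:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    int table[32] = {0};
    /* Scatter the values 100..115 to table[1], table[3], ..., table[31];
       scale 4 converts an element index into a byte offset. */
    __m512i vindex = _mm512_set_epi32(31, 29, 27, 25, 23, 21, 19, 17,
                                      15, 13, 11, 9, 7, 5, 3, 1);
    __m512i values = _mm512_set_epi32(115, 114, 113, 112, 111, 110, 109, 108,
                                      107, 106, 105, 104, 103, 102, 101, 100);
    /* Only elements whose mask bit is 1 are stored. */
    __mmask16 k = 0xFFFF;
    _mm512_mask_i32scatter_epi32(table, k, vindex, values, 4);
    for (int i = 0; i < 32; i++)
        printf("%d ", table[i]);    /* odd slots hold 100..115, even slots stay 0 */
    printf("\n");
    return 0;
}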
Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a ZMM register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1 or 4 byte displacement
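The per-element store behavior of the dword/dword form can be modeled in C roughly as follows. This is an
illustrative sketch only, not the architectural definition; the helper name vpscatterdd_model and its flat-array
arguments are assumptions made for the example.
#include <stdint.h>
#include <string.h>
/* Model of VPSCATTERDD zmm1 {k1}, [BASE_ADDR + VINDEX*SCALE + DISP]: 16 dword elements,
   each stored only if its mask bit is set; the mask bit is cleared as the element is
   committed, and the whole mask is zero on normal completion. */
static void vpscatterdd_model(uint8_t *base, int64_t disp, int scale,
                              const int32_t index[16], const uint32_t src[16],
                              uint16_t *k1)
{
    for (int j = 0; j < 16; j++) {
        if (*k1 & (1u << j)) {
            uint8_t *p = base + disp + (int64_t)index[j] * scale;
            memcpy(p, &src[j], sizeof(uint32_t));  /* 32-bit store to the computed address */
            *k1 &= (uint16_t)~(1u << j);           /* element committed, mask bit cleared */
        }
    }
}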
Intel C/C++ Compiler Intrinsic Equivalent
VPSCATTERQD void _mm256_mask_i64scatter_epi32(void * base, __mmask8 k, __m256i vdx, __m128i a, int scale);
VPSCATTERQD void _mm_mask_i64scatter_epi32(void * base, __mmask8 k, __m128i vdx, __m128i a, int scale);
VPSCATTERQQ void _mm512_i64scatter_epi64(void * base, __m512i vdx, __m512i a, int scale);
VPSCATTERQQ void _mm256_i64scatter_epi64(void * base, __m256i vdx, __m256i a, int scale);
VPSCATTERQQ void _mm_i64scatter_epi64(void * base, __m128i vdx, __m128i a, int scale);
VPSCATTERQQ void _mm512_mask_i64scatter_epi64(void * base, __mmask8 k, __m512i vdx, __m512i a, int scale);
VPSCATTERQQ void _mm256_mask_i64scatter_epi64(void * base, __mmask8 k, __m256i vdx, __m256i a, int scale);
VPSCATTERQQ void _mm_mask_i64scatter_epi64(void * base, __mmask8 k, __m128i vdx, __m128i a, int scale);
Other Exceptions
See Table 2-63, “Type E12 Class Exception Conditions.”
VPSHLD—Concatenate and Shift Packed Data Left Logical
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F3A.W1 70 /r /ib A V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHLDW xmm1{k1}{z}, xmm2, AND AVX512VL) extract result shifted to the left by constant
xmm3/m128, imm8 OR AVX10.11 value in imm8 into xmm1.
EVEX.256.66.0F3A.W1 70 /r /ib A V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHLDW ymm1{k1}{z}, ymm2, AND AVX512VL) extract result shifted to the left by constant
ymm3/m256, imm8 OR AVX10.11 value in imm8 into ymm1.
EVEX.512.66.0F3A.W1 70 /r /ib A V/V AVX512_VBMI2 Concatenate destination and source operands,
VPSHLDW zmm1{k1}{z}, zmm2, OR AVX10.11 extract result shifted to the left by constant
zmm3/m512, imm8 value in imm8 into zmm1.
EVEX.128.66.0F3A.W0 71 /r /ib B V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHLDD xmm1{k1}{z}, xmm2, AND AVX512VL) extract result shifted to the left by constant
xmm3/m128/m32bcst, imm8 OR AVX10.11 value in imm8 into xmm1.
EVEX.256.66.0F3A.W0 71 /r /ib B V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHLDD ymm1{k1}{z}, ymm2, AND AVX512VL) extract result shifted to the left by constant
ymm3/m256/m32bcst, imm8 OR AVX10.11 value in imm8 into ymm1.
EVEX.512.66.0F3A.W0 71 /r /ib B V/V AVX512_VBMI2 Concatenate destination and source operands,
VPSHLDD zmm1{k1}{z}, zmm2, OR AVX10.11 extract result shifted to the left by constant
zmm3/m512/m32bcst, imm8 value in imm8 into zmm1.
EVEX.128.66.0F3A.W1 71 /r /ib B V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHLDQ xmm1{k1}{z}, xmm2, AND AVX512VL) extract result shifted to the left by constant
xmm3/m128/m64bcst, imm8 OR AVX10.11 value in imm8 into xmm1.
EVEX.256.66.0F3A.W1 71 /r /ib B V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHLDQ ymm1{k1}{z}, ymm2, AND AVX512VL) extract result shifted to the left by constant
ymm3/m256/m64bcst, imm8 OR AVX10.11 value in imm8 into ymm1.
EVEX.512.66.0F3A.W1 71 /r /ib B V/V AVX512_VBMI2 Concatenate destination and source operands,
VPSHLDQ zmm1{k1}{z}, zmm2, OR AVX10.11 extract result shifted to the left by constant
zmm3/m512/m64bcst, imm8 value in imm8 into zmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Concatenate packed data, extract result shifted to the left by constant value.
This instruction supports memory fault suppression.
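For example, the word form (VPSHLDW) can be modeled per element in C as in the sketch below (illustrative only;
the helper name vpshldw_element is an assumption for the example):
#include <stdint.h>
/* Concatenate dest (upper word) and src2 (lower word), shift the 32-bit value left by
   imm8 mod 16, and keep the upper word of the result. */
static uint16_t vpshldw_element(uint16_t dest, uint16_t src2, unsigned imm8)
{
    uint32_t t = ((uint32_t)dest << 16) | src2;
    t <<= (imm8 & 15);
    return (uint16_t)(t >> 16);
}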
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
VPSHLDV—Concatenate and Variable Shift Packed Data Left Logical
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Concatenate packed data, extract result shifted to the left by variable value.
This instruction supports memory fault suppression.
Operation
FUNCTION concat(a,b):
IF words:
d.word[1] := a
d.word[0] := b
return d
ELSE IF dwords:
q.dword[1] := a
q.dword[0] := b
return q
ELSE IF qwords:
o.qword[1] := a
o.qword[0] := b
return o
VPSHLDVQ DEST, SRC2, SRC3
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1:
IF SRC3 is broadcast memop:
tsrc3 := SRC3.qword[0]
ELSE:
tsrc3 := SRC3.qword[j]
IF MaskBit(j) OR *no writemask*:
tmp := concat(DEST.qword[j], SRC2.qword[j]) << (tsrc3 & 63)
DEST.qword[j] := tmp.qword[1]
ELSE IF *zeroing*:
DEST.qword[j] := 0
*ELSE DEST.qword[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
VPSHRD—Concatenate and Shift Packed Data Right Logical
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F3A.W1 72 /r /ib A V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDW xmm1{k1}{z}, xmm2, AND AVX512VL) extract result shifted to the right by constant
xmm3/m128, imm8 OR AVX10.11 value in imm8 into xmm1.
EVEX.256.66.0F3A.W1 72 /r /ib A V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDW ymm1{k1}{z}, ymm2, AND AVX512VL) extract result shifted to the right by constant
ymm3/m256, imm8 OR AVX10.11 value in imm8 into ymm1.
EVEX.512.66.0F3A.W1 72 /r /ib A V/V AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDW zmm1{k1}{z}, zmm2, OR AVX10.11 extract result shifted to the right by constant
zmm3/m512, imm8 value in imm8 into zmm1.
EVEX.128.66.0F3A.W0 73 /r /ib B V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDD xmm1{k1}{z}, xmm2, AND AVX512VL) extract result shifted to the right by constant
xmm3/m128/m32bcst, imm8 OR AVX10.11 value in imm8 into xmm1.
EVEX.256.66.0F3A.W0 73 /r /ib B V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDD ymm1{k1}{z}, ymm2, AND AVX512VL) extract result shifted to the right by constant
ymm3/m256/m32bcst, imm8 OR AVX10.11 value in imm8 into ymm1.
EVEX.512.66.0F3A.W0 73 /r /ib B V/V AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDD zmm1{k1}{z}, zmm2, OR AVX10.11 extract result shifted to the right by constant
zmm3/m512/m32bcst, imm8 value in imm8 into zmm1.
EVEX.128.66.0F3A.W1 73 /r /ib B V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDQ xmm1{k1}{z}, xmm2, AND AVX512VL) extract result shifted to the right by constant
xmm3/m128/m64bcst, imm8 OR AVX10.11 value in imm8 into xmm1.
EVEX.256.66.0F3A.W1 73 /r /ib B V/V (AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDQ ymm1{k1}{z}, ymm2, AND AVX512VL) extract result shifted to the right by constant
ymm3/m256/m64bcst, imm8 OR AVX10.11 value in imm8 into ymm1.
EVEX.512.66.0F3A.W1 73 /r /ib B V/V AVX512_VBMI2 Concatenate destination and source operands,
VPSHRDQ zmm1{k1}{z}, zmm2, OR AVX10.11 extract result shifted to the right by constant
zmm3/m512/m64bcst, imm8 value in imm8 into zmm1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Concatenate packed data, extract result shifted to the right by constant value.
This instruction supports memory fault suppression.
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
VPSHRDV—Concatenate and Variable Shift Packed Data Right Logical
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Concatenate packed data, extract result shifted to the right by variable value.
This instruction supports memory fault suppression.
Operation
VPSHRDVW DEST, SRC2, SRC3
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1:
IF MaskBit(j) OR *no writemask*:
DEST.word[j] := concat(SRC2.word[j], DEST.word[j]) >> (SRC3.word[j] & 15)
ELSE IF *zeroing*:
DEST.word[j] := 0
*ELSE DEST.word[j] remains unchanged*
DEST[MAX_VL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VPSHRDVQ __m128i _mm_shrdv_epi64(__m128i, __m128i, __m128i);
VPSHRDVQ __m128i _mm_mask_shrdv_epi64(__m128i, __mmask8, __m128i, __m128i);
VPSHRDVQ __m128i _mm_maskz_shrdv_epi64(__mmask8, __m128i, __m128i, __m128i);
VPSHRDVQ __m256i _mm256_shrdv_epi64(__m256i, __m256i, __m256i);
VPSHRDVQ __m256i _mm256_mask_shrdv_epi64(__m256i, __mmask8, __m256i, __m256i);
VPSHRDVQ __m256i _mm256_maskz_shrdv_epi64(__mmask8, __m256i, __m256i, __m256i);
VPSHRDVQ __m512i _mm512_shrdv_epi64(__m512i, __m512i, __m512i);
VPSHRDVQ __m512i _mm512_mask_shrdv_epi64(__m512i, __mmask8, __m512i, __m512i);
VPSHRDVQ __m512i _mm512_maskz_shrdv_epi64(__mmask8, __m512i, __m512i, __m512i);
VPSHRDVD __m128i _mm_shrdv_epi32(__m128i, __m128i, __m128i);
VPSHRDVD __m128i _mm_mask_shrdv_epi32(__m128i, __mmask8, __m128i, __m128i);
VPSHRDVD __m128i _mm_maskz_shrdv_epi32(__mmask8, __m128i, __m128i, __m128i);
VPSHRDVD __m256i _mm256_shrdv_epi32(__m256i, __m256i, __m256i);
VPSHRDVD __m256i _mm256_mask_shrdv_epi32(__m256i, __mmask8, __m256i, __m256i);
VPSHRDVD __m256i _mm256_maskz_shrdv_epi32(__mmask8, __m256i, __m256i, __m256i);
VPSHRDVD __m512i _mm512_shrdv_epi32(__m512i, __m512i, __m512i);
VPSHRDVD __m512i _mm512_mask_shrdv_epi32(__m512i, __mmask16, __m512i, __m512i);
VPSHRDVD __m512i _mm512_maskz_shrdv_epi32(__mmask16, __m512i, __m512i, __m512i);
VPSHRDVW __m128i _mm_shrdv_epi16(__m128i, __m128i, __m128i);
VPSHRDVW __m128i _mm_mask_shrdv_epi16(__m128i, __mmask8, __m128i, __m128i);
VPSHRDVW __m128i _mm_maskz_shrdv_epi16(__mmask8, __m128i, __m128i, __m128i);
VPSHRDVW __m256i _mm256_shrdv_epi16(__m256i, __m256i, __m256i);
VPSHRDVW __m256i _mm256_mask_shrdv_epi16(__m256i, __mmask16, __m256i, __m256i);
VPSHRDVW __m256i _mm256_maskz_shrdv_epi16(__mmask16, __m256i, __m256i, __m256i);
VPSHRDVW __m512i _mm512_shrdv_epi16(__m512i, __m512i, __m512i);
VPSHRDVW __m512i _mm512_mask_shrdv_epi16(__m512i, __mmask32, __m512i, __m512i);
VPSHRDVW __m512i _mm512_maskz_shrdv_epi16(__mmask32, __m512i, __m512i, __m512i);
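A common use of the variable funnel shift is a per-element variable rotate: when the same register is passed as
both data operands, the right funnel shift becomes a rotate right. The sketch below assumes AVX512_VBMI2
support and uses the dword intrinsic listed above; the wrapper name is an assumption for the example.
#include <immintrin.h>
/* Rotate each dword of x right by the corresponding count in r (counts are taken mod 32).
   With x passed as both data operands, the low half of the funnel shift is a rotation of x. */
static __m512i rotate_right_var_epi32(__m512i x, __m512i r)
{
    return _mm512_shrdv_epi32(x, x, r);
}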
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
VPSHUFBITQMB—Shuffle Bits From Quadword Elements Using Byte Indexes Into Mask
Opcode/ Op/ 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W0 8F /r A V/V (AVX512_BITALG Extract values in xmm2 using control bits of
VPSHUFBITQMB k1{k2}, xmm2, AND AVX512VL) xmm3/m128 with writemask k2 and leave the
xmm3/m128 OR AVX10.11 result in mask register k1.
EVEX.256.66.0F38.W0 8F /r A V/V (AVX512_BITALG Extract values in ymm2 using control bits of
VPSHUFBITQMB k1{k2}, ymm2, AND AVX512VL) ymm3/m256 with writemask k2 and leave the
ymm3/m256 OR AVX10.11 result in mask register k1.
EVEX.512.66.0F38.W0 8F /r A V/V AVX512_BITALG Extract values in zmm2 using control bits of
VPSHUFBITQMB k1{k2}, zmm2, OR AVX10.11 zmm3/m512 with writemask k2 and leave the
zmm3/m512 result in mask register k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
The VPSHUFBITQMB instruction performs a bit gather select using second source as control and first source as
data. Each bit uses 6 control bits (2nd source operand) to select which data bit is going to be gathered (first source
operand). A given bit can only access 64 different bits of data (first 64 destination bits can access first 64 data bits,
second 64 destination bits can access second 64 data bits, etc.).
Control data for each output bit is stored in 8 bit elements of SRC2, but only the 6 least significant bits of each
element are used.
This instruction uses write masking (zeroing only). This instruction supports memory fault suppression.
The first source operand is a ZMM register. The second source operand is a ZMM register or a memory location. The
destination operand is a mask register.
Operation
VPSHUFBITQMB DEST, SRC1, SRC2
(KL, VL) = (16,128), (32,256), (64, 512)
FOR i := 0 TO KL/8-1: //Qword
FOR j := 0 to 7: // Byte
IF k2[i*8+j] or *no writemask*:
m := SRC2.qword[i].byte[j] & 0x3F
k1[i*8+j] := SRC1.qword[i].bit[m]
ELSE:
k1[i*8+j] := 0
k1[MAX_KL-1:KL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VPSHUFBITQMB __mmask16 _mm_bitshuffle_epi64_mask(__m128i, __m128i);
VPSHUFBITQMB __mmask16 _mm_mask_bitshuffle_epi64_mask(__mmask16, __m128i, __m128i);
VPSHUFBITQMB __mmask32 _mm256_bitshuffle_epi64_mask(__m256i, __m256i);
VPSHUFBITQMB __mmask32 _mm256_mask_bitshuffle_epi64_mask(__mmask32, __m256i, __m256i);
VPSHUFBITQMB __mmask64 _mm512_bitshuffle_epi64_mask(__m512i, __m512i);
VPSHUFBITQMB __mmask64 _mm512_mask_bitshuffle_epi64_mask(__mmask64, __m512i, __m512i);
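As an illustrative special case of the bit gather, control bytes of 0, 8, 16, ..., 56 within each qword select the
least significant bit of every data byte, producing one mask bit per byte. The scalar C sketch below models that
case; the helper name byte_lsb_mask is an assumption for the example.
#include <stdint.h>
/* Gather bit 8*j of each data qword (the LSB of every byte) into one mask bit per byte. */
static uint64_t byte_lsb_mask(const uint64_t data[8])
{
    uint64_t k = 0;
    for (int i = 0; i < 8; i++) {          /* qword index */
        for (int j = 0; j < 8; j++) {      /* byte index; control byte value is 8*j */
            uint64_t bit = (data[i] >> (8 * j)) & 1;
            k |= bit << (i * 8 + j);
        }
    }
    return k;
}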
VPSLLVW/VPSLLVD/VPSLLVQ—Variable Bit Shift Left Logical
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
VEX.128.66.0F38.W0 47 /r A V/V AVX2 Shift doublewords in xmm2 left by amount
VPSLLVD xmm1, xmm2, xmm3/m128 specified in the corresponding element of
xmm3/m128 while shifting in 0s.
VEX.128.66.0F38.W1 47 /r A V/V AVX2 Shift quadwords in xmm2 left by amount specified
VPSLLVQ xmm1, xmm2, xmm3/m128 in the corresponding element of xmm3/m128
while shifting in 0s.
VEX.256.66.0F38.W0 47 /r A V/V AVX2 Shift doublewords in ymm2 left by amount
VPSLLVD ymm1, ymm2, ymm3/m256 specified in the corresponding element of
ymm3/m256 while shifting in 0s.
VEX.256.66.0F38.W1 47 /r A V/V AVX2 Shift quadwords in ymm2 left by amount specified
VPSLLVQ ymm1, ymm2, ymm3/m256 in the corresponding element of ymm3/m256
while shifting in 0s.
EVEX.128.66.0F38.W1 12 /r B V/V (AVX512VL AND Shift words in xmm2 left by amount specified in
VPSLLVW xmm1 {k1}{z}, xmm2, AVX512BW) OR the corresponding element of xmm3/m128 while
xmm3/m128 AVX10.11 shifting in 0s using writemask k1.
EVEX.256.66.0F38.W1 12 /r B V/V (AVX512VL AND Shift words in ymm2 left by amount specified in
VPSLLVW ymm1 {k1}{z}, ymm2, AVX512BW) OR the corresponding element of ymm3/m256 while
ymm3/m256 AVX10.11 shifting in 0s using writemask k1.
EVEX.512.66.0F38.W1 12 /r B V/V AVX512BW Shift words in zmm2 left by amount specified in
VPSLLVW zmm1 {k1}{z}, zmm2, OR AVX10.11 the corresponding element of zmm3/m512 while
zmm3/m512 shifting in 0s using writemask k1.
EVEX.128.66.0F38.W0 47 /r C V/V (AVX512VL AND Shift doublewords in xmm2 left by amount
VPSLLVD xmm1 {k1}{z}, xmm2, AVX512F) OR specified in the corresponding element of
xmm3/m128/m32bcst AVX10.11 xmm3/m128/m32bcst while shifting in 0s using
writemask k1.
EVEX.256.66.0F38.W0 47 /r C V/V (AVX512VL AND Shift doublewords in ymm2 left by amount
VPSLLVD ymm1 {k1}{z}, ymm2, AVX512F) OR specified in the corresponding element of
ymm3/m256/m32bcst AVX10.11 ymm3/m256/m32bcst while shifting in 0s using
writemask k1.
EVEX.512.66.0F38.W0 47 /r C V/V AVX512F Shift doublewords in zmm2 left by amount
VPSLLVD zmm1 {k1}{z}, zmm2, OR AVX10.11 specified in the corresponding element of
zmm3/m512/m32bcst zmm3/m512/m32bcst while shifting in 0s using
writemask k1.
EVEX.128.66.0F38.W1 47 /r C V/V (AVX512VL AND Shift quadwords in xmm2 left by amount specified
VPSLLVQ xmm1 {k1}{z}, xmm2, AVX512F) OR in the corresponding element of
xmm3/m128/m64bcst AVX10.11 xmm3/m128/m64bcst while shifting in 0s using
writemask k1.
EVEX.256.66.0F38.W1 47 /r C V/V (AVX512VL AND Shift quadwords in ymm2 left by amount specified
VPSLLVQ ymm1 {k1}{z}, ymm2, AVX512F) OR in the corresponding element of
ymm3/m256/m64bcst AVX10.11 ymm3/m256/m64bcst while shifting in 0s using
writemask k1.
EVEX.512.66.0F38.W1 47 /r C V/V AVX512F Shift quadwords in zmm2 left by amount specified
VPSLLVQ zmm1 {k1}{z}, zmm2, OR AVX10.11 in the corresponding element of
zmm3/m512/m64bcst zmm3/m512/m64bcst while shifting in 0s using
writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Shifts the bits in the individual data elements (words, doublewords or quadword) in the first source operand to the
left by the count value of respective data elements in the second source operand. As the bits in the data elements
are shifted left, the empty low-order bits are cleared (set to 0).
The count values are specified individually in each data element of the second source operand. If the unsigned
integer value specified in the respective data element of the second source operand is greater than 15 (for words),
31 (for doublewords), or 63 (for quadwords), then the destination data element is written with 0.
VEX.128 encoded version: The destination and first source operands are XMM registers. The count operand can be
either an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination register
are zeroed.
VEX.256 encoded version: The destination and first source operands are YMM registers. The count operand can be
either a YMM register or a 256-bit memory location. Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.
EVEX encoded VPSLLVD/Q: The destination and first source operands are ZMM/YMM/XMM registers. The count
operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512-bit vector broad-
casted from a 32/64-bit memory location. The destination is conditionally updated with writemask k1.
EVEX encoded VPSLLVW: The destination and first source operands are ZMM/YMM/XMM registers. The count
operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination is condition-
ally updated with writemask k1.
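For example, with the AVX2 encoding the shift counts come from another vector, and any count of 32 or more
simply zeroes that lane. A brief sketch (the function name and count values are assumptions for the example):
#include <immintrin.h>
/* Shift eight dwords left by per-lane counts; the lane with count 40 (>= 32) becomes 0. */
static __m256i shift_fields_left(__m256i values)
{
    __m256i counts = _mm256_set_epi32(40, 31, 16, 8, 4, 2, 1, 0);
    return _mm256_sllv_epi32(values, counts);
}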
Operation
VPSLLVW (EVEX encoded version)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := ZeroExtend(SRC1[i+15:i] << SRC2[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0;
Other Exceptions
VEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPSLLVD/VPSLLVQ, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPSLLVW, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
VPSRAVW/VPSRAVD/VPSRAVQ—Variable Bit Shift Right Arithmetic
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Shifts the bits in the individual data elements (word/doublewords/quadword) in the first source operand (the
second operand) to the right by the number of bits specified in the count value of respective data elements in the
second source operand (the third operand). As the bits in the data elements are shifted right, the empty high-order
bits are set to the MSB (sign extension).
The count values are specified individually in each data element of the second source operand. If the unsigned
integer value specified in the respective data element of the second source operand is greater than 15 (for words),
31 (for doublewords), or 63 (for a quadword), then the destination data element is filled with the corresponding
sign bit of the source element.
VEX.128 encoded version: The destination and first source operands are XMM registers. The count operand can be
either an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination register
are zeroed.
VEX.256 encoded version: The destination and first source operands are YMM registers. The count operand can be
either a YMM register or a 256-bit memory location. Bits (MAXVL-1:256) of the corresponding destination register are
zeroed.
EVEX.512/256/128 encoded VPSRAVD/Q: The destination and first source operands are ZMM/YMM/XMM registers.
The count operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a
512/256/128-bit vector broadcasted from a 32/64-bit memory location. The destination is conditionally updated
with writemask k1.
EVEX.512/256/128 encoded VPSRAVW: The destination and first source operands are ZMM/YMM/XMM registers.
The count operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination
is conditionally updated with writemask k1.
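For example, shifting each doubleword right arithmetically by 31 yields 0 for non-negative lanes and -1 (all ones)
for negative lanes, which is a convenient per-lane sign mask. A brief AVX2 sketch (the function name is an
assumption for the example):
#include <immintrin.h>
/* Per-lane sign mask: 0 for values >= 0, 0xFFFFFFFF for negative values. */
static __m256i sign_mask_epi32(__m256i values)
{
    return _mm256_srav_epi32(values, _mm256_set1_epi32(31));
}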
Operation
VPSRAVW (EVEX encoded version)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN
COUNT := SRC2[i+15:i]
IF COUNT < 16
THEN DEST[i+15:i] := SignExtend(SRC1[i+15:i] >> COUNT)
ELSE
FOR k := 0 TO 15
DEST[i+k] := SRC1[i+15]
ENDFOR;
FI
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0;
Other Exceptions
Non-EVEX-encoded instruction, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
VPSRLVW/VPSRLVD/VPSRLVQ—Variable Bit Shift Right Logical
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Shifts the bits in the individual data elements (words, doublewords or quadword) in the first source operand to the
right by the count value of respective data elements in the second source operand. As the bits in the data elements
are shifted right, the empty high-order bits are cleared (set to 0).
The count values are specified individually in each data element of the second source operand. If the unsigned
integer value specified in the respective data element of the second source operand is greater than 15 (for words),
31 (for doublewords), or 63 (for quadwords), then the destination data element is written with 0.
VEX.128 encoded version: The destination and first source operands are XMM registers. The count operand can be
either an XMM register or a 128-bit memory location. Bits (MAXVL-1:128) of the corresponding destination register
are zeroed.
VEX.256 encoded version: The destination and first source operands are YMM registers. The count operand can be
either a YMM register or a 256-bit memory location. Bits (MAXVL-1:256) of the corresponding ZMM register are zeroed.
EVEX encoded VPSRLVD/Q: The destination and first source operands are ZMM/YMM/XMM registers. The count
operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512-bit vector broad-
casted from a 32/64-bit memory location. The destination is conditionally updated with writemask k1.
EVEX encoded VPSRLVW: The destination and first source operands are ZMM/YMM/XMM registers. The count
operand can be either a ZMM/YMM/XMM register, a 512/256/128-bit memory location. The destination is condition-
ally updated with writemask k1.
Operation
VPSRLVW (EVEX encoded version)
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j * 16
IF k1[j] OR *no writemask*
THEN DEST[i+15:i] := ZeroExtend(SRC1[i+15:i] >> SRC2[i+15:i])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+15:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+15:i] := 0
FI
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0;
Other Exceptions
VEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded VPSRLVD/Q, see Table 2-51, “Type E4 Class Exception Conditions.”
EVEX-encoded VPSRLVW, see Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
VPTERNLOGD/VPTERNLOGQ—Bitwise Ternary Logic
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Table 5-20. Examples of VPTERNLOGD/Q Imm8 Boolean Function and Input Index Values
(VPTERNLOGD reg1, reg2, src3, 0xE2 and VPTERNLOGD reg1, reg2, src3, 0xE4)
Bit(reg1)  Bit(reg2)  Bit(src3)  Bit Result with Imm8=0xE2  Bit Result with Imm8=0xE4
0          0          0          0                          0
0          0          1          1                          0
0          1          0          0                          1
0          1          1          0                          0
1          0          0          0                          0
1          0          1          1                          1
1          1          0          1                          1
1          1          1          1                          1
Specifying different values in imm8 allows any three-input Boolean function to be implemented in software using
VPTERNLOGD/Q. Table 5-1 and Table 5-2 provide a mapping of all 256 possible imm8 values to various Boolean
expressions.
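For example, the imm8 value for a desired function can be generated by evaluating the function on all eight input
combinations, using the same index ordering as the Operation section below (the destination bit is the most
significant selector). The C sketch that follows is illustrative; the helper names are assumptions for the example.
#include <stdint.h>
/* Build a VPTERNLOGD/Q imm8 from a three-input Boolean function f(dest, src1, src2). */
static uint8_t ternlog_imm8(int (*f)(int, int, int))
{
    uint8_t imm = 0;
    for (int idx = 0; idx < 8; idx++) {
        int dest = (idx >> 2) & 1, src1 = (idx >> 1) & 1, src2 = idx & 1;
        if (f(dest, src1, src2))
            imm |= (uint8_t)(1u << idx);
    }
    return imm;
}
static int and_or(int a, int b, int c) { return (a & b) | c; }
/* ternlog_imm8(and_or) evaluates to 0xEA, the imm8 for (dest AND src1) OR src2. */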
Operation
VPTERNLOGD (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN
FOR k := 0 TO 31
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[j][k] := imm[(DEST[i+k] << 2) + (SRC1[ i+k ] << 1) + SRC2[ k ]]
ELSE DEST[j][k] := imm[(DEST[i+k] << 2) + (SRC1[ i+k ] << 1) + SRC2[ i+k ]]
FI;
; table lookup of immediate below;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31+i:i] remains unchanged*
ELSE ; zeroing-masking
DEST[31+i:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
VPTESTMB/VPTESTMW/VPTESTMD/VPTESTMQ—Logical AND and Set Mask
Description
Performs a bitwise logical AND operation on the first source operand (the second operand) and the second source
operand (the third operand) and stores the result in the destination operand (the first operand) under the
writemask. Each bit of the result is set to 1 if the bitwise AND of the corresponding elements of the first and second
source operands is non-zero; otherwise it is set to 0.
VPTESTMD/VPTESTMQ: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register, a 512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a
32/64-bit memory location. The destination operand is a mask register updated under the writemask.
VPTESTMB/VPTESTMW: The first source operand is a ZMM/YMM/XMM register. The second source operand can be a
ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a mask register
updated under the writemask.
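For example, testing each element against a single-bit flag constant produces a mask of the elements in which that
flag is set. The sketch below assumes AVX512F and uses the qword intrinsic form; the wrapper name is an
assumption for the example.
#include <immintrin.h>
#include <stdint.h>
/* Return a mask with bit j set when the flag bit(s) are set in element j of v. */
static __mmask8 elements_with_flag(__m512i v, uint64_t flag)
{
    return _mm512_test_epi64_mask(v, _mm512_set1_epi64((long long)flag));
}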
Operation
VPTESTMB (EVEX encoded versions)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j * 8
IF k1[j] OR *no writemask*
THEN DEST[j] := (SRC1[i+7:i] BITWISE AND SRC2[i+7:i] != 0)? 1 : 0;
ELSE DEST[j] = 0 ; zeroing-masking only
FI;
ENDFOR
DEST[MAX_KL-1:KL] := 0
Other Exceptions
VPTESTMD/Q: See Table 2-51, “Type E4 Class Exception Conditions.”
VPTESTMB/W: See Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
VPTESTNMB/W/D/Q—Logical NAND and Set
Description
Performs a bitwise logical NAND operation on the byte/word/doubleword/quadword elements of the first source
operand (the second operand) with the corresponding elements of the second source operand (the third operand)
and stores the logical comparison result into each bit of the destination operand (the first operand) according to
the writemask k1. Each bit of the result is set to 1 if the bitwise AND of the corresponding elements of the first and
second source operands is zero; otherwise it is set to 0.
EVEX encoded VPTESTNMD/Q: The first source operand is a ZMM/YMM/XMM register. The second source operand
can be a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted
from a 32/64-bit memory location. The destination is updated according to the writemask.
EVEX encoded VPTESTNMB/W: The first source operand is a ZMM/YMM/XMM register. The second source operand
can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination is updated according to
the writemask.
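A typical use of the NAND test is zero detection: testing a vector against itself sets a mask bit exactly for the
elements that are zero. A brief sketch assuming AVX512F (the wrapper name is an assumption for the example):
#include <immintrin.h>
/* Mask of the dword elements of v that are zero (v AND v is zero only when v is zero). */
static __mmask16 zero_lanes_epi32(__m512i v)
{
    return _mm512_testn_epi32_mask(v, v);
}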
Operation
VPTESTNMB
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j := 0 TO KL-1
i := j*8
IF MaskBit(j) OR *no writemask*
THEN
DEST[j] := (SRC1[i+7:i] BITWISE AND SRC2[i+7:i] == 0)? 1 : 0
ELSE DEST[j] := 0; zeroing masking only
FI
ENDFOR
DEST[MAX_KL-1:KL] := 0
VPTESTNMW
(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j := 0 TO KL-1
i := j*16
IF MaskBit(j) OR *no writemask*
THEN
DEST[j] := (SRC1[i+15:i] BITWISE AND SRC2[i+15:i] == 0)? 1 : 0
ELSE DEST[j] := 0; zeroing masking only
FI
ENDFOR
DEST[MAX_KL-1:KL] := 0
VPTESTNMQ
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j*64
IF MaskBit(j) OR *no writemask*
THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[j] := (SRC1[i+63:i] BITWISE AND SRC2[63:0] == 0)? 1 : 0;
ELSE DEST[j] := (SRC1[i+63:i] BITWISE AND SRC2[i+63:i] == 0)? 1 : 0;
FI;
ELSE DEST[j] := 0; zeroing masking only
FI
ENDFOR
DEST[MAX_KL-1:KL] := 0
Other Exceptions
VPTESTNMD/VPTESTNMQ: See Table 2-51, “Type E4 Class Exception Conditions.”
VPTESTNMB/VPTESTNMW: See Exceptions Type E4.nb in Table 2-51, “Type E4 Class Exception Conditions.”
VRANGEPD—Range Restriction Calculation for Packed Pairs of Float64 Values
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction calculates 2/4/8 range operation outputs from two sets of packed input double precision floating-
point values in the first source operand (the second operand) and the second source operand (the third operand).
The range outputs are written to the destination operand (the first operand) under the writemask k1.
Bits 7:4 of the imm8 byte must be zero. The range operation output is performed in two parts, each configured by a
two-bit control field within imm8[3:0]:
• Imm8[1:0] specifies the initial comparison operation to be one of max, min, max absolute value or min
absolute value of the input value pair. Each comparison of two input values produces an intermediate result
that combines with the sign selection control (imm8[3:2]) to determine the final range operation output.
• Imm8[3:2] specifies the sign of the range operation output to be one of the following: from the first input
value, from the comparison result, set or clear.
The encodings of imm8[1:0] and imm8[3:2] are shown in Figure 5-27.
When one or more of the input values is a NaN, the comparison operation may signal an invalid exception (IE). The
results for cases in which one or more input values are NaN are listed in Table 5-21. If the comparison raises an IE,
the sign select control (imm8[3:2]) has no effect on the range operation output; this is also indicated in Table 5-21.
When both input values are zeros of opposite sign, the MIN/MAX comparison performed by the range operation
differs slightly from the conceptually similar floating-point MIN/MAX operation found in the instructions
VMAXPD/VMINPD. The details of the MIN/MAX/MIN_ABS/MAX_ABS operation for VRANGEPD/PS/SD/SS for
magnitude-0, opposite-signed input cases are listed in Table 5-22.
Additionally, non-zero input values of equal magnitude and opposite sign produce the MIN_ABS or MAX_ABS
comparison results listed in Table 5-23.
Table 5-21. Signaling of Comparison Operation of One or More NaN Input Values and Effect of Imm8[3:2]
Src1 Src2 Result IE Signaling Due to Comparison Imm8[3:2] Effect to Range Output
sNaN1 sNaN2 Quiet(sNaN1) Yes Ignored
sNaN1 qNaN2 Quiet(sNaN1) Yes Ignored
sNaN1 Norm2 Quiet(sNaN1) Yes Ignored
qNaN1 sNaN2 Quiet(sNaN2) Yes Ignored
qNaN1 qNaN2 qNaN1 No Applicable
qNaN1 Norm2 Norm2 No Applicable
Norm1 sNaN2 Quiet(sNaN2) Yes Ignored
Norm1 qNaN2 Norm1 No Applicable
Table 5-22. Comparison Result for Opposite-Signed Zero Cases for MIN, MIN_ABS, and MAX, MAX_ABS
MIN and MIN_ABS MAX and MAX_ABS
Src1 Src2 Result Src1 Src2 Result
+0 -0 -0 +0 -0 +0
-0 +0 -0 -0 +0 +0
Table 5-23. Comparison Result of Equal-Magnitude Input Cases for MIN_ABS and MAX_ABS, (|a| = |b|, a>0, b<0)
MIN_ABS (|a| = |b|, a>0, b<0) MAX_ABS (|a| = |b|, a>0, b<0)
Src1 Src2 Result Src1 Src2 Result
a b b a b a
b a b b a a
Operation
RangeDP(SRC1[63:0], SRC2[63:0], CmpOpCtl[1:0], SignSelCtl[1:0])
{
// Check if SNAN and report IE, see also Table 5-21
IF (SRC1 = SNAN) THEN RETURN (QNAN(SRC1), set IE);
IF (SRC2 = SNAN) THEN RETURN (QNAN(SRC2), set IE);
Src1.exp := SRC1[62:52];
Src1.fraction := SRC1[51:0];
IF ((Src1.exp = 0 ) and (Src1.fraction != 0)) THEN// Src1 is a denormal number
IF DAZ THEN Src1.fraction := 0;
ELSE IF (SRC2 <> QNAN) Set DE; FI;
FI;
Src2.exp := SRC2[62:52];
Src2.fraction := SRC2[51:0];
IF ((Src2.exp = 0) and (Src2.fraction !=0 )) THEN// Src2 is a denormal number
IF DAZ THEN Src2.fraction := 0;
ELSE IF (SRC1 <> QNAN) Set DE; FI;
FI;
Case(SignSelCtl[1:0])
00: dest := (SRC1[63] << 63) OR (TMP[62:0]);// Preserve Src1 sign bit
01: dest := TMP[63:0];// Preserve sign of compare result
10: dest := (0 << 63) OR (TMP[62:0]);// Zero out sign bit
11: dest := (1 << 63) OR (TMP[62:0]);// Set the sign bit
ESAC;
RETURN dest[63:0];
}
CmpOpCtl[1:0]= imm8[1:0];
SignSelCtl[1:0]=imm8[3:2];
VRANGEPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := RangeDP (SRC1[i+63:i], SRC2[63:0], CmpOpCtl[1:0], SignSelCtl[1:0]);
ELSE DEST[i+63:i] := RangeDP (SRC1[i+63:i], SRC2[i+63:i], CmpOpCtl[1:0], SignSelCtl[1:0]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] = 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
The following example describes a common usage of this instruction for checking that the input operand is
bounded between ±1023.
VRANGEPD zmm_dst, zmm_src, zmm_1023, 02h;
Where:
zmm_dst is the destination operand.
zmm_src is the input operand to compare against ±1023 (this is SRC1).
zmm_1023 is the reference operand, contains the value of 1023 (and this is SRC2).
IMM=02(imm8[1:0]='10) selects the Min Absolute value operation with selection of SRC1.sign.
In case |zmm_src| < 1023 (i.e., SRC1 is smaller than 1023 in magnitude), then its value will be written into
zmm_dst. Otherwise, the value stored in zmm_dst will get the value of 1023 (received on zmm_1023, which is
SRC2).
However, the sign control (imm8[3:2]='00) instructs to select the sign of SRC1 received from zmm_src. So, even
in the case of |zmm_src| ≥ 1023, the selected sign of SRC1 is kept.
Thus, if zmm_src < -1023, the result of VRANGEPD will be the minimal value of -1023 while if zmm_src > +1023,
the result of VRANGE will be the maximal value of +1023.
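Assuming AVX512DQ support, the same ±1023 clamp can be written with the packed intrinsic as sketched below
(the wrapper name is an assumption for the example; imm8 = 02h selects min-absolute-value with the sign taken
from the first source):
#include <immintrin.h>
/* Clamp each double in x to [-1023.0, +1023.0] while preserving its sign. */
static __m512d clamp_to_1023(__m512d x)
{
    return _mm512_range_pd(x, _mm512_set1_pd(1023.0), 0x02);
}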
SIMD Floating-Point Exceptions
Invalid, Denormal.
Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
VRANGEPS—Range Restriction Calculation for Packed Pairs of Float32 Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F3A.W0 50 /r ib A V/V (AVX512VL Calculate four RANGE operation output value from
VRANGEPS xmm1 {k1}{z}, xmm2, AND AVX512DQ) 4 pairs of single-precision floating-point values in
xmm3/m128/m32bcst, imm8 OR AVX10.11 xmm2 and xmm3/m128/m32bcst, store the results
to xmm1 under the writemask k1. Imm8 specifies
the comparison and sign of the range operation.
EVEX.256.66.0F3A.W0 50 /r ib A V/V (AVX512VL Calculate eight RANGE operation output value from
VRANGEPS ymm1 {k1}{z}, ymm2, AND AVX512DQ) 8 pairs of single-precision floating-point values in
ymm3/m256/m32bcst, imm8 OR AVX10.11 ymm2 and ymm3/m256/m32bcst, store the results
to ymm1 under the writemask k1. Imm8 specifies
the comparison and sign of the range operation.
EVEX.512.66.0F3A.W0 50 /r ib A V/V AVX512DQ Calculate 16 RANGE operation output value from
VRANGEPS zmm1 {k1}{z}, zmm2, OR AVX10.11 16 pairs of single-precision floating-point values in
zmm3/m512/m32bcst{sae}, imm8 zmm2 and zmm3/m512/m32bcst, store the results
to zmm1 under the writemask k1. Imm8 specifies
the comparison and sign of the range operation.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction calculates 4/8/16 range operation outputs from two sets of packed input single precision floating-
point values in the first source operand (the second operand) and the second source operand (the third operand).
The range outputs are written to the destination operand (the first operand) under the writemask k1.
Bits 7:4 of the imm8 byte must be zero. The range operation output is performed in two parts, each configured by a
two-bit control field within imm8[3:0]:
• Imm8[1:0] specifies the initial comparison operation to be one of max, min, max absolute value or min
absolute value of the input value pair. Each comparison of two input values produces an intermediate result
that combines with the sign selection control (imm8[3:2]) to determine the final range operation output.
• Imm8[3:2] specifies the sign of the range operation output to be one of the following: from the first input
value, from the comparison result, set or clear.
The encodings of imm8[1:0] and imm8[3:2] are shown in Figure 5-27.
When one or more of the input values is a NaN, the comparison operation may signal an invalid exception (IE). The
results for cases in which one or more input values are NaN are listed in Table 5-21. If the comparison raises an IE,
the sign select control (imm8[3:2]) has no effect on the range operation output; this is also indicated in Table 5-21.
When both input values are zeros of opposite sign, the MIN/MAX comparison performed by the range operation
differs slightly from the conceptually similar floating-point MIN/MAX operation found in the instructions
VMAXPD/VMINPD. The details of the MIN/MAX/MIN_ABS/MAX_ABS operation for VRANGEPD/PS/SD/SS for
magnitude-0, opposite-signed input cases are listed in Table 5-22.
Additionally, non-zero input values of equal magnitude and opposite sign produce the MIN_ABS or MAX_ABS
comparison results listed in Table 5-23.
Operation
RangeSP(SRC1[31:0], SRC2[31:0], CmpOpCtl[1:0], SignSelCtl[1:0])
{
// Check if SNAN and report IE, see also Table 5-21
IF (SRC1=SNAN) THEN RETURN (QNAN(SRC1), set IE);
IF (SRC2=SNAN) THEN RETURN (QNAN(SRC2), set IE);
Src1.exp := SRC1[30:23];
Src1.fraction := SRC1[22:0];
IF ((Src1.exp = 0 ) and (Src1.fraction != 0 )) THEN// Src1 is a denormal number
IF DAZ THEN Src1.fraction := 0;
ELSE IF (SRC2 <> QNAN) Set DE; FI;
FI;
Src2.exp := SRC2[30:23];
Src2.fraction := SRC2[22:0];
IF ((Src2.exp = 0 ) and (Src2.fraction != 0 )) THEN// Src2 is a denormal number
IF DAZ THEN Src2.fraction := 0;
ELSE IF (SRC1 <> QNAN) Set DE; FI;
FI;
CmpOpCtl[1:0]= imm8[1:0];
SignSelCtl[1:0]=imm8[3:2];
VRANGEPS
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := RangeSP (SRC1[i+31:i], SRC2[31:0], CmpOpCtl[1:0], SignSelCtl[1:0]);
ELSE DEST[i+31:i] := RangeSP (SRC1[i+31:i], SRC2[i+31:i], CmpOpCtl[1:0], SignSelCtl[1:0]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] = 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
The following example describes a common usage of this instruction for checking that the input operand is
bounded between ±150.
VRANGEPS zmm_dst, zmm_src, zmm_150, 02h;
Where:
zmm_dst is the destination operand.
zmm_src is the input operand to compare against ±150.
zmm_150 is the reference operand, contains the value of 150.
IMM=02(imm8[1:0]=’10) selects the Min Absolute value operation with selection of src1.sign.
In case |zmm_src| < 150, then its value will be written into zmm_dst. Otherwise, the value stored in zmm_dst
will get the value of 150 (received on zmm_150).
However, the sign control (imm8[3:2]=’00) instructs to select the sign of SRC1 received from zmm_src. So, even
in the case of |zmm_src| ≥ 150, the selected sign of SRC1 is kept.
Thus, if zmm_src < -150, the result of VRANGEPS will be the minimal value of -150 while if zmm_src > +150,
the result of VRANGE will be the maximal value of +150.
SIMD Floating-Point Exceptions
Invalid, Denormal.
Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
VRANGESD—Range Restriction Calculation From a Pair of Scalar Float64 Values
Opcode/ Op / 64/32 CPUID Description
Instruction En bit Mode Feature Flag
Support
EVEX.LLIG.66.0F3A.W1 51 /r A V/V AVX512DQ Calculate a RANGE operation output value from 2 double
VRANGESD xmm1 {k1}{z}, OR AVX10.11 precision floating-point values in xmm2 and xmm3/m64,
xmm2, xmm3/m64{sae}, imm8 store the output to xmm1 under writemask. Imm8
specifies the comparison and sign of the range operation.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction calculates a range operation output from two input double precision floating-point values in the low
qword element of the first source operand (the second operand) and second source operand (the third operand).
The range output is written to the low qword element of the destination operand (the first operand) under the
writemask k1.
Bits 7:4 of the imm8 byte must be zero. The range operation output is performed in two parts, each configured by a
two-bit control field within imm8[3:0]:
• Imm8[1:0] specifies the initial comparison operation to be one of max, min, max absolute value or min
absolute value of the input value pair. Each comparison of two input values produces an intermediate result
that combines with the sign selection control (imm8[3:2]) to determine the final range operation output.
• Imm8[3:2] specifies the sign of the range operation output to be one of the following: from the first input
value, from the comparison result, set or clear.
The encodings of imm8[1:0] and imm8[3:2] are shown in Figure 5-27.
Bits 127:64 of the destination operand are copied from the respective element of the first source operand.
When one or more of the input values is a NaN, the comparison operation may signal an invalid exception (IE). The
results for cases in which one or more input values are NaN are listed in Table 5-21. If the comparison raises an IE,
the sign select control (imm8[3:2]) has no effect on the range operation output; this is also indicated in Table 5-21.
When both input values are zeros of opposite sign, the MIN/MAX comparison performed by the range operation
differs slightly from the conceptually similar floating-point MIN/MAX operation found in the instructions
VMAXPD/VMINPD. The details of the MIN/MAX/MIN_ABS/MAX_ABS operation for VRANGEPD/PS/SD/SS for
magnitude-0, opposite-signed input cases are listed in Table 5-22.
Additionally, non-zero input values of equal magnitude and opposite sign produce the MIN_ABS or MAX_ABS
comparison results listed in Table 5-23.
Operation
RangeDP(SRC1[63:0], SRC2[63:0], CmpOpCtl[1:0], SignSelCtl[1:0])
{
// Check if SNAN and report IE, see also Table 5-21
IF (SRC1 = SNAN) THEN RETURN (QNAN(SRC1), set IE);
IF (SRC2 = SNAN) THEN RETURN (QNAN(SRC2), set IE);
Src1.exp := SRC1[62:52];
Src1.fraction := SRC1[51:0];
IF ((Src1.exp = 0 ) and (Src1.fraction != 0)) THEN// Src1 is a denormal number
IF DAZ THEN Src1.fraction := 0;
ELSE IF (SRC2 <> QNAN) Set DE; FI;
FI;
Src2.exp := SRC2[62:52];
Src2.fraction := SRC2[51:0];
IF ((Src2.exp = 0) and (Src2.fraction !=0 )) THEN// Src2 is a denormal number
IF DAZ THEN Src2.fraction := 0;
ELSE IF (SRC1 <> QNAN) Set DE; FI;
FI;
Case(SignSelCtl[1:0])
00: dest := (SRC1[63] << 63) OR (TMP[62:0]);// Preserve Src1 sign bit
01: dest := TMP[63:0];// Preserve sign of compare result
10: dest := (0 << 63) OR (TMP[62:0]);// Zero out sign bit
11: dest := (1 << 63) OR (TMP[62:0]);// Set the sign bit
ESAC;
RETURN dest[63:0];
}
CmpOpCtl[1:0]= imm8[1:0];
SignSelCtl[1:0]=imm8[3:2];
VRANGESD
IF k1[0] OR *no writemask*
THEN DEST[63:0] := RangeDP (SRC1[63:0], SRC2[63:0], CmpOpCtl[1:0], SignSelCtl[1:0]);
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] = 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0
The following example describes a common usage of this instruction for checking that the input operand is
bounded between ±1023.
VRANGESD xmm_dst, xmm_src, xmm_1023, 02h;
Where:
xmm_dst is the destination operand.
xmm_src is the input operand to compare against ±1023.
xmm_1023 is the reference operand, contains the value of 1023.
IMM=02(imm8[1:0]=’10) selects the Min Absolute value operation with selection of src1.sign.
In case |xmm_src| < 1023, then its value will be written into xmm_dst. Otherwise, the value stored in xmm_dst
will get the value of 1023 (received on xmm_1023).
However, the sign control (imm8[3:2]=’00) instructs to select the sign of SRC1 received from xmm_src. So, even
in the case of |xmm_src| ≥ 1023, the selected sign of SRC1 is kept.
Thus, if xmm_src < -1023, the result of VRANGESD will be the minimal value of -1023, while if xmm_src > +1023,
the result of VRANGE will be the maximal value of +1023.
SIMD Floating-Point Exceptions
Invalid, Denormal.
Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”
VRANGESS—Range Restriction Calculation From a Pair of Scalar Float32 Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.LLIG.66.0F3A.W0 51 /r A V/V AVX512DQ Calculate a RANGE operation output value from 2
VRANGESS xmm1 {k1}{z}, OR AVX10.11 single-precision floating-point values in xmm2 and
xmm2, xmm3/m32{sae}, imm8 xmm3/m32, store the output to xmm1 under
writemask. Imm8 specifies the comparison and sign of
the range operation.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction calculates a range operation output from two input single precision floating-point values in the low
dword element of the first source operand (the second operand) and second source operand (the third operand).
The range output is written to the low dword element of the destination operand (the first operand) under the
writemask k1.
Bits 7:4 of the imm8 byte must be zero. The range operation output is performed in two parts, each configured by a
two-bit control field within imm8[3:0]:
• Imm8[1:0] specifies the initial comparison operation to be one of max, min, max absolute value or min
absolute value of the input value pair. Each comparison of two input values produces an intermediate result
that combines with the sign selection control (imm8[3:2]) to determine the final range operation output.
• Imm8[3:2] specifies the sign of the range operation output to be one of the following: from the first input
value, from the comparison result, set or clear.
The encodings of imm8[1:0] and imm8[3:2] are shown in Figure 5-27.
Bits 127:32 of the destination operand are copied from the respective elements of the first source operand.
When one or more of the input values is a NaN, the comparison operation may signal an invalid exception (IE). The
results for cases in which one or more input values are NaN are listed in Table 5-21. If the comparison raises an IE,
the sign select control (imm8[3:2]) has no effect on the range operation output; this is also indicated in Table 5-21.
When both input values are zeros of opposite sign, the MIN/MAX comparison performed by the range operation
differs slightly from the conceptually similar floating-point MIN/MAX operation found in the instructions
VMAXPD/VMINPD. The details of the MIN/MAX/MIN_ABS/MAX_ABS operation for VRANGEPD/PS/SD/SS for
magnitude-0, opposite-signed input cases are listed in Table 5-22.
Additionally, non-zero input values of equal magnitude and opposite sign produce the MIN_ABS or MAX_ABS
comparison results listed in Table 5-23.
Operation
RangeSP(SRC1[31:0], SRC2[31:0], CmpOpCtl[1:0], SignSelCtl[1:0])
{
// Check if SNAN and report IE, see also Table 5-21
IF (SRC1=SNAN) THEN RETURN (QNAN(SRC1), set IE);
IF (SRC2=SNAN) THEN RETURN (QNAN(SRC2), set IE);
Src1.exp := SRC1[30:23];
Src1.fraction := SRC1[22:0];
IF ((Src1.exp = 0 ) and (Src1.fraction != 0 )) THEN// Src1 is a denormal number
IF DAZ THEN Src1.fraction := 0;
ELSE IF (SRC2 <> QNAN) Set DE; FI;
FI;
Src2.exp := SRC2[30:23];
Src2.fraction := SRC2[22:0];
IF ((Src2.exp = 0 ) and (Src2.fraction != 0 )) THEN// Src2 is a denormal number
IF DAZ THEN Src2.fraction := 0;
ELSE IF (SRC1 <> QNAN) Set DE; FI;
FI;
CmpOpCtl[1:0]= imm8[1:0];
SignSelCtl[1:0]=imm8[3:2];
VRANGESS
IF k1[0] OR *no writemask*
THEN DEST[31:0] := RangeSP (SRC1[31:0], SRC2[31:0], CmpOpCtl[1:0], SignSelCtl[1:0]);
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] = 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0
The following example describes a common usage of this instruction for checking that the input operand is
bounded between ±150.
VRANGESS xmm_dst, xmm_src, xmm_150, 02h;
Where:
xmm_dst is the destination operand.
xmm_src is the input operand to compare against ±150.
xmm_150 is the reference operand, contains the value of 150.
IMM=02(imm8[1:0]=’10) selects the Min Absolute value operation with selection of src1.sign.
In case |xmm_src| < 150, then its value will be written into xmm_dst. Otherwise, the value stored in xmm_dst
will get the value of 150 (received on xmm_150).
However, the sign control (imm8[3:2]=’00) instructs to select the sign of SRC1 received from xmm_src. So, even
in the case of |xmm_src| ≥ 150, the selected sign of SRC1 is kept.
Thus, if xmm_src < -150, the result of VRANGESS will be the minimal value of -150 while if xmm_src > +150,
the result of VRANGE will be the maximal value of +150.
SIMD Floating-Point Exceptions
Invalid, Denormal.
Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”
VRCP14PD—Compute Approximate Reciprocals of Packed Float64 Values
Opcode/ Op / 64/32 CPUID Feature Description
Instruction En bit Mode Flag
Support
EVEX.128.66.0F38.W1 4C /r A V/V (AVX512VL AND Computes the approximate reciprocals of the packed
VRCP14PD xmm1 {k1}{z}, AVX512F) OR double precision floating-point values in
xmm2/m128/m64bcst AVX10.11 xmm2/m128/m64bcst and stores the results in xmm1.
Under writemask.
EVEX.256.66.0F38.W1 4C /r A V/V (AVX512VL AND Computes the approximate reciprocals of the packed
VRCP14PD ymm1 {k1}{z}, AVX512F) OR double precision floating-point values in
ymm2/m256/m64bcst AVX10.11 ymm2/m256/m64bcst and stores the results in ymm1.
Under writemask.
EVEX.512.66.0F38.W1 4C /r A V/V AVX512F Computes the approximate reciprocals of the packed
VRCP14PD zmm1 {k1}{z}, OR AVX10.11 double precision floating-point values in
zmm2/m512/m64bcst zmm2/m512/m64bcst and stores the results in zmm1.
Under writemask.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a SIMD computation of the approximate reciprocals of eight/four/two packed double
precision floating-point values in the source operand (the second operand) and stores the packed double precision
floating-point results in the destination operand. The maximum relative error for this approximation is less than
2^-14.
The source operand can be a ZMM register, a 512-bit memory location, or a 512-bit vector broadcasted from a 64-
bit memory location. The destination operand is a ZMM register conditionally updated according to the writemask.
The VRCP14PD instruction is not affected by the rounding control bits in the MXCSR register. When a source value
is a 0.0, an ∞ with the sign of the source value is returned. A denormal source value will be treated as zero only in
case of DAZ bit set in MXCSR. Otherwise it is treated correctly (i.e., not as a 0.0). Underflow results are flushed to
zero only in case of FTZ bit set in MXCSR. Otherwise it will be treated correctly (i.e., correct underflow result is
written) with the sign of the operand. When a source value is a SNaN or QNaN, the SNaN is converted to a QNaN
or the source QNaN is returned.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
Operation
VRCP14PD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN DEST[i+63:i] := APPROXIMATE(1.0/SRC[63:0]);
ELSE DEST[i+63:i] := APPROXIMATE(1.0/SRC[i+63:i]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
SIMD Floating-Point Exceptions
None.
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
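Because the approximation is accurate to about 2^-14, a common follow-up is one Newton-Raphson step,
x1 = x0*(2 - a*x0), which roughly doubles the number of correct bits. A brief sketch assuming AVX512F (the
wrapper name is an assumption for the example):
#include <immintrin.h>
/* Approximate 1/a with VRCP14PD, then refine with one Newton-Raphson iteration. */
static __m512d recip_refined_pd(__m512d a)
{
    __m512d x0 = _mm512_rcp14_pd(a);
    __m512d t  = _mm512_fnmadd_pd(a, x0, _mm512_set1_pd(2.0));   /* t = 2 - a*x0 */
    return _mm512_mul_pd(x0, t);
}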
VRCP14PS—Compute Approximate Reciprocals of Packed Float32 Values
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a SIMD computation of the approximate reciprocals of the packed single precision
floating-point values in the source operand (the second operand) and stores the packed single precision floating-
point results in the destination operand (the first operand). The maximum relative error for this approximation is
less than 2^-14.
The source operand can be a ZMM register, a 512-bit memory location or a 512-bit vector broadcasted from a 32-
bit memory location. The destination operand is a ZMM register conditionally updated according to the writemask.
The VRCP14PS instruction is not affected by the rounding control bits in the MXCSR register. When a source value
is a 0.0, an ∞ with the sign of the source value is returned. A denormal source value will be treated as zero only in
case of DAZ bit set in MXCSR. Otherwise it is treated correctly (i.e., not as a 0.0). Underflow results are flushed to
zero only in case of FTZ bit set in MXCSR. Otherwise it will be treated correctly (i.e., correct underflow result is
written) with the sign of the operand. When a source value is a SNaN or QNaN, the SNaN is converted to a QNaN
or the source QNaN is returned.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
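The two writemask behaviors described above can be exercised from C through the corresponding intrinsics; the fragment below is a minimal sketch assuming an AVX-512F target, with illustrative function and variable names:
#include <immintrin.h>
/* Merging-masking keeps unselected elements from `src`;
 * zeroing-masking ({z}) clears them. */
static void rcp14_masking_demo(__m512 a, __m512 src, __mmask16 k,
                               __m512 *merged, __m512 *zeroed)
{
    *merged = _mm512_mask_rcp14_ps(src, k, a);   /* k1 merging-masking   */
    *zeroed = _mm512_maskz_rcp14_ps(k, a);       /* k1 {z} zeroing-masking */
}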
Operation
VRCP14PS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN DEST[i+31:i] := APPROXIMATE(1.0/SRC[31:0]);
ELSE DEST[i+31:i] := APPROXIMATE(1.0/SRC[i+31:i]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
SIMD Floating-Point Exceptions
None.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a SIMD computation of the approximate reciprocal of the low double precision floating-
point value in the second source operand (the third operand) and stores the result in the low quadword element of the
destination operand (the first operand) according to the writemask k1. Bits (127:64) of the XMM register destina-
tion are copied from corresponding bits in the first source operand (the second operand). The maximum relative
error for this approximation is less than 2^-14. The source operand can be an XMM register or a 64-bit memory loca-
tion. The destination operand is an XMM register.
The VRCP14SD instruction is not affected by the rounding control bits in the MXCSR register. When a source value
is a 0.0, an ∞ with the sign of the source value is returned. A denormal source value will be treated as zero only in
case of DAZ bit set in MXCSR. Otherwise it is treated correctly (i.e., not as a 0.0). Underflow results are flushed to
zero only in case of FTZ bit set in MXCSR. Otherwise it will be treated correctly (i.e., correct underflow result is
written) with the sign of the operand. When a source value is a SNaN or QNaN, the SNaN is converted to a QNaN
or the source QNaN is returned. See Table 5-24 for special-case input values.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
A numerically exact implementation of VRCP14xx can be found at:
https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-
vrsqrt14-vrcp28-vrsqrt28-vexp2.
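A brief usage sketch of the corresponding compiler intrinsic (assuming an AVX-512F target; the wrapper name is illustrative, not part of this reference):
#include <immintrin.h>
/* Low element of `b` is approximated; the upper element of the result
 * is taken from `a`, matching the description above. */
static __m128d rcp14_low(__m128d a, __m128d b)
{
    return _mm_rcp14_sd(a, b);   /* dest[63:0] ~ 1.0/b[63:0], dest[127:64] = a[127:64] */
}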
Operation
VRCP14SD (EVEX version)
IF k1[0] OR *no writemask*
THEN DEST[63:0] := APPROXIMATE(1.0/SRC2[63:0]);
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0
SIMD Floating-Point Exceptions
None.
Other Exceptions
See Table 2-53, “Type E5 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a SIMD computation of the approximate reciprocal of the low single precision floating-
point value in the second source operand (the third operand) and stores the result in the low doubleword element of
the destination operand (the first operand) according to the writemask k1. Bits (127:32) of the XMM register desti-
nation are copied from corresponding bits in the first source operand (the second operand). The maximum relative
error for this approximation is less than 2^-14. The source operand can be an XMM register or a 32-bit memory loca-
tion. The destination operand is an XMM register.
The VRCP14SS instruction is not affected by the rounding control bits in the MXCSR register. When a source value
is a 0.0, an ∞ with the sign of the source value is returned. A denormal source value will be treated as zero only in
case of DAZ bit set in MXCSR. Otherwise it is treated correctly (i.e., not as a 0.0). Underflow results are flushed to
zero only in case of FTZ bit set in MXCSR. Otherwise it will be treated correctly (i.e., correct underflow result is
written) with the sign of the operand. When a source value is a SNaN or QNaN, the SNaN is converted to a QNaN
or the source QNaN is returned. See Table 5-25 for special-case input values.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
A numerically exact implementation of VRCP14xx can be found at https://software.intel.com/en-us/articles/refer-
ence-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
Operation
VRCP14SS (EVEX version)
IF k1[0] OR *no writemask*
THEN DEST[31:0] := APPROXIMATE(1.0/SRC2[31:0]);
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0
SIMD Floating-Point Exceptions
None.
Other Exceptions
See Table 2-53, “Type E5 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a SIMD computation of the approximate reciprocals of 8/16/32 packed FP16 values in the
source operand (the second operand) and stores the packed FP16 results in the destination operand. The maximum
relative error for this approximation is less than 2^-11 + 2^-14.
For special cases, see Table 5-26.
Operation
VRCPPH dest{k1}, src
VL = 128, 256 or 512
KL := VL/16
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.fp16[0]
ELSE:
tsrc := src.fp16[i]
DEST.fp16[i] := APPROXIMATE(1.0 / tsrc)
ELSE IF *zeroing*:
DEST.fp16[i] := 0
//else DEST.fp16[i] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a SIMD computation of the approximate reciprocal of the low FP16 value in the second
source operand (the third operand) and stores the result in the low word element of the destination operand (the
first operand) according to the writemask k1. Bits 127:16 of the XMM register destination are copied from corre-
sponding bits in the first source operand (the second operand). The maximum relative error for this approximation
is less than 2−11 + 2−14.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
For special cases, see Table 5-26.
Operation
VRCPSH dest{k1}, src1, src2
IF k1[0] or *no writemask*:
DEST.fp16[0] := APPROXIMATE(1.0 / src2.fp16[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
//else DEST.fp16[0] remains unchanged
DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instruction, see Table 2-60, “Type E10 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Perform reduction transformation of the packed binary encoded double precision floating-point values in the source
operand (the second operand) and store the reduced results in binary floating-point format to the destination
operand (the first operand) under the writemask k1.
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary floating-
point source value, where M is an unsigned integer specified by imm8[7:4], see Figure 5-28. Specifically, the reduc-
tion transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^-M;
where “ROUND()” treats “src”, “2^M”, and their product as binary floating-point numbers with normalized signifi-
cand and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2,
where ‘man2’ is the normalized significand and ‘p’ is the unbiased exponent.
Then if RC = RNE: 0 <= |Reduced Result| <= 2^(p-M-1)
Then if RC ≠ RNE: 0 <= |Reduced Result| < 2^(p-M)
This instruction might end up with a precision exception set. However, in case of SPE set (i.e., Suppress Precision
Exception, which is imm8[3]=1), no precision exception is reported.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
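A scalar C model of the reduction transformation above may help make the formula concrete; it is illustrative only (the names and the choice M = 4 are arbitrary, not part of the instruction definition):
#include <fenv.h>
#include <math.h>
#include <stdio.h>
int main(void)
{
    const int M = 4;                       /* imm8[7:4]: leading fraction bits removed */
    double src = 3.14159265358979;
    fesetround(FE_TONEAREST);              /* RC = RNE */
    /* dest = src - ROUND(2^M * src) * 2^-M, with ROUND() rounding to an integer. */
    double reduced = src - ldexp(nearbyint(ldexp(src, M)), -M);
    printf("reduced = %.17g\n", reduced);  /* magnitude bounded as described above */
    return 0;
}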
Operation
ReduceArgumentDP(SRC[63:0], imm8[7:0])
{
// Check for NaN
IF (SRC [63:0] = NAN) THEN
RETURN (Convert SRC[63:0] to QNaN); FI;
M := imm8[7:4]; // Number of fraction bits of the normalized significand to be subtracted
RC := imm8[1:0];// Round Control for ROUND() operation
RC_source := imm8[2];
SPE := imm8[3];// Suppress Precision Exception
TMP[63:0] := 2^-M * {ROUND(2^M * SRC[63:0], SPE, RC_source, RC)}; // ROUND() treats SRC and 2^M as standard binary FP values
TMP[63:0] := SRC[63:0] – TMP[63:0]; // subtraction under the same RC,SPE controls
RETURN TMP[63:0]; // binary encoded FP with biased exponent and normalized significand
}
SIMD Floating-Point Exceptions
Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).
Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
Additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a reduction transformation of the packed binary encoded FP16 values in the source
operand (the second operand) and store the reduced results in binary FP format to the destination operand (the
first operand) under the writemask k1.
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary FP source
value, where M is an unsigned integer specified by imm8[7:4]. Specifically, the reduction transformation can be
expressed as:
dest = src - (ROUND(2^M * src)) * 2^-M
where ROUND() treats src, 2^M, and their product as binary FP numbers with normalized significand and biased
exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2, where ‘man2’ is the normal-
ized significand and ‘p’ is the unbiased exponent.
Then if RC = RNE: 0 ≤ |ReducedResult| ≤ 2^(-M-1).
Then if RC ≠ RNE: 0 ≤ |ReducedResult| < 2^(-M).
This instruction might end up with a precision exception set. However, in case of SPE set (i.e., Suppress Precision
Exception, which is imm8[3]=1), no precision exception is reported.
This instruction may generate tiny non-zero result. If it does so, it does not report underflow exception, even if
underflow exceptions are unmasked (UM flag in MXCSR register is 0).
For special cases, see Table 5-28.
Operation
def reduce_fp16(src, imm8):
nan := (src.exp = 0x1F) and (src.fraction != 0)
if nan:
return QNAN(src)
m := imm8[7:4]
rc := imm8[1:0]
rc_source := imm8[2]
spe := imm8[3] // suppress precision exception
tmp := 2^(-m) * ROUND(2^m * src, spe, rc_source, rc)
tmp := src - tmp // using same RC, SPE controls
return tmp
VREDUCEPH dest{k1}, src, imm8
VL = 128, 256 or 512
KL := VL/16
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.fp16[0]
ELSE:
tsrc := src.fp16[i]
DEST.fp16[i] := reduce_fp16(tsrc, imm8)
ELSE IF *zeroing*:
DEST.fp16[i] := 0
//else DEST.fp16[i] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Perform reduction transformation of the packed binary encoded single precision floating-point values in the source
operand (the second operand) and store the reduced results in binary floating-point format to the destination
operand (the first operand) under the writemask k1.
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary floating-
point source value, where M is an unsigned integer specified by imm8[7:4], see Figure 5-28. Specifically, the reduc-
tion transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^-M;
where “ROUND()” treats “src”, “2^M”, and their product as binary floating-point numbers with normalized signifi-
cand and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2,
where ‘man2’ is the normalized significand and ‘p’ is the unbiased exponent.
Then if RC = RNE: 0 <= |Reduced Result| <= 2^(p-M-1)
Then if RC ≠ RNE: 0 <= |Reduced Result| < 2^(p-M)
This instruction might end up with a precision exception set. However, in case of SPE set (i.e., Suppress Precision
Exception, which is imm8[3]=1), no precision exception is reported.
EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Handling of special-case input values is listed in Table 5-27.
Operation
VREDUCEPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b == 1) AND (SRC *is memory*)
THEN DEST[i+31:i] := ReduceArgumentSP(SRC[31:0], imm8[7:0]);
ELSE DEST[i+31:i] := ReduceArgumentSP(SRC[i+31:i], imm8[7:0]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
SIMD Floating-Point Exceptions
Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).
Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions”; additionally:
#UD If EVEX.vvvv != 1111B.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Perform a reduction transformation of the binary encoded double precision floating-point value in the low qword
element of the second source operand (the third operand) and store the reduced result in binary floating-point
format to the low qword element of the destination operand (the first operand) under the writemask k1. Bits
127:64 of the destination operand are copied from respective qword elements of the first source operand (the
second operand).
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary floating-
point source value, where M is an unsigned integer specified by imm8[7:4], see Figure 5-28. Specifically, the reduc-
tion transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^-M;
where “ROUND()” treats “src”, “2^M”, and their product as binary floating-point numbers with normalized signifi-
cand and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2,
where ‘man2’ is the normalized significand and ‘p’ is the unbiased exponent.
Then if RC = RNE: 0 <= |Reduced Result| <= 2^(p-M-1)
Then if RC ≠ RNE: 0 <= |Reduced Result| < 2^(p-M)
This instruction might end up with a precision exception set. However, in case of SPE set (i.e., Suppress Precision
Exception, which is imm8[3]=1), no precision exception is reported.
The operation is write masked.
Handling of special-case input values is listed in Table 5-27.
Operation
VREDUCESD (EVEX encoded version)
IF k1[0] or *no writemask*
THEN DEST[63:0] := ReduceArgumentDP(SRC2[63:0], imm8[7:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0
Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a reduction transformation of the low binary encoded FP16 value in the source operand
(the second operand) and store the reduced result in binary FP format to the low element of the destination
operand (the first operand) under the writemask k1. For further details see the description of VREDUCEPH.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
This instruction might end up with a precision exception set. However, in case of SPE set (i.e., Suppress Precision
Exception, which is imm8[3]=1), no precision exception is reported.
This instruction may generate tiny non-zero result. If it does so, it does not report underflow exception, even if
underflow exceptions are unmasked (UM flag in MXCSR register is 0).
For special cases, see Table 5-28.
Operation
VREDUCESH dest{k1}, src, imm8
IF k1[0] or *no writemask*:
dest.fp16[0] := reduce_fp16(src2.fp16[0], imm8) // see VREDUCEPH
ELSE IF *zeroing*:
dest.fp16[0] := 0
//else dest.fp16[0] remains unchanged
DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Perform a reduction transformation of the binary encoded single precision floating-point value in the low dword
element of the second source operand (the third operand) and store the reduced result in binary floating-point
format to the low dword element of the destination operand (the first operand) under the writemask k1. Bits
127:32 of the destination operand are copied from respective dword elements of the first source operand (the
second operand).
The reduction transformation subtracts the integer part and the leading M fractional bits from the binary floating-
point source value, where M is an unsigned integer specified by imm8[7:4], see Figure 5-28. Specifically, the reduc-
tion transformation can be expressed as:
dest = src - (ROUND(2^M * src)) * 2^-M;
where “ROUND()” treats “src”, “2^M”, and their product as binary floating-point numbers with normalized signifi-
cand and biased exponents.
The magnitude of the reduced result can be expressed by considering src = 2^p * man2,
where ‘man2’ is the normalized significand and ‘p’ is the unbiased exponent.
Then if RC = RNE: 0 <= |Reduced Result| <= 2^(p-M-1)
Then if RC ≠ RNE: 0 <= |Reduced Result| < 2^(p-M)
This instruction might end up with a precision exception set. However, in case of SPE set (i.e., Suppress Precision
Exception, which is imm8[3]=1), no precision exception is reported.
Handling of special-case input values is listed in Table 5-27.
Operation
VREDUCESS (EVEX encoded version)
IF k1[0] or *no writemask*
THEN DEST[31:0] := ReduceArgumentSP(SRC2[31:0], imm8[7:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0
SIMD Floating-Point Exceptions
Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).
Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Rounds the double precision floating-point values in the source operand by the rounding mode specified in the
immediate operand (see Figure 5-29) and places the result in the destination operand.
The destination operand (the first operand) is a ZMM/YMM/XMM register conditionally updated according to the
writemask. The source operand (the second operand) can be a ZMM/YMM/XMM register, a 512/256/128-bit
memory location, or a 512/256/128-bit vector broadcasted from a 64-bit memory location.
The rounding process rounds the input to an integral value, plus the number of fraction bits specified by
imm8[7:4] (to be included in the result), and returns the result as a double precision floating-point value.
Note that no overflow is induced while executing this instruction (although the source is scaled by
the imm8[7:4] value).
The immediate operand also specifies control fields for the rounding operation; three bit fields are defined and
shown in the “Immediate Control Description” figure below. Bit 3 of the immediate byte controls the processor
behavior for a precision exception, bit 2 selects the source of rounding-mode control, and bits 1:0 specify a non-sticky
rounding-mode value (the immediate control table below lists the encoded values for the rounding-mode field).
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an
SNaN then it will be converted to a QNaN. If DAZ is set to ‘1’ then denormals will be converted to zero before
rounding.
The sign of the result of this instruction is preserved, including the sign of zero.
The formula of the operation on each data element for VRNDSCALEPD is
ROUND(x) = 2^-M * Round_to_INT(x * 2^M, round_ctrl),
round_ctrl = imm[3:0];
M = imm[7:4];
The operation of x * 2^M is computed as if the exponent range is unlimited (i.e., no overflow ever occurs).
VRNDSCALEPD—Round Packed Float64 Values to Include a Given Number of Fraction Bits Vol. 2C 5-675
VRNDSCALEPD is a more general form of the VEX-encoded VROUNDPD instruction. In VROUNDPD, the formula of
the operation on each element is
ROUND(x) = Round_to_INT(x, round_ctrl),
round_ctrl = imm[3:0];
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
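A hedged usage sketch of this formula through the compiler intrinsic for VRNDSCALEPD (assuming an AVX-512F target; the imm8 value and wrapper name are only an example):
#include <immintrin.h>
/* imm8 = (M << 4) | (SPE << 3) | (RS << 2) | RC
 * Here: M = 3 fraction bits kept, SPE = 1 (suppress precision exception),
 * RS = 0 (use imm8[1:0]), RC = 00b (round to nearest even) -> 0x38. */
static __m512d roundscale_3_bits(__m512d a)
{
    return _mm512_roundscale_pd(a, 0x38);
}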
[Figure: Immediate Control Description for imm8 (bits 7:4 = M, bit 3 = SPE, bit 2 = RS, bits 1:0 = RC)]
Operation
RoundToIntegerDP(SRC[63:0], imm8[7:0]) {
if (imm8[2] = 1)
rounding_direction := MXCSR:RC ; get round control from MXCSR
else
rounding_direction := imm8[1:0] ; get round control from imm8[1:0]
FI
M := imm8[7:4] ; get the scaling factor
case (rounding_direction)
00: TMP[63:0] := round_to_nearest_even_integer(2^M*SRC[63:0])
01: TMP[63:0] := round_to_equal_or_smaller_integer(2^M*SRC[63:0])
10: TMP[63:0] := round_to_equal_or_larger_integer(2^M*SRC[63:0])
11: TMP[63:0] := round_to_nearest_smallest_magnitude_integer(2^M*SRC[63:0])
ESAC
Dest[63:0] := 2^-M * TMP[63:0] ; scale down back by 2^-M
return(Dest[63:0])
}
VRNDSCALEPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask*
THEN DEST[i+63:i] := RoundToIntegerDP(TMP_SRC[i+63:i], imm8[7:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
SIMD Floating-Point Exceptions
Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).
Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
VRNDSCALEPH—Round Packed FP16 Values to Include a Given Number of Fraction Bits
• Opcode: EVEX.128.NP.0F3A.W0 08 /r /ib
  Instruction: VRNDSCALEPH xmm1{k1}{z}, xmm2/m128/m16bcst, imm8
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1¹
  Description: Round packed FP16 values in xmm2/m128/m16bcst to a number of fraction bits specified by the imm8 field. Store the result in xmm1 subject to writemask k1.
• Opcode: EVEX.256.NP.0F3A.W0 08 /r /ib
  Instruction: VRNDSCALEPH ymm1{k1}{z}, ymm2/m256/m16bcst, imm8
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1¹
  Description: Round packed FP16 values in ymm2/m256/m16bcst to a number of fraction bits specified by the imm8 field. Store the result in ymm1 subject to writemask k1.
• Opcode: EVEX.512.NP.0F3A.W0 08 /r /ib
  Instruction: VRNDSCALEPH zmm1{k1}{z}, zmm2/m512/m16bcst {sae}, imm8
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: AVX512-FP16 OR AVX10.1¹
  Description: Round packed FP16 values in zmm2/m512/m16bcst to a number of fraction bits specified by the imm8 field. Store the result in zmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction rounds the FP16 values in the source operand by the rounding mode specified in the immediate
operand (see Table 5-30) and places the result in the destination operand. The destination operand is conditionally
updated according to the writemask.
The rounding process rounds the input to an integral value, plus the number of fraction bits specified by
imm8[7:4] (to be included in the result), and returns the result as an FP16 value.
Note that no overflow is induced while executing this instruction (although the source is scaled by the imm8[7:4]
value).
The immediate operand also specifies control fields for the rounding operation. Three bit fields are defined and
shown in Table 5-30, “Imm8 Controls for VRNDSCALEPH/VRNDSCALESH.” Bit 3 of the immediate byte controls the
processor behavior for a precision exception, bit 2 selects the source of rounding mode control, and bits 1:0 specify
a non-sticky rounding-mode value.
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an
SNaN then it will be converted to a QNaN.
The sign of the result of this instruction is preserved, including the sign of zero. Special cases are described in Table
5-31.
The formula of the operation on each data element for VRNDSCALEPH is
ROUND(x) = 2^-M * Round_to_INT(x * 2^M, round_ctrl),
round_ctrl = imm[3:0];
M = imm[7:4];
The operation of x * 2^M is computed as if the exponent range is unlimited (i.e., no overflow ever occurs).
If this instruction encoding’s SPE bit (bit 3) in the immediate operand is 1, VRNDSCALEPH can set MXCSR.UE
without MXCSR.PE.
EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Table 5-30. Imm8 Controls for VRNDSCALEPH/VRNDSCALESH
Imm8 Bits Description
imm8[7:4] Number of fixed points to preserve.
imm8[3] Suppress Precision Exception (SPE)
0b00: Implies use of MXCSR exception mask.
0b01: Implies suppress.
imm8[2] Round Select (RS)
0b00: Implies use of imm8[1:0].
0b01: Implies use of MXCSR.
imm8[1:0] Round Control Override:
0b00: Round nearest even.
0b01: Round down.
0b10: Round up.
0b11: Truncate.
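The bit layout in Table 5-30 can be illustrated with a small C helper that unpacks an imm8 byte into its four fields; the struct and function names below are hypothetical and only mirror the table:
#include <stdint.h>
#include <stdio.h>
struct rndscale_imm8 {
    unsigned m;          /* imm8[7:4]: number of fraction bits to preserve */
    unsigned spe;        /* imm8[3]:   1 = suppress precision exception    */
    unsigned rs;         /* imm8[2]:   1 = take rounding mode from MXCSR   */
    unsigned rc;         /* imm8[1:0]: rounding-mode override              */
};
static struct rndscale_imm8 decode_imm8(uint8_t imm8)
{
    struct rndscale_imm8 f;
    f.m   = (imm8 >> 4) & 0xF;
    f.spe = (imm8 >> 3) & 0x1;
    f.rs  = (imm8 >> 2) & 0x1;
    f.rc  =  imm8       & 0x3;
    return f;
}
int main(void)
{
    struct rndscale_imm8 f = decode_imm8(0x38);   /* M=3, SPE=1, RS=0, RC=RNE */
    printf("M=%u SPE=%u RS=%u RC=%u\n", f.m, f.spe, f.rs, f.rc);
    return 0;
}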
Operation
def round_fp16_to_integer(src, imm8):
    if imm8[2] = 1:
        rounding_direction := MXCSR.RC
    else:
        rounding_direction := imm8[1:0]
    m := imm8[7:4] // scaling factor
    tsrc1 := 2^m * src
    if rounding_direction = 0b00:
        tmp := round_to_nearest_even_integer(tsrc1)
    else if rounding_direction = 0b01:
        tmp := round_to_equal_or_smaller_integer(tsrc1)
    else if rounding_direction = 0b10:
        tmp := round_to_equal_or_larger_integer(tsrc1)
    else if rounding_direction = 0b11:
        tmp := round_to_smallest_magnitude_integer(tsrc1)
    return 2^(-m) * tmp // scale back down by 2^-m
VRNDSCALEPH dest{k1}, src, imm8
VL = 128, 256 or 512
KL := VL/16
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.fp16[0]
ELSE:
tsrc := src.fp16[i]
DEST.fp16[i] := round_fp16_to_integer(tsrc, imm8)
ELSE IF *zeroing*:
DEST.fp16[i] := 0
//else DEST.fp16[i] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”
VRNDSCALEPS—Round Packed Float32 Values to Include a Given Number of Fraction Bits
• Opcode: EVEX.128.66.0F3A.W0 08 /r ib
  Instruction: VRNDSCALEPS xmm1 {k1}{z}, xmm2/m128/m32bcst, imm8
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹
  Description: Rounds packed single-precision floating-point values in xmm2/m128/m32bcst to a number of fraction bits specified by the imm8 field. Stores the result in xmm1 register. Under writemask.
• Opcode: EVEX.256.66.0F3A.W0 08 /r ib
  Instruction: VRNDSCALEPS ymm1 {k1}{z}, ymm2/m256/m32bcst, imm8
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹
  Description: Rounds packed single-precision floating-point values in ymm2/m256/m32bcst to a number of fraction bits specified by the imm8 field. Stores the result in ymm1 register. Under writemask.
• Opcode: EVEX.512.66.0F3A.W0 08 /r ib
  Instruction: VRNDSCALEPS zmm1 {k1}{z}, zmm2/m512/m32bcst{sae}, imm8
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1¹
  Description: Rounds packed single-precision floating-point values in zmm2/m512/m32bcst to a number of fraction bits specified by the imm8 field. Stores the result in zmm1 register using writemask.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Rounds the single precision floating-point values in the source operand by the rounding mode specified in the imme-
diate operand (see Figure 5-29) and places the result in the destination operand.
The destination operand (the first operand) is a ZMM register conditionally updated according to the writemask.
The source operand (the second operand) can be a ZMM register, a 512-bit memory location, or a 512-bit vector
broadcasted from a 32-bit memory location.
The rounding process rounds the input to an integral value, plus the number of fraction bits specified by
imm8[7:4] (to be included in the result), and returns the result as a single precision floating-point value.
Note that no overflow is induced while executing this instruction (although the source is scaled by
the imm8[7:4] value).
The immediate operand also specifies control fields for the rounding operation; three bit fields are defined and
shown in the “Immediate Control Description” figure below. Bit 3 of the immediate byte controls the processor
behavior for a precision exception, bit 2 selects the source of rounding-mode control, and bits 1:0 specify a non-sticky
rounding-mode value (the immediate control table below lists the encoded values for the rounding-mode field).
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an
SNaN then it will be converted to a QNaN. If DAZ is set to ‘1’ then denormals will be converted to zero before
rounding.
The sign of the result of this instruction is preserved, including the sign of zero.
VRNDSCALEPS is a more general form of the VEX-encoded VROUNDPS instruction. In VROUNDPS, the formula of
the operation on each element is
ROUND(x) = Round_to_INT(x, round_ctrl),
round_ctrl = imm[3:0];
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Handling of special-case input values is listed in Table 5-29.
Operation
RoundToIntegerSP(SRC[31:0], imm8[7:0]) {
if (imm8[2] = 1)
rounding_direction := MXCSR:RC ; get round control from MXCSR
else
rounding_direction := imm8[1:0] ; get round control from imm8[1:0]
FI
M := imm8[7:4] ; get the scaling factor
case (rounding_direction)
00: TMP[31:0] := round_to_nearest_even_integer(2^M*SRC[31:0])
01: TMP[31:0] := round_to_equal_or_smaller_integer(2^M*SRC[31:0])
10: TMP[31:0] := round_to_equal_or_larger_integer(2^M*SRC[31:0])
11: TMP[31:0] := round_to_nearest_smallest_magnitude_integer(2^M*SRC[31:0])
ESAC;
Dest[31:0] := 2^-M * TMP[31:0] ; scale down back by 2^-M
return(Dest[31:0])
}
VRNDSCALEPS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN DEST[i+31:i] := RoundToIntegerSP(TMP_SRC[i+31:i], imm8[7:0])
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VRNDSCALEPS __m512 _mm512_roundscale_ps( __m512 a, int imm);
VRNDSCALEPS __m512 _mm512_roundscale_round_ps( __m512 a, int imm, int sae);
VRNDSCALEPS __m512 _mm512_mask_roundscale_ps(__m512 s, __mmask16 k, __m512 a, int imm);
VRNDSCALEPS __m512 _mm512_mask_roundscale_round_ps(__m512 s, __mmask16 k, __m512 a, int imm, int sae);
VRNDSCALEPS __m512 _mm512_maskz_roundscale_ps( __mmask16 k, __m512 a, int imm);
VRNDSCALEPS __m512 _mm512_maskz_roundscale_round_ps( __mmask16 k, __m512 a, int imm, int sae);
VRNDSCALEPS __m256 _mm256_roundscale_ps( __m256 a, int imm);
VRNDSCALEPS __m256 _mm256_mask_roundscale_ps(__m256 s, __mmask8 k, __m256 a, int imm);
VRNDSCALEPS __m256 _mm256_maskz_roundscale_ps( __mmask8 k, __m256 a, int imm);
VRNDSCALEPS __m128 _mm_roundscale_ps( __m128 a, int imm);
VRNDSCALEPS __m128 _mm_mask_roundscale_ps(__m128 s, __mmask8 k, __m128 a, int imm);
VRNDSCALEPS __m128 _mm_maskz_roundscale_ps( __mmask8 k, __m128 a, int imm);
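A brief usage sketch for the 128-bit form listed above (assuming an AVX-512F/AVX-512VL target; the wrapper name is illustrative). The imm8 value 0x23 requests M = 2 fraction bits with truncation (RC = 11b), i.e., each element is truncated to a multiple of 0.25:
#include <immintrin.h>
static __m128 truncate_quarter_steps(__m128 a)
{
    /* imm8 = (2 << 4) | 0x3 = 0x23 */
    return _mm_roundscale_ps(a, 0x23);
}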
SIMD Floating-Point Exceptions
Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).
Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
VRNDSCALESD—Round Scalar Float64 Value to Include a Given Number of Fraction Bits
• Opcode: EVEX.LLIG.66.0F3A.W1 0B /r ib
  Instruction: VRNDSCALESD xmm1 {k1}{z}, xmm2, xmm3/m64{sae}, imm8
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1¹
  Description: Rounds scalar double precision floating-point value in xmm3/m64 to a number of fraction bits specified by the imm8 field. Stores the result in xmm1 register.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Rounds a double precision floating-point value in the low quadword element of the second source operand (the
third operand) by the rounding mode specified in the immediate operand (see Figure 5-29) and places the result in
the corresponding element of the destination operand (the first operand) according to the writemask. The quadword
element at bits 127:64 of the destination is copied from the first source operand (the second operand).
The destination and first source operands are XMM registers, the 2nd source operand can be an XMM register or
memory location. Bits MAXVL-1:128 of the destination register are cleared.
The rounding process rounds the input to an integral value, plus the number of fraction bits specified by
imm8[7:4] (to be included in the result), and returns the result as a double precision floating-point value.
Note that no overflow is induced while executing this instruction (although the source is scaled by
the imm8[7:4] value).
The immediate operand also specifies control fields for the rounding operation; three bit fields are defined and
shown in the “Immediate Control Description” figure below. Bit 3 of the immediate byte controls the processor
behavior for a precision exception, bit 2 selects the source of rounding-mode control, and bits 1:0 specify a non-sticky
rounding-mode value (the immediate control table below lists the encoded values for the rounding-mode field).
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an
SNaN then it will be converted to a QNaN. If DAZ is set to ‘1’ then denormals will be converted to zero before
rounding.
The sign of the result of this instruction is preserved, including the sign of zero.
EVEX encoded version: The source operand is an XMM register or a 64-bit memory location. The destination operand
is an XMM register.
Handling of special-case input values is listed in Table 5-29.
Operation
RoundToIntegerDP(SRC[63:0], imm8[7:0]) {
if (imm8[2] = 1)
rounding_direction := MXCSR:RC ; get round control from MXCSR
else
rounding_direction := imm8[1:0] ; get round control from imm8[1:0]
FI
M := imm8[7:4] ; get the scaling factor
case (rounding_direction)
00: TMP[63:0] := round_to_nearest_even_integer(2^M*SRC[63:0])
01: TMP[63:0] := round_to_equal_or_smaller_integer(2^M*SRC[63:0])
10: TMP[63:0] := round_to_equal_or_larger_integer(2^M*SRC[63:0])
11: TMP[63:0] := round_to_nearest_smallest_magnitude_integer(2^M*SRC[63:0])
ESAC
Dest[63:0] := 2^-M * TMP[63:0] ; scale down back by 2^-M
return(Dest[63:0])
}
SIMD Floating-Point Exceptions
Invalid, Precision.
If SPE is enabled, precision exception is not reported (regardless of MXCSR exception mask).
Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”
VRNDSCALESH—Round Scalar FP16 Value to Include a Given Number of Fraction Bits
• Opcode: EVEX.LLIG.NP.0F3A.W0 0A /r /ib
  Instruction: VRNDSCALESH xmm1{k1}{z}, xmm2, xmm3/m16 {sae}, imm8
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: AVX512-FP16 OR AVX10.1¹
  Description: Round the low FP16 value in xmm3/m16 to a number of fraction bits specified by the imm8 field. Store the result in xmm1 subject to writemask k1. Bits 127:16 from xmm2 are copied to xmm1[127:16].
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction rounds the low FP16 value in the second source operand by the rounding mode specified in the
immediate operand (see Table 5-30) and places the result in the destination operand.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
The rounding process rounds the input to an integral value, plus the number of fraction bits specified by
imm8[7:4] (to be included in the result), and returns the result as an FP16 value.
Note that no overflow is induced while executing this instruction (although the source is scaled by the imm8[7:4]
value).
The immediate operand also specifies control fields for the rounding operation. Three bit fields are defined and
shown in Table 5-30, “Imm8 Controls for VRNDSCALEPH/VRNDSCALESH.” Bit 3 of the immediate byte controls the
processor behavior for a precision exception, bit 2 selects the source of rounding mode control, and bits 1:0 specify
a non-sticky rounding-mode value.
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an
SNaN then it will be converted to a QNaN.
The sign of the result of this instruction is preserved, including the sign of zero. Special cases are described in Table
5-31.
If this instruction encoding’s SPE bit (bit 3) in the immediate operand is 1, VRNDSCALESH can set MXCSR.UE
without MXCSR.PE.
The formula of the operation on each data element for VRNDSCALESH is:
ROUND(x) = 2^-M * Round_to_INT(x * 2^M, round_ctrl),
round_ctrl = imm[3:0];
M = imm[7:4];
The operation of x * 2^M is computed as if the exponent range is unlimited (i.e., no overflow ever occurs).
Operation
VRNDSCALESH dest{k1}, src1, src2, imm8
IF k1[0] or *no writemask*:
DEST.fp16[0] := round_fp16_to_integer(src2.fp16[0], imm8) // see VRNDSCALEPH
ELSE IF *zeroing*:
DEST.fp16[0] := 0
//else DEST.fp16[0] remains unchanged
DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
VRNDSCALESS—Round Scalar Float32 Value to Include a Given Number of Fraction Bits
• Opcode: EVEX.LLIG.66.0F3A.W0 0A /r ib
  Instruction: VRNDSCALESS xmm1 {k1}{z}, xmm2, xmm3/m32{sae}, imm8
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1¹
  Description: Rounds scalar single-precision floating-point value in xmm3/m32 to a number of fraction bits specified by the imm8 field. Stores the result in xmm1 register under writemask.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Rounds the single precision floating-point value in the low doubleword element of the second source operand (the
third operand) by the rounding mode specified in the immediate operand (see Figure 5-29) and places the result in
the corresponding element of the destination operand (the first operand) according to the writemask. The double-
word elements at bits 127:32 of the destination are copied from the first source operand (the second operand).
The destination and first source operands are XMM registers, the 2nd source operand can be an XMM register or
memory location. Bits MAXVL-1:128 of the destination register are cleared.
The rounding process rounds the input to an integral value, plus the number of fraction bits specified by
imm8[7:4] (to be included in the result), and returns the result as a single precision floating-point value.
Note that no overflow is induced while executing this instruction (although the source is scaled by
the imm8[7:4] value).
The immediate operand also specifies control fields for the rounding operation; three bit fields are defined and
shown in the “Immediate Control Description” figure below. Bit 3 of the immediate byte controls the processor
behavior for a precision exception, bit 2 selects the source of rounding-mode control, and bits 1:0 specify a non-sticky
rounding-mode value (the immediate control table below lists the encoded values for the rounding-mode field).
The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an
SNaN then it will be converted to a QNaN. If DAZ is set to ‘1’ then denormals will be converted to zero before
rounding.
The sign of the result of this instruction is preserved, including the sign of zero.
EVEX encoded version: The source operand is an XMM register or a 32-bit memory location. The destination operand
is an XMM register.
Handling of special-case input values is listed in Table 5-29.
Operation
RoundToIntegerSP(SRC[31:0], imm8[7:0]) {
if (imm8[2] = 1)
rounding_direction := MXCSR:RC ; get round control from MXCSR
else
rounding_direction := imm8[1:0] ; get round control from imm8[1:0]
FI
M := imm8[7:4] ; get the scaling factor
case (rounding_direction)
00: TMP[31:0] := round_to_nearest_even_integer(2^M*SRC[31:0])
01: TMP[31:0] := round_to_equal_or_smaller_integer(2^M*SRC[31:0])
10: TMP[31:0] := round_to_equal_or_larger_integer(2^M*SRC[31:0])
11: TMP[31:0] := round_to_nearest_smallest_magnitude_integer(2^M*SRC[31:0])
ESAC;
Dest[31:0] := 2^-M * TMP[31:0] ; scale down back by 2^-M
return(Dest[31:0])
}
Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”
VRSQRT14PD—Compute Approximate Reciprocals of Square Roots of Packed Float64 Values
• Opcode: EVEX.128.66.0F38.W1 4E /r
  Instruction: VRSQRT14PD xmm1 {k1}{z}, xmm2/m128/m64bcst
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹
  Description: Computes the approximate reciprocal square roots of the packed double precision floating-point values in xmm2/m128/m64bcst and stores the results in xmm1. Under writemask.
• Opcode: EVEX.256.66.0F38.W1 4E /r
  Instruction: VRSQRT14PD ymm1 {k1}{z}, ymm2/m256/m64bcst
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹
  Description: Computes the approximate reciprocal square roots of the packed double precision floating-point values in ymm2/m256/m64bcst and stores the results in ymm1. Under writemask.
• Opcode: EVEX.512.66.0F38.W1 4E /r
  Instruction: VRSQRT14PD zmm1 {k1}{z}, zmm2/m512/m64bcst
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1¹
  Description: Computes the approximate reciprocal square roots of the packed double precision floating-point values in zmm2/m512/m64bcst and stores the results in zmm1 under writemask.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a SIMD computation of the approximate reciprocals of the square roots of the eight
packed double precision floating-point values in the source operand (the second operand) and stores the packed
double precision floating-point results in the destination operand (the first operand) according to the writemask.
The maximum relative error for this approximation is less than 2^-14.
EVEX.512 encoded version: The source operand can be a ZMM register, a 512-bit memory location, or a 512-bit
vector broadcasted from a 64-bit memory location. The destination operand is a ZMM register, conditionally
updated using writemask k1.
EVEX.256 encoded version: The source operand is a YMM register, a 256-bit memory location, or a 256-bit vector
broadcasted from a 64-bit memory location. The destination operand is a YMM register, conditionally updated using
writemask k1.
EVEX.128 encoded version: The source operand is an XMM register, a 128-bit memory location, or a 128-bit vector
broadcasted from a 64-bit memory location. The destination operand is an XMM register, conditionally updated using
writemask k1.
The VRSQRT14PD instruction is not affected by the rounding control bits in the MXCSR register. When a source
value is a 0.0, an ∞ with the sign of the source value is returned. When the source operand is a +∞ then +ZERO
value is returned. A denormal source value is treated as zero only if DAZ bit is set in MXCSR. Otherwise it is treated
correctly and performs the approximation with the specified masked response. When a source value is a negative
value (other than 0.0) a floating-point QNaN_indefinite is returned. When a source value is an SNaN or QNaN, the
SNaN is converted to a QNaN or the source QNaN is returned.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
A numerically exact implementation of VRSQRT14xx can be found at https://software.intel.com/en-us/arti-
cles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
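As an illustration (not part of the instruction definition) of refining the 2^-14 estimate, the following C sketch, which assumes an AVX-512F target and uses hypothetical function names, applies one Newton-Raphson step to the VRSQRT14PD result:
#include <immintrin.h>
/* Sketch: y1 = y0 * (1.5 - 0.5 * a * y0 * y0), the usual reciprocal
 * square-root refinement step, roughly doubling the accurate bits. */
static __m512d rsqrt_refined_pd(__m512d a)
{
    __m512d y    = _mm512_rsqrt14_pd(a);         /* initial estimate */
    __m512d half = _mm512_set1_pd(0.5);
    __m512d th   = _mm512_set1_pd(1.5);
    __m512d ayy  = _mm512_mul_pd(_mm512_mul_pd(a, y), y);
    return _mm512_mul_pd(y, _mm512_fnmadd_pd(half, ayy, th)); /* y*(1.5 - 0.5*a*y*y) */
}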
Operation
VRSQRT14PD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN DEST[i+63:i] := APPROXIMATE(1.0/ SQRT(SRC[63:0]));
ELSE DEST[i+63:i] := APPROXIMATE(1.0/ SQRT(SRC[i+63:i]));
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
SIMD Floating-Point Exceptions
None.
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
VRSQRT14PS—Compute Approximate Reciprocals of Square Roots of Packed Float32 Values
• Opcode: EVEX.128.66.0F38.W0 4E /r
  Instruction: VRSQRT14PS xmm1 {k1}{z}, xmm2/m128/m32bcst
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹
  Description: Computes the approximate reciprocal square roots of the packed single-precision floating-point values in xmm2/m128/m32bcst and stores the results in xmm1. Under writemask.
• Opcode: EVEX.256.66.0F38.W0 4E /r
  Instruction: VRSQRT14PS ymm1 {k1}{z}, ymm2/m256/m32bcst
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1¹
  Description: Computes the approximate reciprocal square roots of the packed single-precision floating-point values in ymm2/m256/m32bcst and stores the results in ymm1. Under writemask.
• Opcode: EVEX.512.66.0F38.W0 4E /r
  Instruction: VRSQRT14PS zmm1 {k1}{z}, zmm2/m512/m32bcst
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1¹
  Description: Computes the approximate reciprocal square roots of the packed single-precision floating-point values in zmm2/m512/m32bcst and stores the results in zmm1. Under writemask.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a SIMD computation of the approximate reciprocals of the square roots of 16 packed
single precision floating-point values in the source operand (the second operand) and stores the packed single
precision floating-point results in the destination operand (the first operand) according to the writemask. The
maximum relative error for this approximation is less than 2^-14.
EVEX.512 encoded version: The source operand can be a ZMM register, a 512-bit memory location or a 512-bit
vector broadcasted from a 32-bit memory location. The destination operand is a ZMM register, conditionally
updated using writemask k1.
EVEX.256 encoded version: The source operand is a YMM register, a 256-bit memory location, or a 256-bit vector
broadcasted from a 32-bit memory location. The destination operand is a YMM register, conditionally updated using
writemask k1.
EVEX.128 encoded version: The source operand is an XMM register, a 128-bit memory location, or a 128-bit vector
broadcasted from a 32-bit memory location. The destination operand is an XMM register, conditionally updated using
writemask k1.
The VRSQRT14PS instruction is not affected by the rounding control bits in the MXCSR register. When a source
value is a 0.0, an ∞ with the sign of the source value is returned. When the source operand is a +∞ then +ZERO
value is returned. A denormal source value is treated as zero only if DAZ bit is set in MXCSR. Otherwise it is treated
correctly and performs the approximation with the specified masked response. When a source value is a negative
value (other than 0.0) a floating-point QNaN_indefinite is returned. When a source value is an SNaN or QNaN, the
SNaN is converted to a QNaN or the source QNaN is returned.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
Note: EVEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
A numerically exact implementation of VRSQRT14xx can be found at https://software.intel.com/en-us/arti-
cles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
Operation
VRSQRT14PS (EVEX encoded versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC *is memory*)
THEN DEST[i+31:i] := APPROXIMATE(1.0/ SQRT(SRC[31:0]));
ELSE DEST[i+31:i] := APPROXIMATE(1.0/ SQRT(SRC[i+31:i]));
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+31:i] := 0
FI;
FI;
ENDFOR;
DEST[MAXVL-1:VL] := 0
SIMD Floating-Point Exceptions
None.
Other Exceptions
See Table 2-51, “Type E4 Class Exception Conditions.”
VRSQRT14SD—Compute Approximate Reciprocal of Square Root of Scalar Float64 Value
• Opcode: EVEX.LLIG.66.0F38.W1 4F /r
  Instruction: VRSQRT14SD xmm1 {k1}{z}, xmm2, xmm3/m64
  Op/En: A; 64/32 bit Mode Support: V/V; CPUID Feature Flag: AVX512F OR AVX10.1¹
  Description: Computes the approximate reciprocal square root of the scalar double precision floating-point value in xmm3/m64 and stores the result in the low quadword element of xmm1 using writemask k1. Bits[127:64] of xmm2 is copied to xmm1[127:64].
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Computes the approximate reciprocal of the square roots of the scalar double precision floating-point value in the
low quadword element of the source operand (the second operand) and stores the result in the low quadword
element of the destination operand (the first operand) according to the writemask. The maximum relative error for
this approximation is less than 2^-14. The source operand can be an XMM register or a 64-bit memory location. The
destination operand is an XMM register.
Bits (127:64) of the XMM register destination are copied from corresponding bits in the first source operand. Bits
(MAXVL-1:128) of the destination register are zeroed.
The VRSQRT14SD instruction is not affected by the rounding control bits in the MXCSR register. When a source
value is 0.0, an ∞ with the sign of the source value is returned. When the source operand is +∞, +ZERO is
returned. A denormal source value is treated as zero only if the DAZ bit is set in MXCSR; otherwise, it is treated
correctly and the approximation is performed with the specified masked response. When a source value is a
negative value (other than 0.0), a floating-point QNaN_indefinite is returned. When a source value is an SNaN or
QNaN, the SNaN is converted to a QNaN or the source QNaN is returned.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
A numerically exact implementation of VRSQRT14xx can be found at https://software.intel.com/en-us/arti-
cles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
Operation
VRSQRT14SD (EVEX version)
IF k1[0] or *no writemask*
THEN DEST[63:0] := APPROXIMATE(1.0/ SQRT(SRC2[63:0]))
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[63:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[63:0] := 0
FI;
FI;
DEST[127:64] := SRC1[127:64]
DEST[MAXVL-1:128] := 0
Table 5-34. VRSQRT14SD Special Cases
Input value       Result value     Comments
Any denormal      Normal           Cannot generate overflow
X = 2^-2n         2^n
SIMD Floating-Point Exceptions
None.
Other Exceptions
See Table 2-53, “Type E5 Class Exception Conditions.”
VRSQRT14SS—Compute Approximate Reciprocal of Square Root of Scalar Float32 Value
Opcode/Instruction: EVEX.LLIG.66.0F38.W0 4F /r
VRSQRT14SS xmm1 {k1}{z}, xmm2, xmm3/m32
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Description: Computes the approximate reciprocal square root of the scalar single-precision floating-point value in
xmm3/m32 and stores the result in the low doubleword element of xmm1 using writemask k1. Bits[127:32] of xmm2
are copied to xmm1[127:32].
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Computes the approximate reciprocal of the square root of the scalar single precision floating-point value in the
low doubleword element of the source operand (the second operand) and stores the result in the low doubleword
element of the destination operand (the first operand) according to the writemask. The maximum relative error for
this approximation is less than 2^-14. The source operand can be an XMM register or a 32-bit memory location. The
destination operand is an XMM register.
Bits (127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits
(MAXVL-1:128) of the destination register are zeroed.
The VRSQRT14SS instruction is not affected by the rounding control bits in the MXCSR register. When a source
value is 0.0, an ∞ with the sign of the source value is returned. When the source operand is an ∞, zero with the
sign of the source value is returned. A denormal source value is treated as zero only if the DAZ bit is set in MXCSR;
otherwise, it is treated correctly and the approximation is performed with the specified masked response. When a
source value is a negative value (other than 0.0), a floating-point indefinite is returned. When a source value is an
SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned.
MXCSR exception flags are not affected by this instruction and floating-point exceptions are not reported.
A numerically exact implementation of VRSQRT14xx can be found at https://software.intel.com/en-us/arti-
cles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2.
Operation
VRSQRT14SS (EVEX version)
IF k1[0] or *no writemask*
THEN DEST[31:0] := APPROXIMATE(1.0/ SQRT(SRC2[31:0]))
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[31:0] remains unchanged*
ELSE ; zeroing-masking
THEN DEST[31:0] := 0
FI;
FI;
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0
Table 5-35. VRSQRT14SS Special Cases
Input value       Result value     Comments
Any denormal      Normal           Cannot generate overflow
X = 2^-2n         2^n
SIMD Floating-Point Exceptions
None.
Other Exceptions
See Table 2-53, “Type E5 Class Exception Conditions.”
VRSQRTPH—Compute Reciprocals of Square Roots of Packed FP16 Values
Opcode/Instruction: EVEX.128.66.MAP6.W0 4E /r
VRSQRTPH xmm1{k1}{z}, xmm2/m128/m16bcst
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Description: Compute the approximate reciprocals of the square roots of packed FP16 values in xmm2/m128/m16bcst
and store the result in xmm1 subject to writemask k1.

Opcode/Instruction: EVEX.256.66.MAP6.W0 4E /r
VRSQRTPH ymm1{k1}{z}, ymm2/m256/m16bcst
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: (AVX512-FP16 AND AVX512VL) OR AVX10.1 (see note 1)
Description: Compute the approximate reciprocals of the square roots of packed FP16 values in ymm2/m256/m16bcst
and store the result in ymm1 subject to writemask k1.

Opcode/Instruction: EVEX.512.66.MAP6.W0 4E /r
VRSQRTPH zmm1{k1}{z}, zmm2/m512/m16bcst
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: AVX512-FP16 OR AVX10.1 (see note 1)
Description: Compute the approximate reciprocals of the square roots of packed FP16 values in zmm2/m512/m16bcst
and store the result in zmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a SIMD computation of the approximate reciprocal square roots of 8/16/32 packed FP16
floating-point values in the source operand (the second operand) and stores the packed FP16 floating-point results
in the destination operand.
The maximum relative error for this approximation is less than 2^-11 + 2^-14. For special cases, see Table 5-36.
The destination elements are updated according to the writemask.
Operation
VRSQRTPH dest{k1}, src
VL = 128, 256 or 512
KL := VL/16
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.fp16[0]
ELSE:
tsrc := src.fp16[i]
DEST.fp16[i] := APPROXIMATE(1.0 / SQRT(tsrc) )
ELSE IF *zeroing*:
DEST.fp16[i] := 0
//else DEST.fp16[i] remains unchanged
DEST[MAXVL-1:VL] := 0
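For illustration only (not part of the reference text above): a minimal C sketch of a merge-masked VRSQRTPH through
the _mm512_mask_rsqrt_ph intrinsic, assuming a toolchain with AVX512-FP16 intrinsic support (e.g., compiled with
-mavx512fp16); the function name is illustrative.

#include <immintrin.h>

/* Masked reciprocal square root of 32 packed FP16 values; elements whose
 * mask bit is 0 are taken unchanged from 'fallback' (merge-masking). */
__m512h masked_rsqrt_ph(__m512h fallback, __mmask32 k, __m512h src)
{
    return _mm512_mask_rsqrt_ph(fallback, k, src);
}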
Other Exceptions
EVEX-encoded instruction, see Table 2-51, “Type E4 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs the computation of the approximate reciprocal square-root of the low FP16 value in the
second source operand (the third operand) and stores the result in the low word element of the destination operand
(the first operand) according to the writemask k1.
The maximum relative error for this approximation is less than 2^-11 + 2^-14.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL−1:128 of the destination operand are zeroed.
For special cases, see Table 5-36.
Operation
VRSQRTSH dest{k1}, src1, src2
VL = 128, 256 or 512
KL := VL/16
Other Exceptions
EVEX-encoded instruction, see Table 2-60, “Type E10 Class Exception Conditions.”
VSCALEFPD—Scale Packed Float64 Values With Float64 Values
Opcode/Instruction: EVEX.128.66.0F38.W1 2C /r
VSCALEFPD xmm1 {k1}{z}, xmm2, xmm3/m128/m64bcst
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Description: Scale the packed double precision floating-point values in xmm2 using values from
xmm3/m128/m64bcst. Under writemask k1.

Opcode/Instruction: EVEX.256.66.0F38.W1 2C /r
VSCALEFPD ymm1 {k1}{z}, ymm2, ymm3/m256/m64bcst
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Description: Scale the packed double precision floating-point values in ymm2 using values from
ymm3/m256/m64bcst. Under writemask k1.

Opcode/Instruction: EVEX.512.66.0F38.W1 2C /r
VSCALEFPD zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst{er}
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Description: Scale the packed double precision floating-point values in zmm2 using values from
zmm3/m512/m64bcst. Under writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a floating-point scale of the packed double precision floating-point values in the first source operand by
multiplying them by 2 to the power of the double precision floating-point values in second source operand.
The equation of this operation is given by:
zmm1 := zmm2 * 2^floor(zmm3).
Floor(zmm3) means maximum integer value ≤ zmm3.
If the result cannot be represented in double precision, then the proper overflow response (for positive scaling
operand), or the proper underflow response (for negative scaling operand) is issued. The overflow and underflow
responses are dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in
MXCSR (exception mask bits, FTZ bit), and on the SAE bit.
The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register, a
512/256/128-bit memory location or a 512/256/128-bit vector broadcasted from a 64-bit memory location. The
destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.
Handling of special-case input values is listed in Table 5-37 and Table 5-38.
                  Src2
Src1              ±NaN         +INF                -INF                0/Denorm/Norm     Set IE
±QNaN             QNaN(Src1)   +INF                +0                  QNaN(Src1)        IF either source is SNaN
±SNaN             QNaN(Src1)   QNaN(Src1)          QNaN(Src1)          QNaN(Src1)        YES
±INF              QNaN(Src2)   Src1                QNaN_Indefinite     Src1              IF Src2 is SNaN or -INF
±0                QNaN(Src2)   QNaN_Indefinite     Src1                Src1              IF Src2 is SNaN or +INF
Denorm/Norm       QNaN(Src2)   ±INF (Src1 sign)    ±0 (Src1 sign)      Compute Result    IF Src2 is SNaN
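For illustration only (not part of the reference text above): VSCALEFPD behaves like a vectorized ldexp() whose
exponent argument is itself a double precision value (floored before use). Below is a minimal C sketch using the
_mm512_scalef_pd intrinsic, assuming AVX-512F support; the function name is illustrative.

#include <immintrin.h>

/* Returns x * 2^floor(n) per element, with the IEEE special-case handling
 * shown in the table above. */
__m512d scale_by_pow2(__m512d x, __m512d n)
{
    return _mm512_scalef_pd(x, n);
}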
Operation
SCALE(SRC1, SRC2)
{
TMP_SRC2 := SRC2
TMP_SRC1 := SRC1
IF (SRC2 is denormal AND MXCSR.DAZ) THEN TMP_SRC2=0
IF (SRC1 is denormal AND MXCSR.DAZ) THEN TMP_SRC1=0
/* SRC2 is a 64 bits floating-point value */
DEST[63:0] := TMP_SRC1[63:0] * POW(2, Floor(TMP_SRC2[63:0]))
}
VSCALEFPD (EVEX encoded versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
IF (VL = 512) AND (EVEX.b = 1) AND (SRC2 *is register*)
THEN
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(EVEX.RC);
ELSE
SET_ROUNDING_MODE_FOR_THIS_INSTRUCTION(MXCSR.RC);
FI;
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b = 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := SCALE(SRC1[i+63:i], SRC2[63:0]);
ELSE DEST[i+63:i] := SCALE(SRC1[i+63:i], SRC2[i+63:i]);
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE ; zeroing-masking
DEST[i+63:i] := 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
See Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a floating-point scale of the packed FP16 values in the first source operand by multiplying
it by 2 to the power of the FP16 values in second source operand. The destination elements are updated according
to the writemask.
The equation of this operation is given by:
zmm1 := zmm2 * 2^floor(zmm3).
Floor(zmm3) means maximum integer value ≤ zmm3.
If the result cannot be represented in FP16, then the proper overflow response (for positive scaling operand), or
the proper underflow response (for negative scaling operand), is issued. The overflow and underflow responses are
dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in MXCSR (exception
mask bits), and on the SAE bit.
Handling of special-case input values is listed in Table 5-39 and Table 5-40.
                  Src2
Src1              ±NaN         +INF                −INF                0/Denorm/Norm     Set IE
±QNaN             QNaN(Src1)   +INF                +0                  QNaN(Src1)        IF either source is SNaN
±SNaN             QNaN(Src1)   QNaN(Src1)          QNaN(Src1)          QNaN(Src1)        YES
±INF              QNaN(Src2)   Src1                QNaN_Indefinite     Src1              IF Src2 is SNaN or −INF
±0                QNaN(Src2)   QNaN_Indefinite     Src1                Src1              IF Src2 is SNaN or +INF
Denorm/Norm       QNaN(Src2)   ±INF (Src1 sign)    ±0 (Src1 sign)      Compute Result    IF Src2 is SNaN
Operation
def scale_fp16(src1,src2):
tmp1 := src1
tmp2 := src2
return tmp1 * POW(2, FLOOR(tmp2))
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC2 is memory and (EVEX.b = 1):
tsrc := src2.fp16[0]
ELSE:
tsrc := src2.fp16[i]
dest.fp16[i] := scale_fp16(src1.fp16[i],tsrc)
ELSE IF *zeroing*:
dest.fp16[i] := 0
//else dest.fp16[i] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions”.
Denormal-operand exception (#D) is checked and signaled for src1 operand, but not for src2 operand. The
denormal-operand exception is checked for src1 operand only if the src2 operand is not NaN. If the src2 operand is
NaN, the processor generates NaN and does not signal denormal-operand exception, even if src1 operand is
denormal.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a floating-point scale of the packed single precision floating-point values in the first source operand by
multiplying them by 2 to the power of the float32 values in second source operand.
The equation of this operation is given by:
zmm1 := zmm2 * 2^floor(zmm3).
Floor(zmm3) means maximum integer value ≤ zmm3.
If the result cannot be represented in single precision, then the proper overflow response (for positive scaling
operand), or the proper underflow response (for negative scaling operand) is issued. The overflow and underflow
responses are dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in
MXCSR (exception mask bits, FTZ bit), and on the SAE bit.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand is a ZMM
register, a 512-bit memory location or a 512-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a ZMM register conditionally updated with writemask k1.
EVEX.256 encoded version: The first source operand is a YMM register. The second source operand is a YMM
register, a 256-bit memory location, or a 256-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a YMM register, conditionally updated using writemask k1.
EVEX.128 encoded version: The first source operand is an XMM register. The second source operand is a XMM
register, a 128-bit memory location, or a 128-bit vector broadcasted from a 32-bit memory location. The destina-
tion operand is a XMM register, conditionally updated using writemask k1.
Handling of special-case input values is listed in Table 5-37 and Table 5-41.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a floating-point scale of the scalar double precision floating-point value in the first source operand by
multiplying it by 2 to the power of the double precision floating-point value in second source operand.
The equation of this operation is given by:
xmm1 := xmm2 * 2^floor(xmm3).
Floor(xmm3) means maximum integer value ≤ xmm3.
If the result cannot be represented in double precision, then the proper overflow response (for positive scaling
operand), or the proper underflow response (for negative scaling operand) is issued. The overflow and underflow
responses are dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in
MXCSR (exception mask bits, FTZ bit), and on the SAE bit.
EVEX encoded version: The first source operand is an XMM register. The second source operand is an XMM register
or a memory location. The destination operand is an XMM register conditionally updated with writemask k1.
Handling of special-case input values is listed in Table 5-37 and Table 5-38.
Operation
SCALE(SRC1, SRC2)
{
; Check for denormal operands
TMP_SRC2 := SRC2
TMP_SRC1 := SRC1
IF (SRC2 is denormal AND MXCSR.DAZ) THEN TMP_SRC2=0
IF (SRC1 is denormal AND MXCSR.DAZ) THEN TMP_SRC1=0
/* SRC2 is a 64 bits floating-point value */
DEST[63:0] := TMP_SRC1[63:0] * POW(2, Floor(TMP_SRC2[63:0]))
}
Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a floating-point scale of the low FP16 element in the first source operand by multiplying it
by 2 to the power of the low FP16 element in second source operand, storing the result in the low element of the
destination operand.
Bits 127:16 of the destination operand are copied from the corresponding bits of the first source operand. Bits
MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the destination is updated according
to the writemask.
The equation of this operation is given by:
xmm1 := xmm2 * 2^floor(xmm3).
Floor(xmm3) means maximum integer value ≤ xmm3.
If the result cannot be represented in FP16, then the proper overflow response (for positive scaling operand), or
the proper underflow response (for negative scaling operand), is issued. The overflow and underflow responses are
dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in MXCSR (exception
mask bits, FTZ bit), and on the SAE bit.
Handling of special-case input values is listed in Table 5-39 and Table 5-40.
Operation
VSCALEFSH dest{k1}, src1, src2
IF (EVEX.b = 1) and no memory operand:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)
DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
Denormal-operand exception (#D) is checked and signaled for src1 operand, but not for src2 operand. The
denormal-operand exception is checked for src1 operand only if the src2 operand is not NaN. If the src2 operand is
NaN, the processor generates NaN and does not signal denormal-operand exception, even if src1 operand is
denormal.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a floating-point scale of the scalar single precision floating-point value in the first source operand by
multiplying it by 2 to the power of the float32 value in second source operand.
The equation of this operation is given by:
xmm1 := xmm2 * 2^floor(xmm3).
Floor(xmm3) means maximum integer value ≤ xmm3.
If the result cannot be represented in single precision, then the proper overflow response (for positive scaling
operand), or the proper underflow response (for negative scaling operand) is issued. The overflow and underflow
responses are dependent on the rounding mode (for IEEE-compliant rounding), as well as on other settings in
MXCSR (exception mask bits, FTZ bit), and on the SAE bit.
EVEX encoded version: The first source operand is an XMM register. The second source operand is an XMM register
or a memory location. The destination operand is an XMM register conditionally updated with writemask k1.
Handling of special-case input values is listed in Table 5-37 and Table 5-41.
Operation
SCALE(SRC1, SRC2)
{
; Check for denormal operands
TMP_SRC2 := SRC2
TMP_SRC1 := SRC1
IF (SRC2 is denormal AND MXCSR.DAZ) THEN TMP_SRC2=0
IF (SRC1 is denormal AND MXCSR.DAZ) THEN TMP_SRC1=0
/* SRC2 is a 32 bits floating-point value */
DEST[31:0] := TMP_SRC1[31:0] * POW(2, Floor(TMP_SRC2[31:0]))
}
Other Exceptions
See Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Stores up to four, eight, or 16 single precision elements (or two, four, or eight double precision elements) in double-
word/quadword vector xmm1, ymm1, or zmm1, to the memory locations pointed by base address BASE_ADDR
and index vector VINDEX, with scale SCALE. The elements are specified via the VSIB (i.e., the index register is a
vector register, holding packed indices). Elements will only be stored if their corresponding mask bit is one. The
entire mask register will be set to zero by this instruction unless it triggers an exception.
This instruction can be suspended by an exception if at least one element is already scattered (i.e., if the exception
is triggered by an element other than the rightmost one with its mask bit set). When this happens, the destination
register and the mask register (k1) are partially updated. If any traps or interrupts are pending from already scat-
tered elements, they will be delivered in lieu of the exception; in this case, EFLAGS.RF is set to one so an instruction
breakpoint is not re-triggered when the instruction is continued.
Note that:
• Only writes to overlapping vector indices are guaranteed to be ordered with respect to each other (from LSB to
MSB of the source registers). Note that this also includes partially overlapping vector indices. Writes that are not
overlapped may happen in any order. Memory ordering with other instructions follows the Intel-64 memory
ordering model. Note that this does not account for non-overlapping indices that map into the same physical
address locations.
• If two or more destination indices completely overlap, the “earlier” write(s) may be skipped.
• Faults are delivered in a right-to-left manner. That is, if a fault is triggered by an element and delivered, all
elements closer to the LSB of the source register xmm, ymm, or zmm will be completed (and non-faulting).
Individual elements closer to the MSB may or may not be completed. If a given element triggers multiple faults,
they are delivered in the conventional order.
• Elements may be scattered in any order, but faults must be delivered in a right-to-left order; thus, elements to
the left of a faulting one may be scattered before the fault is delivered. A given implementation of this
instruction is repeatable: given the same input values and architectural state, the same set of elements to the
left of the faulting one will be scattered.
• This instruction does not perform AC checks, and so will never deliver an AC fault.
• Not valid with 16-bit effective addresses. Will deliver a #UD fault.
• If this instruction overwrites itself and then takes a fault, only a subset of elements may be completed before
the fault is delivered (as described above). If the fault handler completes and attempts to re-execute this
instruction, the new instruction will be executed, and the scatter will not complete.
Note that the presence of the VSIB byte is enforced in this instruction. Hence, the instruction will #UD fault if
ModRM.rm is different from 100b.
This instruction has special disp8*N and alignment rules. N is considered to be the size of a single vector element.
The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit
mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are
ignored.
The instruction will #UD fault if the k0 mask register is specified.
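For illustration only (not part of the reference text above): a minimal C sketch of a masked scatter using the
_mm512_mask_i32scatter_ps intrinsic (listed further below), assuming AVX-512F support. The function name is
illustrative; the scale argument must be 1, 2, 4, or 8.

#include <immintrin.h>

/* VSCATTERDPS: for every j with k[j] == 1, store values[j] to
 * base[idx[j]], using a scale of 4 (sizeof(float)). */
void scatter_selected(float *base, __m512i idx, __m512 values, __mmask16 k)
{
    _mm512_mask_i32scatter_ps(base, k, idx, values, 4);
}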
Operation
BASE_ADDR stands for the memory operand base address (a GPR); may not exist
VINDEX stands for the memory operand vector of indices (a ZMM register)
SCALE stands for the memory operand scalar (1, 2, 4 or 8)
DISP is the optional 1 or 4 byte displacement
VSCATTERDPS (EVEX encoded versions)
(KL, VL)= (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask*
THEN MEM[BASE_ADDR +SignExtend(VINDEX[i+31:i]) * SCALE + DISP] :=
SRC[i+31:i]
k1[j] := 0
FI;
ENDFOR
k1[MAX_KL-1:KL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VSCATTERDPD void _mm512_i32scatter_pd(void * base, __m256i vdx, __m512d a, int scale);
VSCATTERDPD void _mm512_mask_i32scatter_pd(void * base, __mmask8 k, __m256i vdx, __m512d a, int scale);
VSCATTERDPS void _mm512_i32scatter_ps(void * base, __m512i vdx, __m512 a, int scale);
VSCATTERDPS void _mm512_mask_i32scatter_ps(void * base, __mmask16 k, __m512i vdx, __m512 a, int scale);
VSCATTERQPD void _mm512_i64scatter_pd(void * base, __m512i vdx, __m512d a, int scale);
VSCATTERQPD void _mm512_mask_i64scatter_pd(void * base, __mmask8 k, __m512i vdx, __m512d a, int scale);
VSCATTERQPS void _mm512_i64scatter_ps(void * base, __m512i vdx, __m256 a, int scale);
VSCATTERQPS void _mm512_mask_i64scatter_ps(void * base, __mmask8 k, __m512i vdx, __m256 a, int scale);
VSCATTERDPD void _mm256_i32scatter_pd(void * base, __m128i vdx, __m256d a, int scale);
VSCATTERDPD void _mm256_mask_i32scatter_pd(void * base, __mmask8 k, __m128i vdx, __m256d a, int scale);
VSCATTERDPS void _mm256_i32scatter_ps(void * base, __m256i vdx, __m256 a, int scale);
VSCATTERDPS void _mm256_mask_i32scatter_ps(void * base, __mmask8 k, __m256i vdx, __m256 a, int scale);
VSCATTERQPD void _mm256_i64scatter_pd(void * base, __m256i vdx, __m256d a, int scale);
VSCATTERQPD void _mm256_mask_i64scatter_pd(void * base, __mmask8 k, __m256i vdx, __m256d a, int scale);
VSCATTERQPS void _mm256_i64scatter_ps(void * base, __m256i vdx, __m128 a, int scale);
VSCATTERQPS void _mm256_mask_i64scatter_ps(void * base, __mmask8 k, __m256i vdx, __m128 a, int scale);
VSCATTERDPD void _mm_i32scatter_pd(void * base, __m128i vdx, __m128d a, int scale);
VSCATTERDPD void _mm_mask_i32scatter_pd(void * base, __mmask8 k, __m128i vdx, __m128d a, int scale);
VSCATTERDPS void _mm_i32scatter_ps(void * base, __m128i vdx, __m128 a, int scale);
VSCATTERDPS void _mm_mask_i32scatter_ps(void * base, __mmask8 k, __m128i vdx, __m128 a, int scale);
VSCATTERQPD void _mm_i64scatter_pd(void * base, __m128i vdx, __m128d a, int scale);
VSCATTERQPD void _mm_mask_i64scatter_pd(void * base, __mmask8 k, __m128i vdx, __m128d a, int scale);
VSCATTERQPS void _mm_i64scatter_ps(void * base, __m128i vdx, __m128 a, int scale);
VSCATTERQPS void _mm_mask_i64scatter_ps(void * base, __mmask8 k, __m128i vdx, __m128 a, int scale);
Other Exceptions
See Table 2-63, “Type E12 Class Exception Conditions.”
VSHUFF32x4/VSHUFF64x2/VSHUFI32x4/VSHUFI64x2—Shuffle Packed Values at 128-Bit
Granularity
Opcode/Instruction: EVEX.256.66.0F3A.W0 23 /r ib
VSHUFF32X4 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst, imm8
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Description: Shuffle 128-bit packed single-precision floating-point values selected by imm8 from ymm2 and
ymm3/m256/m32bcst and place results in ymm1 subject to writemask k1.

Opcode/Instruction: EVEX.512.66.0F3A.W0 23 /r ib
VSHUFF32x4 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst, imm8
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Description: Shuffle 128-bit packed single-precision floating-point values selected by imm8 from zmm2 and
zmm3/m512/m32bcst and place results in zmm1 subject to writemask k1.

Opcode/Instruction: EVEX.256.66.0F3A.W1 23 /r ib
VSHUFF64X2 ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst, imm8
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Description: Shuffle 128-bit packed double precision floating-point values selected by imm8 from ymm2 and
ymm3/m256/m64bcst and place results in ymm1 subject to writemask k1.

Opcode/Instruction: EVEX.512.66.0F3A.W1 23 /r ib
VSHUFF64x2 zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst, imm8
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Description: Shuffle 128-bit packed double precision floating-point values selected by imm8 from zmm2 and
zmm3/m512/m64bcst and place results in zmm1 subject to writemask k1.

Opcode/Instruction: EVEX.256.66.0F3A.W0 43 /r ib
VSHUFI32X4 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst, imm8
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Description: Shuffle 128-bit packed double-word values selected by imm8 from ymm2 and ymm3/m256/m32bcst and
place results in ymm1 subject to writemask k1.

Opcode/Instruction: EVEX.512.66.0F3A.W0 43 /r ib
VSHUFI32x4 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst, imm8
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Description: Shuffle 128-bit packed double-word values selected by imm8 from zmm2 and zmm3/m512/m32bcst and
place results in zmm1 subject to writemask k1.

Opcode/Instruction: EVEX.256.66.0F3A.W1 43 /r ib
VSHUFI64X2 ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst, imm8
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: (AVX512VL AND AVX512F) OR AVX10.1 (see note 1)
Description: Shuffle 128-bit packed quad-word values selected by imm8 from ymm2 and ymm3/m256/m64bcst and
place results in ymm1 subject to writemask k1.

Opcode/Instruction: EVEX.512.66.0F3A.W1 43 /r ib
VSHUFI64x2 zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst, imm8
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: AVX512F OR AVX10.1 (see note 1)
Description: Shuffle 128-bit packed quad-word values selected by imm8 from zmm2 and zmm3/m512/m64bcst and
place results in zmm1 subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
256-bit Version: Moves one of the two 128-bit packed single precision floating-point values from the first source
operand (second operand) into the low 128 bits of the destination operand (first operand); moves one of the two
packed 128-bit floating-point values from the second source operand (third operand) into the high 128 bits of the
destination operand. The selector operand (the imm8 byte) determines which values are moved to the destination
operand.
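For illustration only (not part of the reference text above): a minimal C sketch of the 512-bit form through the
_mm512_shuffle_f32x4 intrinsic, assuming AVX-512F support. Passing the same register for both sources with
imm8 = 0x4E (lane selectors 2, 3, 0, 1) swaps the two 256-bit halves of a vector; the function name is illustrative.

#include <immintrin.h>

__m512 swap_256bit_halves(__m512 a)
{
    /* Destination lanes 0..1 are selected from the first source (lanes 2, 3);
     * destination lanes 2..3 are selected from the second source (lanes 0, 1). */
    return _mm512_shuffle_f32x4(a, a, 0x4E);
}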
Operation
Select2(SRC, control) {
CASE (control[0]) OF
0: TMP := SRC[127:0];
1: TMP := SRC[255:128];
ESAC;
RETURN TMP
}
Select4(SRC, control) {
CASE (control[1:0]) OF
0: TMP := SRC[127:0];
1: TMP := SRC[255:128];
2: TMP := SRC[383:256];
3: TMP := SRC[511:384];
ESAC;
RETURN TMP
}
Other Exceptions
See Table 2-52, “Type E4NF Class Exception Conditions.”
Additionally:
#UD If EVEX.L’L = 0 for VSHUFF32x4/VSHUFF64x2.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a packed FP16 square-root computation on the values from source operand and stores
the packed FP16 result in the destination operand. The destination elements are updated according to the write-
mask.
Operation
VSQRTPH dest{k1}, src
VL = 128, 256 or 512
KL := VL/16
FOR i := 0 to KL-1:
IF k1[i] or *no writemask*:
IF SRC is memory and (EVEX.b = 1):
tsrc := src.fp16[0]
ELSE:
tsrc := src.fp16[i]
DEST.fp16[i] := SQRT(tsrc)
ELSE IF *zeroing*:
DEST.fp16[i] := 0
//else DEST.fp16[i] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction performs a scalar FP16 square-root computation on the source operand and stores the FP16 result
in the destination operand. Bits 127:16 of the destination operand are copied from the corresponding bits of the
first source operand. Bits MAXVL-1:128 of the destination operand are zeroed. The low FP16 element of the desti-
nation is updated according to the writemask.
Operation
VSQRTSH dest{k1}, src1, src2
IF k1[0] or *no writemask*:
DEST.fp16[0] := SQRT(src2.fp16[0])
ELSE IF *zeroing*:
DEST.fp16[0] := 0
//else DEST.fp16[0] remains unchanged
DEST[127:16] := src1[127:16]
DEST[MAXVL-1:128] := 0
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction subtracts the packed FP16 values of the second source operand from the corresponding elements
in the first source operand, storing the packed FP16 result in the destination operand. The destination elements are
updated according to the writemask.
Operation
VSUBPH (EVEX encoded versions) when src2 operand is a register
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
DEST.fp16[j] := SRC1.fp16[j] - SRC2.fp16[j]
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
VSUBPH (EVEX encoded versions) when src2 operand is a memory source
VL = 128, 256 or 512
KL := VL/16
FOR j := 0 TO KL-1:
IF k1[j] OR *no writemask*:
IF EVEX.b = 1:
DEST.fp16[j] := SRC1.fp16[j] - SRC2.fp16[0]
ELSE:
DEST.fp16[j] := SRC1.fp16[j] - SRC2.fp16[j]
ELSE IF *zeroing*:
DEST.fp16[j] := 0
// else dest.fp16[j] remains unchanged
DEST[MAXVL-1:VL] := 0
Other Exceptions
EVEX-encoded instruction, see Table 2-48, “Type E2 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction subtracts the low FP16 value of the second source operand from the corresponding value in the
first source operand, storing the FP16 result in the destination operand. Bits 127:16 of the destination operand are
copied from the corresponding bits of the first source operand. Bits MAXVL-1:128 of the destination operand are
zeroed. The low FP16 element of the destination is updated according to the writemask.
Operation
VSUBSH (EVEX encoded versions)
IF EVEX.b = 1 and SRC2 is a register:
SET_RM(EVEX.RC)
ELSE
SET_RM(MXCSR.RC)
Other Exceptions
EVEX-encoded instructions, see Table 2-49, “Type E3 Class Exception Conditions.”
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
This instruction compares the FP16 values in the low word of operand 1 (first operand) and operand 2 (second
operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than,
less than, or equal). The OF, SF and AF flags in the EFLAGS register are set to 0. The unordered result is returned
if either source operand is a NaN (QNaN or SNaN).
Operand 1 is an XMM register; operand 2 can be an XMM register or a 16-bit memory location.
The VUCOMISH instruction differs from the VCOMISH instruction in that it signals a SIMD floating-point invalid
operation exception (#I) only if a source operand is an SNaN. The VCOMISH instruction signals an invalid numeric
exception when a source operand is either a QNaN or an SNaN.
The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated. EVEX.vvvv is
reserved and must be 1111b; otherwise, the instruction will #UD.
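For illustration only (not part of the reference text above): a small C model of the EFLAGS mapping described here,
using plain floats to stand in for FP16 values. This is not an intrinsic; the names are illustrative.

#include <math.h>
#include <stdbool.h>

struct zpc_flags { bool zf, pf, cf; };

/* Mirrors the Operation section below: unordered -> 111, greater -> 000,
 * less -> 001, equal -> 100; OF, SF, and AF are always cleared. */
static struct zpc_flags ucomi_model(float a, float b)
{
    struct zpc_flags f;
    if (isnan(a) || isnan(b))  { f.zf = true;  f.pf = true;  f.cf = true;  }
    else if (a > b)            { f.zf = false; f.pf = false; f.cf = false; }
    else if (a < b)            { f.zf = false; f.pf = false; f.cf = true;  }
    else                       { f.zf = true;  f.pf = false; f.cf = false; }
    return f;
}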
Operation
VUCOMISH
RESULT := UnorderedCompare(SRC1.fp16[0],SRC2.fp16[0])
if RESULT is UNORDERED:
ZF, PF, CF := 1, 1, 1
else if RESULT is GREATER_THAN:
ZF, PF, CF := 0, 0, 0
else if RESULT is LESS_THAN:
ZF, PF, CF := 0, 0, 1
else: // RESULT is EQUALS
ZF, PF, CF := 1, 0, 0
OF, AF, SF := 0, 0, 0
SIMD Floating-Point Exceptions
Invalid, Denormal.
Other Exceptions
EVEX-encoded instructions, see Table 2-50, “Type E3NF Class Exception Conditions.”
9. Updates to Chapter 6, Volume 2D
Change bars and violet text show changes to Chapter 6 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 2D: Instruction Set Reference, W-Z.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Added the following instructions: WRMSRLIST and WRMSRNS.
• Added Intel® AVX10.1 information to the following instructions:
— XORPD
— XORPS
Description
This instruction writes a software-provided list of up to 64 MSRs with values loaded from memory.
WRMSRLIST takes three implied input operands:
• RSI: Linear address of a table of MSR addresses (8 bytes per address)1.
• RDI: Linear address of a table from which MSR data is loaded (8 bytes per MSR).
• RCX: 64-bit bitmask of valid bits for the MSRs. Bit 0 is the valid bit for entry 0 in each table, etc.
For each RCX bit [n] from 0 to 63, if RCX[n] is 1, WRMSRLIST will write the MSR specified at entry [n] in the RSI-
based table with the value read from memory at the entry [n] in the RDI-based table.
This implies a maximum of 64 MSRs that can be processed by this instruction. The processor will clear RCX[n] after
it finishes handling that MSR. Similar to repeated string operations, WRMSRLIST supports partial completion for
interrupts, exceptions, and traps. In these situations, the RIP register saved will point to the WRMSRLIST instruction
while the RCX register will have cleared bits corresponding to all completed iterations.
This instruction must be executed at privilege level 0; otherwise, a general protection exception #GP(0) is gener-
ated. This instruction performs MSR-specific checks in the same manner as WRMSR.
Like WRMSRNS (and unlike WRMSR), WRMSRLIST is not defined as a serializing instruction (see “Serializing
Instructions” in Chapter 9 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). This
means that software should not rely on WRMSRLIST to drain all buffered writes to memory before the next instruc-
tion is fetched and executed. For implementation reasons, some processors may serialize when writing certain
MSRs, even though that is not guaranteed.
Like WRMSR and WRMSRNS, WRMSRLIST ensures that all operations before WRMSRLIST do not use any new MSR
value and that all operations after WRMSRLIST do use the new values. An exception to this rule is certain store
related performance-monitor events that only count stores when they are drained to memory. Since WRMSRLIST
is not a serializing instruction, if software uses WRMSRLIST to change the controls for such performance-monitor
events, stores issued before WRMSRLIST may be counted based on the controls established by WRMSRLIST. Soft-
ware can insert the SERIALIZE instruction before the WRMSRLIST if so desired.
Those MSRs that cause a TLB invalidation when they are written via WRMSR (e.g., MTRRs) will also cause the same
TLB invalidation when written by WRMSRLIST.
In places where WRMSR is being used as a proxy for a serializing instruction, a different serializing instruction can
be used (e.g., SERIALIZE).
WRMSRLIST writes MSRs in order, which means the processor will ensure that an MSR in iteration “n” will be
written only after previous iterations (“n-1”). If the older MSR writes had a side effect that affects the behavior of
the next MSR, the processor will ensure that side effect is honored.
The processor is allowed (but not required) to “load ahead” in the list. The following are examples of things the
processor may do:
• Use an old memory type or TLB entry for loads or stores to memory containing the tables despite an MSR
written by a previous iteration changing MTRR or invalidating TLBs.
1. Since MSR addresses are only 32 bits wide, bits 63:32 of each MSR address table entry are reserved.
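For illustration only (not part of the reference text above): a ring-0 C sketch of setting up the three implied
operands. The wrmsrlist_asm helper is a hypothetical wrapper (for example, inline assembly on a toolchain that
recognizes the mnemonic) that loads RSI, RDI, and RCX as described and executes WRMSRLIST; its name and the
choice of MSRs are assumptions, not part of the specification.

#include <stdint.h>

/* Hypothetical helper: RSI := msr_addrs, RDI := msr_values, RCX := valid_mask,
 * then execute WRMSRLIST. */
extern void wrmsrlist_asm(const uint64_t *msr_addrs,
                          const uint64_t *msr_values,
                          uint64_t valid_mask);

static void write_two_msrs(uint32_t msr0, uint64_t val0,
                           uint32_t msr1, uint64_t val1)
{
    /* 8 bytes per entry; bits 63:32 of each address entry must be zero. */
    uint64_t addrs[64]  = { msr0, msr1 };
    uint64_t values[64] = { val0, val1 };
    wrmsrlist_asm(addrs, values, 0x3);   /* entries 0 and 1 are valid */
}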
Operation
DO WHILE RCX != 0
MSR_index := position of least significant bit set in RCX;
Load MSR_address_table_entry from 8 bytes at the linear address RSI + (MSR_index * 8);
IF MSR_address_table_entry[63:32] != 0 THEN #GP(0); FI;
MSR_address := MSR_address_table_entry[31:0];
Load MSR_data from 8 bytes at the linear address RDI + (MSR_index * 8);
IF WRMSR of MSR_data to the MSR with address MSR_address would #GP THEN #GP(0); FI;
Load the MSR with address MSR_address with MSR_data;
RCX[MSR_index] := 0;
Allow delivery of any pending interrupts or traps;
OD;
Flags Affected
None.
1. For example, the processor may take a page fault due to a linear address for the 10th entry in the MSR address table despite only
having completed the MSR writes up to entry 5.
Description
WRMSRNS is an instruction that behaves like WRMSR except that it is not a serializing instruction by default. It can
be executed only at privilege level 0 or in real-address mode; otherwise, a general protection exception #GP(0) is
generated.
The instruction writes the contents of registers EDX:EAX into the 64-bit model specific register (MSR) specified in
the ECX register. The contents of the EDX register are copied to the high-order 32 bits of the selected MSR and the
contents of the EAX register are copied to the low-order 32 bits of the MSR. The high-order 32 bits of RAX, RCX,
and RDX are ignored.
Unlike WRMSR, WRMSRNS is not defined as a serializing instruction (see “Serializing Instructions” in Chapter 9 of
the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). This means that software should
not rely on it to drain all buffered writes to memory before the next instruction is fetched and executed. For imple-
mentation reasons, some processors may serialize when writing certain MSRs, even though that is not guaranteed.
Like WRMSR, WRMSRNS will ensure that all operations before it do not use the new MSR value and that all opera-
tions after the WRMSRNS do use the new value. An exception to this rule is certain store related performance-
monitor events that only count stores when they are drained to memory. Since WRMSRNS is not a serializing
instruction, if software uses WRMSRNS to change the controls for such performance-monitor events, stores issued
before WRMSRNS may be counted based on the controls established by WRMSRNS. Software can insert the
SERIALIZE instruction before the WRMSRNS if so desired.
Those MSRs that cause a TLB invalidation when they are written via WRMSR (e.g., MTRRs) will also cause the same
TLB invalidation when written by WRMSRNS.
In order to improve performance, software may replace WRMSR with WRMSRNS. In places where WRMSR is being
used as a proxy for a serializing instruction, a different serializing instruction can be used (e.g., SERIALIZE).
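For illustration only (not part of the reference text above): a ring-0 C sketch of a WRMSRNS wrapper using
GCC/Clang-style inline assembly. It assumes the assembler recognizes the WRMSRNS mnemonic; on older toolchains
the instruction bytes would have to be emitted another way. The function name is illustrative.

#include <stdint.h>

static inline void wrmsrns(uint32_t msr, uint64_t value)
{
    uint32_t lo = (uint32_t)value;          /* EAX: low 32 bits of the MSR value  */
    uint32_t hi = (uint32_t)(value >> 32);  /* EDX: high 32 bits of the MSR value */
    __asm__ __volatile__("wrmsrns" : : "c"(msr), "a"(lo), "d"(hi) : "memory");
}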
Operation
MSR[ECX] := EDX:EAX;
Flags Affected
None.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vec-
tor width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a bitwise logical XOR of the two, four or eight packed double precision floating-point values from the first
source operand and the second source operand, and stores the result in the destination operand.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand can be a ZMM
register or a vector memory location. The destination operand is a ZMM register conditionally updated with write-
mask k1.
VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register. The second source operand
is a YMM register or a 256-bit memory location. The destination operand is a YMM register (conditionally updated
with writemask k1 in case of EVEX). The upper bits (MAXVL-1:256) of the corresponding ZMM register destination
are zeroed.
VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand
is an XMM register or 128-bit memory location. The destination operand is an XMM register (conditionally updated
with writemask k1 in case of EVEX). The upper bits (MAXVL-1:128) of the corresponding ZMM register destination
are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
register destination are unmodified.
Operation
VXORPD (EVEX Encoded Versions)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j := 0 TO KL-1
i := j * 64
IF k1[j] OR *no writemask* THEN
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN DEST[i+63:i] := SRC1[i+63:i] BITWISE XOR SRC2[63:0];
ELSE DEST[i+63:i] := SRC1[i+63:i] BITWISE XOR SRC2[i+63:i];
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+63:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+63:i] = 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Other Exceptions
Non-EVEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E4 Class Exception Conditions.”
XORPS—Bitwise Logical XOR of Packed Single Precision Floating-Point Values
Opcode/Instruction: NP 0F 57 /r
XORPS xmm1, xmm2/m128
Op/En: A    64/32 bit Mode Support: V/V    CPUID Feature Flag: SSE
Description: Return the bitwise logical XOR of packed single precision floating-point values in xmm1 and xmm2/mem.

Opcode/Instruction: VEX.128.0F.WIG 57 /r
VXORPS xmm1, xmm2, xmm3/m128
Op/En: B    64/32 bit Mode Support: V/V    CPUID Feature Flag: AVX
Description: Return the bitwise logical XOR of packed single precision floating-point values in xmm2 and xmm3/mem.

Opcode/Instruction: VEX.256.0F.WIG 57 /r
VXORPS ymm1, ymm2, ymm3/m256
Op/En: B    64/32 bit Mode Support: V/V    CPUID Feature Flag: AVX
Description: Return the bitwise logical XOR of packed single precision floating-point values in ymm2 and ymm3/mem.

Opcode/Instruction: EVEX.128.0F.W0 57 /r
VXORPS xmm1 {k1}{z}, xmm2, xmm3/m128/m32bcst
Op/En: C    64/32 bit Mode Support: V/V    CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see note 1)
Description: Return the bitwise logical XOR of packed single-precision floating-point values in xmm2 and
xmm3/m128/m32bcst subject to writemask k1.

Opcode/Instruction: EVEX.256.0F.W0 57 /r
VXORPS ymm1 {k1}{z}, ymm2, ymm3/m256/m32bcst
Op/En: C    64/32 bit Mode Support: V/V    CPUID Feature Flag: (AVX512VL AND AVX512DQ) OR AVX10.1 (see note 1)
Description: Return the bitwise logical XOR of packed single-precision floating-point values in ymm2 and
ymm3/m256/m32bcst subject to writemask k1.

Opcode/Instruction: EVEX.512.0F.W0 57 /r
VXORPS zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst
Op/En: C    64/32 bit Mode Support: V/V    CPUID Feature Flag: AVX512DQ OR AVX10.1 (see note 1)
Description: Return the bitwise logical XOR of packed single-precision floating-point values in zmm2 and
zmm3/m512/m32bcst subject to writemask k1.
NOTES:
1. For instructions with a CPUID feature flag specifying AVX10, the programmer must check the available vector options on the proces-
sor at run-time via CPUID Leaf 24H, the Intel AVX10 Converged Vector ISA Leaf. This leaf enumerates the maximum supported vector
width and as such will determine the set of instructions available to the programmer listed in the above opcode table.
Description
Performs a bitwise logical XOR of the four, eight or sixteen packed single precision floating-point values from the
first source operand and the second source operand, and stores the result in the destination operand.
EVEX.512 encoded version: The first source operand is a ZMM register. The second source operand can be a ZMM
register or a vector memory location. The destination operand is a ZMM register conditionally updated with write-
mask k1.
VEX.256 and EVEX.256 encoded versions: The first source operand is a YMM register. The second source operand
is a YMM register or a 256-bit memory location. The destination operand is a YMM register (conditionally updated
with writemask k1 in case of EVEX). The upper bits (MAXVL-1:256) of the corresponding ZMM register destination
are zeroed.
VEX.128 and EVEX.128 encoded versions: The first source operand is an XMM register. The second source operand
is an XMM register or 128-bit memory location. The destination operand is an XMM register (conditionally updated
with writemask k1 in case of EVEX). The upper bits (MAXVL-1:128) of the corresponding ZMM register destination
are zeroed.
128-bit Legacy SSE version: The second source can be an XMM register or an 128-bit memory location. The desti-
nation is not distinct from the first source XMM register and the upper bits (MAXVL-1:128) of the corresponding
register destination are unmodified.
Operation
VXORPS (EVEX Encoded Versions)
(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j := 0 TO KL-1
i := j * 32
IF k1[j] OR *no writemask* THEN
IF (EVEX.b == 1) AND (SRC2 *is memory*)
THEN DEST[i+31:i] := SRC1[i+31:i] BITWISE XOR SRC2[31:0];
ELSE DEST[i+31:i] := SRC1[i+31:i] BITWISE XOR SRC2[i+31:i];
FI;
ELSE
IF *merging-masking* ; merging-masking
THEN *DEST[i+31:i] remains unchanged*
ELSE *zeroing-masking* ; zeroing-masking
DEST[i+31:i] = 0
FI
FI;
ENDFOR
DEST[MAXVL-1:VL] := 0
Intel C/C++ Compiler Intrinsic Equivalent
VXORPS __m512 _mm512_xor_ps (__m512 a, __m512 b);
VXORPS __m512 _mm512_mask_xor_ps (__m512 src, __mmask16 m, __m512 a, __m512 b);
VXORPS __m512 _mm512_maskz_xor_ps (__mmask16 m, __m512 a, __m512 b);
VXORPS __m256 _mm256_xor_ps (__m256 a, __m256 b);
VXORPS __m256 _mm256_mask_xor_ps (__m256 src, __mmask8 m, __m256 a, __m256 b);
VXORPS __m256 _mm256_maskz_xor_ps (__mmask8 m, __m256 a, __m256 b);
XORPS __m128 _mm_xor_ps (__m128 a, __m128 b);
VXORPS __m128 _mm_mask_xor_ps (__m128 src, __mmask8 m, __m128 a, __m128 b);
VXORPS __m128 _mm_maskz_xor_ps (__mmask8 m, __m128 a, __m128 b);
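For illustration only (not part of the reference text above): because XORPS is a purely bitwise operation on the
float lanes, XORing with -0.0f (only the sign bit set) negates every element without signaling any floating-point
exception. A minimal C sketch for the 512-bit form, assuming AVX-512DQ support; the function name is illustrative.

#include <immintrin.h>

__m512 negate_ps(__m512 x)
{
    return _mm512_xor_ps(x, _mm512_set1_ps(-0.0f));
}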
Other Exceptions
Non-EVEX-encoded instructions, see Table 2-21, “Type 4 Class Exception Conditions.”
EVEX-encoded instructions, see Table 2-49, “Type E4 Class Exception Conditions.”
10. Updates to Chapter 1, Volume 3A
Change bars and violet text show changes to Chapter 1 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated Section 1.1, “Overview of the System Programming Guide,” with the newly added Chapter 4, “Linear-
Address Pre-Processing.”
The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part
1 (order number 253668), the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B: System
Programming Guide, Part 2 (order number 253669), the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 3C: System Programming Guide, Part 3 (order number 326019), and the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 3D: System Programming Guide, Part 4 (order number
332831) are part of a set that describes the architecture and programming environment of Intel 64 and IA-32
Architecture processors. The other volumes in this set are:
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture (order number
253665).
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C, & 2D: Instruction Set
Reference (order numbers 253666, 253667, 326018, and 334569).
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4: Model-Specific Registers
(order number 335592).
The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, describes the basic architecture
and programming environment of Intel 64 and IA-32 processors. The Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volumes 2A, 2B, 2C, & 2D, describe the instruction set of the processor and the opcode struc-
ture. These volumes apply to application programmers and to programmers who write operating systems or exec-
utives. The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A, 3B, 3C, & 3D, describe
the operating-system support environment of Intel 64 and IA-32 processors. These volumes target operating-
system and BIOS designers. In addition, Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume
3B, and Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C, address the programming
environment for classes of software that host operating systems. The Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 4, describes the model-specific registers of Intel 64 and IA-32 processors.
programming the LINT0 and LINT1 inputs and gives an example of how to program the LINT0 and LINT1 pins for
specific interrupt vectors.
Chapter 8 — User Interrupts. Describes user interrupts supported by Intel 64 and IA-32 processors.
Chapter 9 — Task Management. Describes mechanisms the Intel 64 and IA-32 architectures provide to support
multitasking and inter-task protection.
Chapter 10 — Multiple-Processor Management. Describes the instructions and flags that support multiple
processors with shared memory, memory ordering, and Intel® Hyper-Threading Technology. Includes MP initializa-
tion for P6 family processors and gives an example of how to use the MP protocol to boot P6 family processors in an
MP system.
Chapter 11 — Processor Management and Initialization. Defines the state of an Intel 64 or IA-32 processor
after reset initialization. This chapter also explains how to set up an Intel 64 or IA-32 processor for real-address
mode operation and protected-mode operation, and how to switch between modes.
Chapter 12 — Advanced Programmable Interrupt Controller (APIC). Describes the programming interface
to the local APIC and gives an overview of the interface between the local APIC and the I/O APIC. Also describes
the formats of the messages transmitted on the APIC bus for P6 family and Pentium processors.
Chapter 13 — Memory Cache Control. Describes the general concept of caching and the caching mechanisms
supported by the Intel 64 or IA-32 architectures. This chapter also describes the memory type range registers
(MTRRs) and how they can be used to map memory types of physical memory. Information on using the new cache
control and memory streaming instructions introduced with the Pentium III, Pentium 4, and Intel Xeon processors is
also given.
Chapter 14 — Intel® MMX™ Technology System Programming. Describes those aspects of the Intel® MMX™
technology that must be handled and considered at the system programming level, including: task switching,
exception handling, and compatibility with existing system environments.
Chapter 15 — System Programming For Instruction Set Extensions And Processor Extended States.
Describes the operating system requirements to support SSE/SSE2/SSE3/SSSE3/SSE4 extensions, including task
switching, exception handling, and compatibility with existing system environments. The latter part of this chapter
describes the extensible framework of operating system requirements to support processor extended states.
Processor extended state may be required by instruction set extensions beyond those of
SSE/SSE2/SSE3/SSSE3/SSE4 extensions.
Chapter 16 — Power and Thermal Management. Describes facilities of Intel 64 and IA-32 architecture used for
power management and thermal monitoring.
Chapter 17 — Machine-Check Architecture. Describes the machine-check architecture and machine-check
exception mechanism found in the Pentium 4, Intel Xeon, and P6 family processors. It also covers a signaling
mechanism that lets software respond to hardware-corrected machine-check errors.
Chapter 18 — Interpreting Machine-Check Error Codes. Gives an example of how to interpret the error codes
for a machine-check error that occurred on a P6 family processor.
Chapter 19 — Debug, Branch Profile, TSC, and Resource Monitoring Features. Describes the debugging
registers and other debug mechanisms provided in Intel 64 or IA-32 processors. This chapter also describes the
time-stamp counter.
Chapter 20 — Last Branch Records. Describes the Last Branch Records (architectural feature).
Chapter 21 — Performance Monitoring. Describes the Intel 64 and IA-32 architectures’ facilities for monitoring
performance.
Chapter 22 — 8086 Emulation. Describes the real-address and virtual-8086 modes of the IA-32 architecture.
Chapter 23 — Mixing 16-Bit and 32-Bit Code. Describes how to mix 16-bit and 32-bit code modules within the
same program or task.
Chapter 24 — IA-32 Architecture Compatibility. Describes architectural compatibility among IA-32 proces-
sors.
Chapter 25 — Introduction to Virtual Machine Extensions. Describes the basic elements of virtual machine
architecture and the virtual machine extensions for Intel 64 and IA-32 Architectures.
1-2 Vol. 3A
ABOUT THIS MANUAL
Chapter 26 — Virtual Machine Control Structures. Describes components that manage VMX operation. These
include the working-VMCS pointer and the controlling-VMCS pointer.
Chapter 27 — VMX Non-Root Operation. Describes VMX non-root operation. Processor oper-
ation in VMX non-root mode can be restricted programmatically such that certain operations, events or conditions
can cause the processor to transfer control from the guest (running in VMX non-root mode) to the monitor software
(running in VMX root mode).
Chapter 28 — VM Entries. Describes VM entries. A VM entry transitions the processor from the VMM running in
VMX root mode to a VM running in VMX non-root mode. VM entry is performed by executing the VMLAUNCH or
VMRESUME instruction.
Chapter 29 — VM Exits. Describes VM exits. Certain events, operations, or situations while the processor is in VMX
non-root operation may cause VM-exit transitions. VM exits can also occur on failed VM entries.
Chapter 30 — VMX Support for Address Translation. Describes virtual-machine extensions that support
address translation and the virtualization of physical memory.
Chapter 31 — APIC Virtualization and Virtual Interrupts. Describes the VMCS including controls that enable
the virtualization of interrupts and the Advanced Programmable Interrupt Controller (APIC).
Chapter 32 — VMX Instruction Reference. Describes the virtual-machine extensions (VMX). VMX is intended
for a system executive to support virtualization of processor hardware and a system software layer acting as a host
to multiple guest software environments.
Chapter 33 — System Management Mode. Describes Intel 64 and IA-32 architectures’ system management
mode (SMM) facilities.
Chapter 34 — Intel® Processor Trace. Describes details of Intel® Processor Trace.
Chapter 35 — Introduction to Intel® Software Guard Extensions. Provides an overview of the Intel® Soft-
ware Guard Extensions (Intel® SGX) set of instructions.
Chapter 36 — Enclave Access Control and Data Structures. Describes Enclave Access Control procedures and
defines various Intel SGX data structures.
Chapter 37 — Enclave Operation. Describes enclave creation and initialization, adding pages and measuring an
enclave, and enclave entry and exit.
Chapter 38 — Enclave Exiting Events. Describes enclave-exiting events (EEE) and asynchronous enclave exit
(AEX).
Chapter 39 — SGX Instruction References. Describes the supervisor and user level instructions provided by
Intel SGX.
Chapter 40 — Intel® SGX Interactions with IA32 and Intel® 64 Architecture. Describes the Intel SGX
collection of enclave instructions for creating protected execution environments on processors supporting IA32 and
Intel 64 architectures.
Chapter 41 — Enclave Code Debug and Profiling. Describes enclave code debug processes and options.
Appendix A — VMX Capability Reporting Facility. Describes the VMX capability MSRs. Support for specific VMX
features is determined by reading capability MSRs.
Appendix B — Field Encoding in VMCS. Enumerates all fields in the VMCS and their encodings. Fields are
grouped by width (16-bit, 32-bit, etc.) and type (guest-state, host-state, etc.).
Appendix C — VM Basic Exit Reasons. Describes the 32-bit fields that encode reasons for a VM exit. Examples
of exit reasons include, but are not limited to: software interrupts, processor exceptions, software traps, NMIs,
external interrupts, and triple faults.
Vol. 3A 1-3
11. Updates to Chapter 2, Volume 3A
Change bars and violet text show changes to Chapter 2 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Added new LAM bits for CR3 and CR4. Bit 61 of CR3 is “LAM_U57.” Bit 62 of CR3 is “LAM_U48.” Bit 28 of CR4
is “LAM_SUP.” Figure 2-7 was updated to include these bits as well.
IA-32 architecture (beginning with the Intel386 processor family) provides extensive support for operating-system
and system-development software. This support offers multiple modes of operation, which include:
• Real mode, protected mode, virtual 8086 mode, and system management mode. These are sometimes
referred to as legacy modes.
Intel 64 architecture supports almost all the system programming facilities available in IA-32 architecture and
extends them to a new operating mode (IA-32e mode) that supports a 64-bit programming environment. IA-32e
mode allows software to operate in one of two sub-modes:
• 64-bit mode supports 64-bit OS and 64-bit applications
• Compatibility mode allows most legacy software to run; it co-exists with 64-bit applications under a 64-bit OS.
The IA-32 system-level architecture includes features to assist in the following operations:
• Memory management.
• Protection of software modules.
• Multitasking.
• Exception and interrupt handling.
• Multiprocessing.
• Cache management.
• Hardware resource and power management.
• Debugging and performance monitoring.
This chapter provides a description of each part of this architecture. It also describes the system registers that are
used to set up and control the processor at the system level and gives a brief overview of the processor’s system-
level (operating system) instructions.
Many features of the system-level architecture are used only by system programmers. However, application
programmers may need to read this chapter and the following chapters in order to create a reliable and secure
environment for application programs.
This overview and most subsequent chapters of this book focus on protected-mode operation of the IA-32 architec-
ture. IA-32e mode operation of the Intel 64 architecture, as it differs from protected mode operation, is also
described.
All Intel 64 and IA-32 processors enter real-address mode following a power-up or reset (see Chapter 11,
“Processor Management and Initialization”). Software then initiates the switch from real-address mode to
protected mode. If IA-32e mode operation is desired, software also initiates a switch from protected mode to IA-
32e mode.
Vol. 3A 2-1
SYSTEM ARCHITECTURE OVERVIEW
[Figure content not reproduced: fragments of Figure 2-1 (system-level registers and data structures in legacy mode) and of the figure captioned below, showing the control registers (CR0–CR4, CR8), RFLAGS, the task register, segment selectors, the GDT, the task-state segment, PKRU, and the 4-level paging structures (PML4, page-directory, and page-table entries) used to translate linear addresses to physical addresses.]
Figure 2-2. System-Level Registers and Data Structures in IA-32e Mode and 4-Level Paging
Vol. 3A 2-3
SYSTEM ARCHITECTURE OVERVIEW
Each segment descriptor has an associated segment selector. A segment selector provides the software that uses
it with an index into the GDT or LDT (the offset of its associated segment descriptor), a global/local flag (deter-
mines whether the selector points to the GDT or the LDT), and access rights information.
To access a byte in a segment, a segment selector and an offset must be supplied. The segment selector provides
access to the segment descriptor for the segment (in the GDT or LDT). From the segment descriptor, the processor
obtains the base address of the segment in the linear address space. The offset then provides the location of the
byte relative to the base address. This mechanism can be used to access any valid code, data, or stack segment,
provided the segment is accessible from the current privilege level (CPL) at which the processor is operating. The
CPL is defined as the protection level of the currently executing code segment.
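A conceptual sketch only (helper names are not architectural): forming a linear address from a segment base and offset with a simple expand-up limit check; real descriptors also carry type, DPL, granularity, and expand-down semantics:

#include <stdbool.h>
#include <stdint.h>

static bool logical_to_linear(uint32_t base, uint32_t limit,
                              uint32_t offset, uint32_t *linear)
{
    if (offset > limit)
        return false;        /* the processor would raise #GP (or #SS for stack segments) */
    *linear = base + offset; /* segment base from the descriptor plus the offset */
    return true;
}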
See Figure 2-1. The solid arrows in the figure indicate a linear address, dashed lines indicate a segment selector,
and the dotted arrows indicate a physical address. For simplicity, many of the segment selectors are shown as
direct pointers to a segment. However, the actual path from a segment selector to its associated segment is always
through a GDT or LDT.
The linear address of the base of the GDT is contained in the GDT register (GDTR); the linear address of the LDT is
contained in the LDT register (LDTR).
1. The word “procedure” is commonly used in this document as a general term for a logical unit or block of code (such as a program, pro-
cedure, function, or routine).
The model-specific registers (MSRs) are a group of registers available primarily to operating-system or executive
procedures (that is, code running at privilege level 0). These registers control items such as the debug extensions,
the performance-monitoring counters, the machine-check architecture, and the memory type ranges (MTRRs).
The number and function of these registers vary among different members of the Intel 64 and IA-32 processor
families. See also: Section 11.4, “Model-Specific Registers (MSRs),” and Chapter 2, “Model-Specific Registers
(MSRs),” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4.
Most systems restrict access to system registers (other than the EFLAGS register) by application programs.
Systems can be designed, however, where all programs and procedures run at the most privileged level (privilege
level 0). In such a case, application programs would be allowed to modify the system registers.
Vol. 3A 2-7
SYSTEM ARCHITECTURE OVERVIEW
interrupt (SMI). In SMM, the processor switches to a separate address space while saving the context of the
currently running program or task. SMM-specific code may then be executed transparently. Upon returning
from SMM, the processor is placed back into its state prior to the SMI.
• Virtual-8086 mode — In protected mode, the processor supports a quasi-operating mode known as virtual-
8086 mode. This mode allows the processor to execute 8086 software in a protected, multitasking environment.
Intel 64 architecture supports all operating modes of IA-32 architecture plus IA-32e mode:
• IA-32e mode — In IA-32e mode, the processor supports two sub-modes: compatibility mode and 64-bit
mode. 64-bit mode provides 64-bit linear addressing and support for physical address space larger than 64
GBytes. Compatibility mode allows most legacy protected-mode applications to run unchanged.
Figure 2-3 shows how the processor moves between operating modes.
[Figure 2-3 not reproduced: transitions among real-address mode, protected mode, virtual-8086 mode, IA-32e mode, and system management mode, controlled by CR0.PE, EFLAGS.VM, IA32_EFER.LME with CR0.PG, SMI#, RSM, and reset; see Section 10.8.5 and Section 10.8.5.4.]
The processor is placed in real-address mode following power-up or a reset. The PE flag in control register CR0 then
controls whether the processor is operating in real-address or protected mode. See also: Section 11.9, “Mode
Switching,” and Section 5.1.2, “Paging-Mode Enabling.”
The VM flag in the EFLAGS register determines whether the processor is operating in protected mode or virtual-
8086 mode. Transitions between protected mode and virtual-8086 mode are generally carried out as part of a task
switch or a return from an interrupt or exception handler. See also: Section 22.2.5, “Entering Virtual-8086 Mode.”
The LMA bit (IA32_EFER.LMA[bit 10]) determines whether the processor is operating in IA-32e mode. When
running in IA-32e mode, 64-bit or compatibility sub-mode operation is determined by the CS.L bit of the code segment.
The processor enters IA-32e mode from protected mode by setting the LME bit (IA32_EFER.LME[bit 8]) and then
enabling paging. See also: Chapter 11, “Processor Management and Initialization.”
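As an illustrative sketch (not taken from the manual), system software running at CPL 0 can confirm that IA-32e mode is active by reading the IA32_EFER MSR (address C0000080H) with RDMSR and testing bit 10 (LMA); the helper below uses GCC-style inline assembly:

#include <stdint.h>

static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

static inline int ia32e_mode_active(void)
{
    return (rdmsr(0xC0000080u) >> 10) & 1; /* IA32_EFER.LMA */
}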
The processor switches to SMM whenever it receives an SMI while the processor is in real-address, protected,
virtual-8086, or IA-32e modes. Upon execution of the RSM instruction, the processor always returns to the mode
it was in when the SMI occurred.
2-8 Vol. 3A
SYSTEM ARCHITECTURE OVERVIEW
[Figure not reproduced: layout of the IA32_EFER register, showing the SYSCALL Enable bit (bit 0) and reserved fields.]
Vol. 3A 2-9
SYSTEM ARCHITECTURE OVERVIEW
POPF, POPFD, or IRET instruction, a debug exception is generated after the instruction that follows the
POPF, POPFD, or IRET.
[Figure not reproduced: the EFLAGS register, highlighting the system flags listed below; bits 31:22 and other unused bits are reserved (set to 0).]
ID — Identification Flag
VIP — Virtual Interrupt Pending
VIF — Virtual Interrupt Flag
AC — Alignment Check / Access Control
VM — Virtual-8086 Mode
RF — Resume Flag
NT — Nested Task Flag
IOPL— I/O Privilege Level
IF — Interrupt Enable Flag
TF — Trap Flag
Reserved
IF Interrupt enable (bit 9) — Controls the response of the processor to maskable hardware interrupt
requests (see also: Section 7.3.2, “Maskable Hardware Interrupts”). The flag is set to respond to maskable
hardware interrupts; cleared to inhibit maskable hardware interrupts. The IF flag does not affect the gener-
ation of exceptions or nonmaskable interrupts (NMI interrupts). The CPL, IOPL, and the state of the VME
flag in control register CR4 determine whether the IF flag can be modified by the CLI, STI, POPF, POPFD,
and IRET.
IOPL I/O privilege level field (bits 12 and 13) — Indicates the I/O privilege level (IOPL) of the currently
running program or task. The CPL of the currently running program or task must be less than or equal to
the IOPL to access the I/O address space. The POPF and IRET instructions can modify this field only when
operating at a CPL of 0.
The IOPL is also one of the mechanisms that controls the modification of the IF flag and the handling of
interrupts in virtual-8086 mode when virtual mode extensions are in effect (when CR4.VME = 1). See also:
Chapter 20, “Input/Output,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume
1.
NT Nested task (bit 14) — Controls the chaining of interrupted and called tasks. The processor sets this flag
on calls to a task initiated with a CALL instruction, an interrupt, or an exception. It examines and modifies
this flag on returns from a task initiated with the IRET instruction. The flag can be explicitly set or cleared
with the POPF/POPFD instructions; however, changes to the state of this flag can generate unexpected
exceptions in application programs.
See also: Section 9.4, “Task Linking.”
RF Resume (bit 16) — Controls the processor’s response to instruction-breakpoint conditions. When set, this
flag temporarily disables debug exceptions (#DB) from being generated for instruction breakpoints
(although other exception conditions can cause an exception to be generated). When clear, instruction
breakpoints will generate debug exceptions.
The primary function of the RF flag is to allow the restarting of an instruction following a debug exception
that was caused by an instruction breakpoint condition. Here, debug software must set this flag in the
EFLAGS image on the stack just prior to returning to the interrupted program with IRETD (to prevent the
instruction breakpoint from causing another debug exception). The processor then automatically clears
this flag after the instruction returned to has been successfully executed, enabling instruction breakpoint
faults again.
See also: Section 19.3.1.1, “Instruction-Breakpoint Exception Condition.”
VM Virtual-8086 mode (bit 17) — Set to enable virtual-8086 mode; clear to return to protected mode.
2-10 Vol. 3A
SYSTEM ARCHITECTURE OVERVIEW
Vol. 3A 2-11
SYSTEM ARCHITECTURE OVERVIEW
2-12 Vol. 3A
SYSTEM ARCHITECTURE OVERVIEW
Vol. 3A 2-13
SYSTEM ARCHITECTURE OVERVIEW
table and PML5 table, respectively. If PCIDs are enabled, CR3 has a format different from that illustrated in
Figure 2-7. See Section 5.5, “4-Level Paging and 5-Level Paging.”
When linear-address masking is supported, CR3 includes two bits that control the masking of user pointers
(see Section 4.4, “Linear-Address Masking”).
See also: Chapter 5, “Paging.”
• CR4 — Contains a group of flags that enable several architectural extensions and indicate operating system or
executive support for specific processor capabilities. Bits CR4[63:32] can be used only for features that are
available exclusively in IA-32e mode and that are enabled after entering 64-bit mode; these bits have no effect
outside of IA-32e mode.
• CR8 — Provides read and write access to the Task Priority Register (TPR). It specifies the priority threshold
value that operating systems use to control the priority class of external interrupts allowed to interrupt the
processor. This register is available only in 64-bit mode. However, interrupt filtering continues to apply in
compatibility mode.
[Figure 2-7 not reproduced: layouts of the control registers. The CR4 row shows the architectural-extension and capability flags; the CR3 row shows the page-directory base, the PCD and PWT bits, and the new LAM_U48 (bit 62) and LAM_U57 (bit 61) bits; the CR0 row shows the PG, CD, NW, AM, WP, NE, ET, TS, EM, MP, and PE flags. Remaining bits are reserved.]
2-14 Vol. 3A
SYSTEM ARCHITECTURE OVERVIEW
Vol. 3A 2-15
SYSTEM ARCHITECTURE OVERVIEW
actually executed by the new task. The processor sets this flag on every task switch and tests it when
executing x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instructions.
• If the TS flag is set and the EM flag (bit 2 of CR0) is clear, a device-not-available exception (#NM) is
raised prior to the execution of any x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction, with the
exception of PAUSE, PREFETCHh, SFENCE, LFENCE, MFENCE, MOVNTI, CLFLUSH, CRC32, and POPCNT.
See the paragraph below for the special case of the WAIT/FWAIT instructions.
• If the TS flag is set and the MP flag (bit 1 of CR0) and EM flag are clear, an #NM exception is not raised
prior to the execution of an x87 FPU WAIT/FWAIT instruction.
• If the EM flag is set, the setting of the TS flag has no effect on the execution of x87
FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instructions.
Table 2-2 shows the actions taken when the processor encounters an x87 FPU instruction based on the
settings of the TS, EM, and MP flags. Tables 14-1 and 15-1 show the actions taken when the processor
encounters an MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction.
The processor does not automatically save the context of the x87 FPU, XMM, and MXCSR registers on a
task switch. Instead, it sets the TS flag, which causes the processor to raise an #NM exception whenever it
encounters an x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction in the instruction stream for the
new task (with the exception of the instructions listed above).
The fault handler for the #NM exception can then be used to clear the TS flag (with the CLTS instruction)
and save the context of the x87 FPU, XMM, and MXCSR registers. If the task never encounters an x87
FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction, the x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4
context is never saved.
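A hedged sketch of this lazy save/restore approach follows; the task structure, the fpu_owner and current_task variables, and the handler entry point are hypothetical, and a real kernel would typically manage extended state with XSAVE/XRSTOR rather than FXSAVE/FXRSTOR:

#include <stdint.h>

struct fxsave_image { uint8_t bytes[512]; } __attribute__((aligned(16)));

struct task {
    struct fxsave_image fpu; /* saved x87/XMM/MXCSR state */
    int fpu_used;
};

static struct task *fpu_owner;    /* task whose state is currently live in the FPU */
static struct task *current_task;

/* Called from the #NM (device-not-available) exception handler. */
void handle_device_not_available(void)
{
    __asm__ volatile("clts"); /* clear CR0.TS so FPU/SSE instructions can execute */

    if (fpu_owner && fpu_owner != current_task)
        __asm__ volatile("fxsave %0" : "=m"(fpu_owner->fpu));
    if (current_task->fpu_used)
        __asm__ volatile("fxrstor %0" : : "m"(current_task->fpu));

    current_task->fpu_used = 1;
    fpu_owner = current_task;
}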
Table 2-2. Action Taken By x87 FPU Instructions for Different Combinations of EM, MP, and TS
CR0 Flags           x87 FPU Instruction Type
EM  MP  TS          Floating-Point      WAIT/FWAIT
0   0   0           Execute             Execute
0   0   1           #NM exception       Execute
0   1   0           Execute             Execute
0   1   1           #NM exception       #NM exception
1   0   0           #NM exception       Execute
1   0   1           #NM exception       Execute
1   1   0           #NM exception       Execute
1   1   1           #NM exception       #NM exception
CR0.EM
Emulation (bit 2 of CR0) — Indicates that the processor does not have an internal or external x87 FPU when set;
indicates an x87 FPU is present when clear. This flag also affects the execution of
MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instructions.
When the EM flag is set, execution of an x87 FPU instruction generates a device-not-available exception
(#NM). This flag must be set when the processor does not have an internal x87 FPU or is not connected to
an external math coprocessor. Setting this flag forces all floating-point instructions to be handled by soft-
ware emulation. Table 11-3 shows the recommended setting of this flag, depending on the IA-32 processor
and x87 FPU or math coprocessor present in the system. Table 2-2 shows the interaction of the EM, MP, and
TS flags.
Also, when the EM flag is set, execution of an MMX instruction causes an invalid-opcode exception (#UD)
to be generated (see Table 14-1). Thus, if an IA-32 or Intel 64 processor incorporates MMX technology, the
EM flag must be set to 0 to enable execution of MMX instructions.
Similarly for SSE/SSE2/SSE3/SSSE3/SSE4 extensions, when the EM flag is set, execution of most
SSE/SSE2/SSE3/SSSE3/SSE4 instructions causes an invalid opcode exception (#UD) to be generated (see
2-16 Vol. 3A
SYSTEM ARCHITECTURE OVERVIEW
1. Earlier versions of this manual used the term “IA-32e paging” to identify 4-level paging.
Vol. 3A 2-17
SYSTEM ARCHITECTURE OVERVIEW
NOTE
CPUID feature flag FXSR indicates availability of the FXSAVE/FXRSTOR instructions. The OSFXSR
bit provides operating system software with a means of enabling FXSAVE/FXRSTOR to save/restore
the contents of the x87 FPU, XMM, and MXCSR registers. Consequently, the OSFXSR bit indicates that
the operating system provides context switch support for SSE/SSE2/SSE3/SSSE3/SSE4.
CR4.OSXMMEXCPT
Operating System Support for Unmasked SIMD Floating-Point Exceptions (bit 10 of CR4) —
When set, indicates that the operating system supports the handling of unmasked SIMD floating-point
exceptions through an exception handler that is invoked when a SIMD floating-point exception (#XM) is
2-18 Vol. 3A
SYSTEM ARCHITECTURE OVERVIEW
generated. SIMD floating-point exceptions are only generated by SSE/SSE2/SSE3/SSE4.1 SIMD floating-
point instructions.
The operating system or executive must explicitly set this flag. If this flag is not set, the processor will
generate an invalid opcode exception (#UD) whenever it detects an unmasked SIMD floating-point excep-
tion.
CR4.UMIP
User-Mode Instruction Prevention (bit 11 of CR4) — When set, the following instructions cannot be
executed if CPL > 0: SGDT, SIDT, SLDT, SMSW, and STR. An attempt at such execution causes a general-
protection exception (#GP).
CR4.LA57
57-bit linear addresses (bit 12 of CR4) — When set in IA-32e mode, the processor uses 5-level paging
to translate 57-bit linear addresses. When clear in IA-32e mode, the processor uses 4-level paging to
translate 48-bit linear addresses. This bit cannot be modified in IA-32e mode.
See also: Chapter 5, “Paging.”
CR4.VMXE
VMX-Enable Bit (bit 13 of CR4) — Enables VMX operation when set. See Chapter 25, “Introduction to
Virtual Machine Extensions.”
CR4.SMXE
SMX-Enable Bit (bit 14 of CR4) — Enables SMX operation when set. See Chapter 7, “Safer Mode Exten-
sions Reference,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2D.
CR4.FSGSBASE
FSGSBASE-Enable Bit (bit 16 of CR4) — Enables the instructions RDFSBASE, RDGSBASE, WRFSBASE,
and WRGSBASE.
CR4.PCIDE
PCID-Enable Bit (bit 17 of CR4) — Enables process-context identifiers (PCIDs) when set. See Section
5.10.1, “Process-Context Identifiers (PCIDs).” Applies only in IA-32e mode (if IA32_EFER.LMA = 1).
CR4.OSXSAVE
XSAVE and Processor Extended States-Enable Bit (bit 18 of CR4) — When set, this flag: (1) indi-
cates (via CPUID.01H:ECX.OSXSAVE[bit 27]) that the operating system supports the use of the XGETBV,
XSAVE, and XRSTOR instructions by general software; (2) enables the XSAVE and XRSTOR instructions to
save and restore the x87 FPU state (including MMX registers), the SSE state (XMM registers and MXCSR),
along with other processor extended states enabled in XCR0; (3) enables the processor to execute XGETBV
and XSETBV instructions in order to read and write XCR0. See Section 2.6 and Chapter 15, “System
Programming for Instruction Set Extensions and Processor Extended States.”
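For example, once the operating system has set CR4.OSXSAVE, application code can read XCR0 with XGETBV; a minimal sketch using the _xgetbv intrinsic (provided by immintrin.h on compilers that support the XSAVE intrinsics; real code should first verify CPUID.01H:ECX.OSXSAVE[bit 27] to avoid a #UD):

#include <immintrin.h>
#include <stdint.h>

/* Returns nonzero if the OS has enabled both x87 (bit 0) and SSE (bit 1) state in XCR0. */
int os_manages_sse_state(void)
{
    uint64_t xcr0 = _xgetbv(0); /* XGETBV with ECX = 0 reads XCR0 */
    return (xcr0 & 0x3) == 0x3;
}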
CR4.KL
Key-Locker-Enable Bit (bit 19 of CR4) — When set, the LOADIWKEY instruction is enabled; in addition,
if support for the AES Key Locker instructions has been activated by system firmware,
CPUID.19H:EBX.AESKLE[bit 0] is enumerated as 1 and the AES Key Locker instructions are enabled.1
When clear, CPUID.19H:EBX.AESKLE[bit 0] is enumerated as 0 and execution of any Key Locker instruction
causes an invalid-opcode exception (#UD).
CR4.SMEP
SMEP-Enable Bit (bit 20 of CR4) — Enables supervisor-mode execution prevention (SMEP) when set.
See Section 5.6, “Access Rights.”
CR4.SMAP
SMAP-Enable Bit (bit 21 of CR4) — Enables supervisor-mode access prevention (SMAP) when set. See
Section 5.6, “Access Rights.”
CR4.PKE
Enable protection keys for user-mode pages (bit 22 of CR4) — 4-level paging and 5-level paging
1. Software can check CPUID.19H:EBX.AESKLE[bit 0] after setting CR4.KL to determine whether the AES Key Locker instructions have
been enabled. Note that some processors may allow enabling of those instructions without activation by system firmware. Some
processors may not support use of the AES Key Locker instructions in system-management mode (SMM). Those processors enumer-
ate CPUID.19H:EBX.AESKLE[bit 0] as 0 in SMM regardless of the setting of CR4.KL.
Vol. 3A 2-19
SYSTEM ARCHITECTURE OVERVIEW
associate each user-mode linear address with a protection key. When set, this flag indicates (via
CPUID.(EAX=07H,ECX=0H):ECX.OSPKE [bit 4]) that the operating system supports use of the PKRU
register to specify, for each protection key, whether user-mode linear addresses with that protection key
can be read or written. This bit also enables access to the PKRU register using the RDPKRU and WRPKRU
instructions.
CR4.CET
Control-flow Enforcement Technology (bit 23 of CR4) — Enables control-flow enforcement tech-
nology when set. See Chapter 18, “Control-flow Enforcement Technology (CET)‚” of the IA-32 Intel® Archi-
tecture Software Developer’s Manual, Volume 1. This flag can be set only if CR0.WP is set, and it must be
clear before CR0.WP can be cleared (see below).
CR4.PKS
Enable protection keys for supervisor-mode pages (bit 24 of CR4) — 4-level paging and 5-level
paging associate each supervisor-mode linear address with a protection key. When set, this flag allows use
of the IA32_PKRS MSR to specify, for each protection key, whether supervisor-mode linear addresses with
that protection key can be read or written.
CR4.UINTR
User Interrupts Enable Bit (bit 25 of CR4) — Enables user interrupts when set, including user-interrupt
delivery, user-interrupt notification identification, and the user-interrupt instructions.
CR4.LAM_SUP
Supervisor LAM enable (bit 28 of CR4) — When set, enables LAM (linear-address masking) for super-
visor pointers. See Section 4.4, “Linear-Address Masking.”
CR8.TPL
Task Priority Level (bits 3:0 of CR8) — This field sets the threshold value corresponding to the highest-
priority interrupt to be blocked. A value of 0 means all interrupts are enabled; a value of 15 means all
interrupts are disabled. This field is available only in 64-bit mode.
2-20 Vol. 3A
SYSTEM ARCHITECTURE OVERVIEW
[Figure not reproduced: layout of the XCR0 register, with the undefined bit ranges marked “Reserved (must be 0).”]
Vol. 3A 2-21
SYSTEM ARCHITECTURE OVERVIEW
• XCR0.TILEDATA (bit 18): If 1, and if XCR0.TILECFG is also 1, Intel AMX instructions can be executed and the
XSAVE feature set can be used to manage TILEDATA.
An attempt to use XSETBV to write to XCR0 results in general-protection exceptions (#GP) if it would do any of the
following:
• Set a bit reserved in XCR0 for a given processor (as determined by the contents of EAX and EDX after executing
CPUID with EAX=0DH, ECX= 0H).
• Clear XCR0.x87.
• Clear XCR0.SSE and set XCR0.AVX.
• Clear XCR0.AVX and set any of XCR0.opmask, XCR0.ZMM_Hi256, or XCR0.Hi16_ZMM.
• Set either XCR0.BNDREG or XCR0.BNDCSR while not setting the other.
• Set any of XCR0.opmask, XCR0.ZMM_Hi256, and XCR0.Hi16_ZMM while not setting all of them.
• Set either XCR0.TILECFG or XCR0.TILEDATA while not setting the other.
After reset, all bits (except bit 0) in XCR0 are cleared to zero; XCR0[0] is set to 1.
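A hedged sketch of how system software might validate a proposed XCR0 value against these consistency rules before executing XSETBV (the helper name is hypothetical; bit positions follow the architectural XCR0 assignments):

#include <stdbool.h>
#include <stdint.h>

#define XCR0_X87       (1ull << 0)
#define XCR0_SSE       (1ull << 1)
#define XCR0_AVX       (1ull << 2)
#define XCR0_BNDREG    (1ull << 3)
#define XCR0_BNDCSR    (1ull << 4)
#define XCR0_OPMASK    (1ull << 5)
#define XCR0_ZMM_HI256 (1ull << 6)
#define XCR0_HI16_ZMM  (1ull << 7)
#define XCR0_TILECFG   (1ull << 17)
#define XCR0_TILEDATA  (1ull << 18)

/* 'supported' is the mask reported in CPUID.(EAX=0DH,ECX=0):EDX:EAX. */
bool xcr0_value_is_legal(uint64_t v, uint64_t supported)
{
    const uint64_t avx512 = XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM;
    const uint64_t mpx    = XCR0_BNDREG | XCR0_BNDCSR;
    const uint64_t amx    = XCR0_TILECFG | XCR0_TILEDATA;

    if (v & ~supported) return false;                    /* reserved bit set */
    if (!(v & XCR0_X87)) return false;                   /* x87 must remain set */
    if ((v & XCR0_AVX) && !(v & XCR0_SSE)) return false; /* AVX requires SSE */
    if ((v & avx512) && ((v & avx512) != avx512 || !(v & XCR0_AVX)))
        return false;                                    /* AVX-512 bits all or none, and require AVX */
    if ((v & mpx) && (v & mpx) != mpx) return false;     /* BNDREG and BNDCSR together */
    if ((v & amx) && (v & amx) != amx) return false;     /* TILECFG and TILEDATA together */
    return true;
}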
[Figure not reproduced: layout of the PKRU register — for each protection key i (0–15), an access-disable bit ADi occupies bit 2i and a write-disable bit WDi occupies bit 2i+1.]
2-22 Vol. 3A
SYSTEM ARCHITECTURE OVERVIEW
Vol. 3A 2-23
SYSTEM ARCHITECTURE OVERVIEW
2-24 Vol. 3A
SYSTEM ARCHITECTURE OVERVIEW
The LAR (load access rights) instruction verifies the accessibility of a specified segment and loads access rights
information from the segment’s segment descriptor into a general-purpose register. Software can then examine
the access rights to determine if the segment type is compatible with its intended use. See Section 6.10.1,
“Checking Access Rights (LAR Instruction),” for a detailed explanation of the function and use of this instruction.
The LSL (load segment limit) instruction verifies the accessibility of a specified segment and loads the segment
limit from the segment’s segment descriptor into a general-purpose register. Software can then compare the
segment limit with an offset into the segment to determine whether the offset lies within the segment. See Section
6.10.3, “Checking That the Pointer Offset Is Within Limits (LSL Instruction),” for a detailed explanation of the func-
tion and use of this instruction.
The VERR (verify for reading) and VERW (verify for writing) instructions verify if a selected segment is readable or
writable, respectively, at a given CPL. See Section 6.10.2, “Checking Read/Write Rights (VERR and VERW Instruc-
tions),” for a detailed explanation of the function and use of these instructions.
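As an illustrative sketch (GCC-style inline assembly with flag-output operands; not manual pseudocode), software can use LSL to retrieve a segment limit and VERR to test readability:

#include <stdint.h>

/* Returns 1 and stores the limit if LSL accepts the selector; *limit is undefined otherwise. */
static inline int load_segment_limit(uint16_t sel, uint32_t *limit)
{
    int ok;
    __asm__ ("lsl %2, %1" : "=@ccz"(ok), "=r"(*limit) : "r"((uint32_t)sel));
    return ok;
}

/* Returns 1 if the segment named by 'sel' is readable at the current CPL/RPL. */
static inline int segment_readable(uint16_t sel)
{
    int ok;
    __asm__ ("verr %1" : "=@ccz"(ok) : "rm"(sel));
    return ok;
}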
Vol. 3A 2-25
SYSTEM ARCHITECTURE OVERVIEW
The INVLPG (invalidate TLB entry) instruction invalidates (flushes) the TLB entry for a specified page.
The HLT (halt processor) instruction stops the processor until an enabled interrupt (such as NMI or SMI, which are
normally enabled), a debug exception, the BINIT# signal, the INIT# signal, or the RESET# signal is received. The
processor generates a special bus cycle to indicate that the halt mode has been entered.
Hardware may respond to this signal in a number of ways. An indicator light on the front panel may be turned on.
An NMI interrupt for recording diagnostic information may be generated. Reset initialization may be invoked (note
that the BINIT# pin was introduced with the Pentium Pro processor). If any non-wake events are pending during
shutdown, they will be handled after the wake event from shutdown is processed (for example, A20M# interrupts).
The LOCK prefix invokes a locked (atomic) read-modify-write operation when modifying a memory operand. This
mechanism is used to allow reliable communications between processors in multiprocessor systems, as described
below:
• In the Pentium processor and earlier IA-32 processors, the LOCK prefix causes the processor to assert the
LOCK# signal during the instruction. This always causes an explicit bus lock to occur.
• In the Pentium 4, Intel Xeon, and P6 family processors, the locking operation is handled with either a cache lock
or bus lock. If a memory access is cacheable and affects only a single cache line, a cache lock is invoked and
the system bus and the actual memory location in system memory are not locked during the operation. Here,
other Pentium 4, Intel Xeon, or P6 family processors on the bus write-back any modified data and invalidate
their caches as necessary to maintain system memory coherency. If the memory access is not cacheable
and/or it crosses a cache line boundary, the processor’s LOCK# signal is asserted and the processor does not
respond to requests for bus control during the locked operation.
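For example, a locked increment such as compilers emit for the C11/GNU atomic built-ins, shown here alongside an explicit LOCK-prefixed equivalent for illustration:

#include <stdint.h>

/* Atomic increment of a shared counter; the compiler emits a LOCK-prefixed instruction. */
void counter_inc(volatile uint64_t *ctr)
{
    __atomic_fetch_add(ctr, 1, __ATOMIC_SEQ_CST);
}

/* The same operation written with an explicit LOCK prefix. */
void counter_inc_asm(volatile uint64_t *ctr)
{
    __asm__ volatile("lock incq %0" : "+m"(*ctr) : : "memory");
}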
The RSM (return from SMM) instruction restores the processor (from a context dump) to the state it was in prior to
a system management mode (SMM) interrupt.
2-26 Vol. 3A
SYSTEM ARCHITECTURE OVERVIEW
of programmable and fixed-function performance monitoring counters for each processor generation are described
in Chapter 21, “Performance Monitoring.”
The programmable performance counters can support counting either the occurrence or duration of events. Events
that can be monitored on programmable counters generally are model specific (except for architectural perfor-
mance events enumerated by CPUID leaf 0AH); they may include the number of instructions decoded, interrupts
received, or the number of cache loads. Individual counters can be set up to monitor different events. Use the
system instruction WRMSR to set up values in one of the IA32_PERFEVTSELx MSRs, in one of the 45 ESCRs and one
of the 18 CCCR MSRs (for Pentium 4 and Intel Xeon processors); or in the PerfEvtSel0 or the PerfEvtSel1 MSR (for
the P6 family processors). The RDPMC instruction loads the current count from the selected counter into the
EDX:EAX registers.
Fixed-function performance counters record only specific events that are defined at https://perfmon-events.intel.com/;
the width and number of fixed-function counters are enumerated by CPUID leaf 0AH.
The time-stamp counter is a model-specific 64-bit counter that is reset to zero each time the processor is reset. If
not reset, the counter will increment approximately 9.5 × 10^16 times per year when the processor is operating at a
clock rate of 3 GHz. At this clock frequency, it would take over 190 years for the counter to wrap around. The RDTSC
instruction loads the current count of the time-stamp counter into the EDX:EAX registers.
See Section 21.1, “Performance Monitoring Overview,” and Section 19.17, “Time-Stamp Counter,” for more infor-
mation about the performance monitoring and time-stamp counters.
The RDTSC instruction was introduced into the IA-32 architecture with the Pentium processor. The RDPMC instruc-
tion was introduced into the IA-32 architecture with the Pentium Pro processor and the Pentium processor with
MMX technology. Earlier Pentium processors have two performance-monitoring counters, but they can be read only
with the RDMSR instruction, and only at privilege level 0.
RDMSR reads the value from the specified MSR to the EDX:EAX registers; WRMSR writes the value in the EDX:EAX
registers to the specified MSR. RDMSR and WRMSR were introduced into the IA-32 architecture with the Pentium
processor.
See Section 11.4, “Model-Specific Registers (MSRs),” for more information.
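A minimal sketch of reading the time-stamp counter from application code (the __rdtsc intrinsic is provided by x86intrin.h on common compilers); RDMSR and WRMSR themselves can be executed only at privilege level 0:

#include <stdint.h>
#include <x86intrin.h>

/* Rough cycle count for a code region. RDTSC is not serializing, so careful
   measurements usually add fencing (e.g., LFENCE) around the reads. */
uint64_t cycles_elapsed(void (*fn)(void))
{
    uint64_t start = __rdtsc();
    fn();
    return __rdtsc() - start;
}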
Vol. 3A 2-27
SYSTEM ARCHITECTURE OVERVIEW
2-28 Vol. 3A
12. Updates to Chapter 3, Volume 3A
Change bars and violet text show changes to Chapter 3 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated the chapter with information on the newly added linear-address pre-processing and references to the
new Chapter 4 where needed.
This chapter describes the Intel 64 and IA-32 architecture’s protected-mode memory management facilities,
including the physical memory requirements, segmentation mechanism, and paging mechanism.
See also: Chapter 6, “Protection‚” (for a description of the processor’s protection mechanism) and Chapter 22,
“8086 Emulation‚” (for a description of memory addressing protection in real-address and virtual-8086 modes).
Vol. 3A 3-1
PROTECTED-MODE MEMORY MANAGEMENT
[Figure 3-1 not reproduced: Segmentation and Paging — a logical address (far pointer), consisting of a segment selector and an offset, is translated through a segment descriptor in the GDT into a linear address; the linear address (directory, table, and offset fields) is then translated by the page directory and page table into a physical address.]
If paging is not used, the linear address space of the processor is mapped directly into the physical address space
of the processor. The physical address space is defined as the range of addresses that the processor can generate on
its address bus.
Because multitasking computing systems commonly define a linear address space much larger than it is economi-
cally feasible to contain all at once in physical memory, some method of “virtualizing” the linear address space is
needed. This virtualization of the linear address space is handled through the processor’s paging mechanism.
Paging supports a “virtual memory” environment where a large linear address space is simulated with a small
amount of physical memory (RAM and ROM) and some disk storage. When using paging, each segment is divided
into pages (typically 4 KBytes each in size), which are stored either in physical memory or on the disk. The oper-
ating system or executive maintains a page directory and a set of page tables to keep track of the pages. When a
program (or task) attempts to access an address location in the linear address space, the processor uses the page
directory and page tables to translate the linear address into a physical address and then performs the requested
operation (read or write) on the memory location.
If the page being accessed is not currently in physical memory, the processor interrupts execution of the program
(by generating a page-fault exception). The operating system or executive then reads the page into physical
memory from the disk and continues executing the program.
When paging is implemented properly in the operating-system or executive, the swapping of pages between phys-
ical memory and the disk is transparent to the correct execution of a program. Even programs written for 16-bit IA-
32 processors can be paged (transparently) when they are run in virtual-8086 mode.
3-2 Vol. 3A
PROTECTED-MODE MEMORY MANAGEMENT
programs to multi-segmented models that employ segmentation to create a robust operating environment in
which multiple programs and tasks can be executed reliably.
The following sections give several examples of how segmentation can be employed in a system to improve
memory management performance and reliability.
[Figure fragments not reproduced: segment-register portions of the flat-model and protected flat-model figures.]
More complexity can be added to this protected flat model to provide more protection. For example, for the paging
mechanism to provide isolation between user and supervisor code and data, four segments need to be defined:
code and data segments at privilege level 3 for the user, and code and data segments at privilege level 0 for the
supervisor. Usually these segments all overlay each other and start at address 0 in the linear address space. This
flat segmentation model along with a simple paging structure can protect the operating system from applications,
and by adding a separate paging structure for each task or process, it can also protect applications from each other.
Similar designs are used by several popular multitasking operating systems.
3-4 Vol. 3A
PROTECTED-MODE MEMORY MANAGEMENT
[Figure not reproduced: the multi-segment model — each segment register holds a selector for a descriptor with its own access rights, limit, and base address, referencing separate code and data segments.]
Access checks can be used to protect not only against referencing an address outside the limit of a segment, but
also against performing disallowed operations in certain segments. For example, since code segments are desig-
nated as read-only segments, hardware can be used to prevent writes into code segments. The access rights infor-
mation created for segments can also be used to set up protection rings or levels. Protection levels can be used to
protect operating-system procedures from unauthorized access by application programs.
Vol. 3A 3-5
PROTECTED-MODE MEMORY MANAGEMENT
protection facilities. For example, it lets read-write protection be enforced on a page-by-page basis. The paging
mechanism also provides two-level user-supervisor protection that can also be specified on a page-by-page basis.
3-6 Vol. 3A
PROTECTED-MODE MEMORY MANAGEMENT
[Figure not reproduced: logical-address to linear-address translation — the 16-bit segment selector indexes a descriptor table; the base address from the segment descriptor is added to the 32-bit (or 64-bit) effective-address offset to form the linear address.]
If paging is not used, the processor maps the linear address directly to a physical address (that is, the linear
address goes out on the processor’s address bus). If the linear address space is paged, a second level of address
translation is used to translate the linear address into a physical address.
See also: Chapter 5, “Paging.”
[Figure not reproduced: segment selector format — bits 15:3 index, bit 2 TI (table indicator: 0 = GDT, 1 = LDT), bits 1:0 RPL (requested privilege level).]
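A small sketch (not from the manual) of decomposing a selector value into these fields:

#include <stdint.h>

struct selector_fields { uint16_t index; uint8_t ti; uint8_t rpl; };

static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.rpl   = sel & 0x3;        /* bits 1:0  requested privilege level */
    f.ti    = (sel >> 2) & 0x1; /* bit 2     0 = GDT, 1 = LDT */
    f.index = sel >> 3;         /* bits 15:3 descriptor index */
    return f;
}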
Vol. 3A 3-7
PROTECTED-MODE MEMORY MANAGEMENT
Every segment register has a “visible” part and a “hidden” part. (The hidden part is sometimes referred to as a
“descriptor cache” or a “shadow register.”) When a segment selector is loaded into the visible part of a segment
register, the processor also loads the hidden part of the segment register with the base address, segment limit, and
access control information from the segment descriptor pointed to by the segment selector. The information cached
in the segment register (visible and hidden) allows the processor to translate addresses without taking extra bus
cycles to read the base address and limit from the segment descriptor. In systems in which multiple processors
have access to the same descriptor tables, it is the responsibility of software to reload the segment registers when
the descriptor tables are modified. If this is not done, an old segment descriptor cached in a segment register might
be used after its memory-resident version has been modified.
Two kinds of load instructions are provided for loading the segment registers:
1. Direct load instructions such as the MOV, POP, LDS, LES, LSS, LGS, and LFS instructions. These instructions
explicitly reference the segment registers.
2. Implied load instructions such as the far pointer versions of the CALL, JMP, and RET instructions, the SYSENTER
and SYSEXIT instructions, and the IRET, INT n, INTO, INT3, and INT1 instructions. These instructions change
3-8 Vol. 3A
PROTECTED-MODE MEMORY MANAGEMENT
the contents of the CS register (and sometimes other segment registers) as an incidental part of their
operation.
The MOV instruction can also be used to store the visible part of a segment register in a general-purpose register.
Vol. 3A 3-9
PROTECTED-MODE MEMORY MANAGEMENT
[Figures not reproduced: the segment descriptor layout (base address bits 31:24 and 23:16, the G, D/B, L, and AVL flags, segment limit bits 19:16, and the P, DPL, S, and type fields in the upper doubleword; base address bits 15:0 and segment limit bits 15:0 in the lower doubleword), and the segment-descriptor format when the segment-present flag is clear (bits other than P, DPL, S, and type are available to software).]
G (granularity) flag
Determines the scaling of the segment limit field. When the granularity flag is clear, the segment
limit is interpreted in byte units; when the flag is set, the segment limit is interpreted in 4-KByte units.
(This flag does not affect the granularity of the base address; it is always byte granular.) When the
granularity flag is set, the twelve least significant bits of an offset are not tested when checking the
Vol. 3A 3-11
PROTECTED-MODE MEMORY MANAGEMENT
offset against the segment limit. For example, when the granularity flag is set, a limit of 0 results in
valid offsets from 0 to 4095.
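For illustration, the highest valid offset implied by the descriptor's 20-bit limit field and the G flag can be computed as in this sketch (not manual pseudocode):

#include <stdint.h>

/* 'limit20' is the 20-bit limit from the descriptor; returns the highest valid offset. */
static uint32_t effective_limit(uint32_t limit20, int g_flag)
{
    return g_flag ? (limit20 << 12) | 0xFFFu /* 4-KByte units; low 12 bits not checked */
                  : limit20;                 /* byte units */
}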
L (64-bit code segment) flag
In IA-32e mode, bit 21 of the second doubleword of the segment descriptor indicates whether a
code segment contains native 64-bit code. A value of 1 indicates instructions in this code segment
are executed in 64-bit mode. A value of 0 indicates the instructions in this code segment are
executed in compatibility mode. If the L-bit is set, then the D-bit must be cleared. Bit 21 is not used
outside IA-32e mode (or for data segments). Because an attempt to activate IA-32e mode will fault
if the current CS has the L-bit set (see Section 11.8.5), software operating outside IA-32e mode
should avoid loading CS from a descriptor that sets the L-bit.
Available and reserved bits
Bit 20 of the second doubleword of the segment descriptor is available for use by system software.
Stack segments are data segments which must be read/write segments. Loading the SS register with a segment
selector for a nonwritable data segment generates a general-protection exception (#GP). If the size of a stack
segment needs to be changed dynamically, the stack segment can be an expand-down data segment (expansion-
3-12 Vol. 3A
PROTECTED-MODE MEMORY MANAGEMENT
direction flag set). Here, dynamically changing the segment limit causes stack space to be added to the bottom of
the stack. If the size of a stack segment is intended to remain static, the stack segment may be either an expand-
up or expand-down type.
The accessed bit indicates whether the segment has been accessed since the last time the operating-system or
executive cleared the bit. The processor sets this bit whenever it loads a segment selector for the segment into a
segment register, assuming that the type of memory that contains the segment descriptor supports processor
writes. The bit remains set until explicitly cleared. This bit can be used both for virtual memory management and
for debugging.
For code segments, the three low-order bits of the type field are interpreted as accessed (A), read enable (R), and
conforming (C). Code segments can be execute-only or execute/read, depending on the setting of the read-enable
bit. An execute/read segment might be used when constants or other static data have been placed with instruction
code in a ROM. Here, data can be read from the code segment either by using an instruction with a CS override
prefix or by loading a segment selector for the code segment in a data-segment register (the DS, ES, FS, or GS
registers). In protected mode, code segments are not writable.
Code segments can be either conforming or nonconforming. A transfer of execution into a more-privileged
conforming segment allows execution to continue at the current privilege level. A transfer into a nonconforming
segment at a different privilege level results in a general-protection exception (#GP), unless a call gate or task gate
is used (see Section 6.8.1, “Direct Calls or Jumps to Code Segments,” for more information on conforming and
nonconforming code segments). System utilities that do not access protected facilities and handlers for some types
of exceptions (such as divide error or overflow) may be loaded in conforming code segments. Utilities that need to
be protected from less privileged programs and procedures should be placed in nonconforming code segments.
NOTE
Execution cannot be transferred by a call or a jump to a less-privileged (numerically higher
privilege level) code segment, regardless of whether the target segment is a conforming or
nonconforming code segment. Attempting such an execution transfer will result in a general-
protection exception.
All data segments are nonconforming, meaning that they cannot be accessed by less privileged programs or proce-
dures (code executing at numerically higher privilege levels). Unlike code segments, however, data segments can
be accessed by more privileged programs or procedures (code executing at numerically lower privilege levels)
without using a special access gate.
If the segment descriptors in the GDT or an LDT are placed in ROM, the processor can enter an indefinite loop if
software or the processor attempts to update (write to) the ROM-based segment descriptors. To prevent this
problem, set the accessed bits for all segment descriptors placed in a ROM. Also, remove operating-system or
executive code that attempts to modify segment descriptors located in ROM.
Vol. 3A 3-13
PROTECTED-MODE MEMORY MANAGEMENT
Table 3-2 shows the encoding of the type field for system-segment descriptors and gate descriptors. Note that
system descriptors in IA-32e mode are 16 bytes instead of 8 bytes.
See also: Section 3.5.1, “Segment Descriptor Tables,” and Section 9.2.2, “TSS Descriptor,” (for more information
on the system-segment descriptors); see Section 6.8.3, “Call Gates,” Section 7.11, “IDT Descriptors,” and Section
9.2.5, “Task-Gate Descriptor,” (for more information on the gate descriptors).
3-14 Vol. 3A
PROTECTED-MODE MEMORY MANAGEMENT
[Figure not reproduced: the global and local descriptor tables — the segment selector's TI bit selects either the GDT (TI = 0) or the current LDT (TI = 1); descriptors are located at 8-byte offsets (0, 8, 16, ... 56), and the first descriptor in the GDT is not used.]
Each system must have one GDT defined, which may be used for all programs and tasks in the system. Optionally,
one or more LDTs can be defined. For example, an LDT can be defined for each separate task being run, or some or
all tasks can share the same LDT.
The GDT is not a segment itself; instead, it is a data structure in linear address space. The base linear address and
limit of the GDT must be loaded into the GDTR register (see Section 2.4, “Memory-Management Registers”). The
base address of the GDT should be aligned on an eight-byte boundary to yield the best processor performance. The
limit value for the GDT is expressed in bytes. As with segments, the limit value is added to the base address to get
the address of the last valid byte. A limit value of 0 results in exactly one valid byte. Because segment descriptors
are always 8 bytes long, the GDT limit should always be one less than an integral multiple of eight (that is, 8N – 1).
The first descriptor in the GDT is not used by the processor. A segment selector to this “null descriptor” does not
generate an exception when loaded into a data-segment register (DS, ES, FS, or GS), but it always generates a
general-protection exception (#GP) when an attempt is made to access memory using the descriptor. By initializing
the segment registers with this segment selector, accidental reference to unused segment registers can be guar-
anteed to generate an exception.
The LDT is located in a system segment of the LDT type. The GDT must contain a segment descriptor for the LDT
segment. If the system supports multiple LDTs, each must have a separate segment selector and segment
descriptor in the GDT. The segment descriptor for an LDT can be located anywhere in the GDT. See Section 3.5,
“System Descriptor Types,” for information on the LDT segment-descriptor type.
An LDT is accessed with its segment selector. To eliminate address translations when accessing the LDT, the
segment selector, base linear address, limit, and access rights of the LDT are stored in the LDTR register (see
Section 2.4, “Memory-Management Registers”).
When the GDTR register is stored (using the SGDT instruction), a 48-bit “pseudo-descriptor” is stored in memory
(see top diagram in Figure 3-11). To avoid alignment check faults in user mode (privilege level 3), the pseudo-
descriptor should be located at an odd word address (that is, address MOD 4 is equal to 2). This causes the
Vol. 3A 3-15
PROTECTED-MODE MEMORY MANAGEMENT
processor to store an aligned word, followed by an aligned doubleword. User-mode programs normally do not store
pseudo-descriptors, but the possibility of generating an alignment check fault can be avoided by aligning pseudo-
descriptors in this way. The same alignment should be used when storing the IDTR register using the SIDT instruc-
tion. When storing the LDTR or task register (using the SLDT or STR instruction, respectively), the pseudo-
descriptor should be located at a doubleword address (that is, address MOD 4 is equal to 0).
[Figure 3-11 not reproduced: pseudo-descriptor formats — a 16-bit limit (bits 15:0) followed by a 32-bit base address (bits 47:16) in legacy modes, or by a 64-bit base address (bits 79:16) in IA-32e mode.]
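A hedged sketch of the pseudo-descriptor as a packed C structure, together with the 8N − 1 limit convention described above (structure names are illustrative):

#include <stdint.h>

/* Legacy pseudo-descriptor as stored and loaded by SGDT/LGDT. */
struct __attribute__((packed)) pseudo_descriptor32 {
    uint16_t limit; /* size of the table in bytes, minus 1 (8N - 1) */
    uint32_t base;  /* linear base address of the table */
};

/* IA-32e mode form: 16-bit limit followed by a 64-bit base address. */
struct __attribute__((packed)) pseudo_descriptor64 {
    uint16_t limit;
    uint64_t base;
};

/* Example: a GDT with 8 descriptors (entry 0 unused) has limit 8*8 - 1 = 63. */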
3-16 Vol. 3A
13. New Chapter 4, Volume 3A
Change bars and violet text show changes to a new Chapter 4 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.
------------------------------------------------------------------------------------------
This is a new chapter on linear-address pre-processing. It includes LASS and LAM and collects much of the
previous material on canonicality checking. Regarding the last point, it captures numerous fine points related to
5-level paging.
As described in Chapter 3, “Protected-Mode Memory Management‚” software accesses to memory typically use
logical addresses. The processor uses segmentation, as detailed in Section 3.4, to generate linear addresses from
logical addresses. Linear addresses are then translated to physical addresses using paging, as described in Chapter
5, “Paging.”
In IA-32e mode (if IA32_EFER.LMA = 1), linear addresses may undergo some pre-processing before being trans-
lated through paging.1 Some of this pre-processing is done only if enabled by software, but some occurs uncondi-
tionally. Specifically, linear addresses are subject to pre-processing in IA-32e mode as follows:
1. Linear-address-space separation (LASS). This is a feature that, when enabled by software, may limit the
linear addresses that are accessible by software, generating faults for accesses out of range.
2. Linear-address masking (LAM). This is a feature that, when enabled by software, masks certain linear-
address bits.
3. Canonicality checking. As will be detailed in Chapter 5, paging does not translate all 64 bits of a linear
address. Each linear address must be canonical, meaning that the untranslated bits have a fixed value.
Memory accesses using a non-canonical address generate faults.
Both LASS and canonicality checking can generate faults. For any specific memory access, the two features
generate the same fault. For that reason, the relative order of that checking is not defined and cannot be deter-
mined by software.
1. The presentation in this chapter focuses on 64-bit addresses. 32-bit and 16-bit addresses can also be used in IA-32e mode. For the
purposes of this chapter, the upper bits of such addresses (32 bits and 48 bits, respectively) are treated as if they were all zero.
2. The WRUSS instruction is an exception; although it can be executed only if CPL = 0, the processor treats its shadow-stack accesses
as user-mode accesses.
Some 64-bit operating systems partition the 64-bit linear-address space into a supervisor portion and a user
portion. Specifically, the upper half of the linear-address space (comprising addresses in which bit 63 is 1) is used
for supervisor instructions and data, while the lower half (addresses in which bit 63 is 0) is for user instructions and
data.
The LASS and LAM features are designed for operating systems that establish such linear-address-space parti-
tioning. However, the features are defined and may be used even if such partitioning is not in effect.
• A supervisor-mode data access causes a LASS violation if it would access a linear address of which bit 63 is 0,
supervisor-mode access protection is enabled (by setting CR4.SMAP), and either RFLAGS.AC = 0 or the access
is an implicit supervisor-mode access.
• A user-mode instruction fetch causes a LASS violation if it would fetch an instruction using a linear address of
which bit 63 is 1.
• A supervisor-mode instruction fetch causes a LASS violation if it would access a linear address of which bit 63
is 0. (Unlike paging, this behavior of LASS applies regardless of the setting of CR4.SMEP.)
LASS for instruction fetches applies when the linear address in RIP is used to load an instruction from memory.
Unlike canonicality checking (see Section 4.5.2), LASS does not apply to branch instructions that load RIP. A
branch instruction can load RIP with an address that would violate LASS. Only when the address is used to fetch an
instruction will a LASS violation occur, generating a #GP. (The return instruction pointer of the #GP handler is the
address that incurred the LASS violation.)
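The LASS checks described above can be summarized as a simple predicate. The following is an illustrative sketch only (not SDM pseudocode); the parameter names are hypothetical, and the user-mode data-access case, which is not reproduced in this excerpt, is assumed to mirror the user-mode fetch case.

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { ACCESS_DATA, ACCESS_FETCH } access_kind;

    /* Returns true if the access would cause a LASS violation (#GP). */
    static bool lass_violation(uint64_t linear_addr, bool user_mode,
                               access_kind kind, bool cr4_smap,
                               bool rflags_ac, bool implicit_supervisor)
    {
        bool upper_half = (linear_addr >> 63) & 1;    /* bit 63 of the address */

        if (user_mode)
            return upper_half;      /* user accesses must stay in the lower half */

        if (kind == ACCESS_FETCH)
            return !upper_half;     /* supervisor fetches must stay in the upper half */

        /* Supervisor data access to the lower half: faults only with SMAP
           enabled and either RFLAGS.AC = 0 or an implicit supervisor access. */
        return !upper_half && cr4_smap && (!rflags_ac || implicit_supervisor);
    }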
• If CR4.LAM_SUP = 1, LAM is enabled for supervisor addresses with a width determined by the paging mode
(see Section 5.1.1):
— If 4-level paging is enabled, LAM48 is enabled for supervisor addresses (a LAM width of 15).
— If 5-level paging is enabled, LAM57 is enabled for supervisor addresses (a LAM width of 6).
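Software that stores metadata in the masked bits can recover the original address by sign-extending from bit 47 (LAM48) or bit 56 (LAM57). The sketch below is an assumption about how software might strip such metadata; it is not an SDM-defined operation, and the function name is hypothetical.

    #include <stdint.h>

    /* Strip LAM metadata bits: bits 62:48 for LAM48 (width 15) or
       bits 62:57 for LAM57 (width 6), restoring a canonical address. */
    static inline uint64_t lam_strip(uint64_t ptr, int lam57)
    {
        unsigned top   = lam57 ? 56 : 47;     /* highest address bit kept */
        unsigned shift = 63 - top;
        /* The arithmetic right shift replicates bit `top` over the masked
           bits and bit 63. */
        return (uint64_t)(((int64_t)(ptr << shift)) >> shift);
    }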
NOTE
Section 4.5.2 and Section 4.5.3 discuss the canonicality checking performed by the WRMSR
instruction. The WRMSRLIST and WRMSRNS instructions perform the same canonicality checking in
corresponding situations. (Similarly, the characterization of RDMSR in Section 4.5.3 applies also to
RDMSRLIST.)
With a few exceptions, the processor ensures that the addresses in these registers are always canonical in the
following ways (an illustrative canonicality check appears after this list):
• Some instructions fault on attempts to load a linear-address register with a non-canonical address:
— An execution of the LGDT or LIDT instruction causes a general-protection exception (#GP) if the base
address specified in the instruction’s memory operand is not canonical.
— An execution of the LLDT or LTR instruction causes a #GP if the base address to be loaded from the GDT is
not canonical.
— An execution of WRFSBASE or WRGSBASE causes a #GP if it would load the base address of either FS or GS
with a non-canonical address.
— An execution of WRMSR causes a #GP if it would load any of the following MSRs with a non-canonical
address: IA32_BNDCFGS, IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
IA32_INTERRUPT_SSP_TABLE_ADDR, IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_PL0_SSP,
IA32_PL1_SSP, IA32_PL2_SSP, IA32_PL3_SSP, IA32_RTIT_ADDR0_A, IA32_RTIT_ADDR0_B,
IA32_RTIT_ADDR1_A, IA32_RTIT_ADDR1_B, IA32_RTIT_ADDR2_A, IA32_RTIT_ADDR2_B,
IA32_RTIT_ADDR3_A, IA32_RTIT_ADDR3_B, IA32_S_CET, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP,
IA32_UINTR_HANDLER, IA32_UINTR_PD, IA32_UINTR_STACKADJUST, IA32_U_CET, and
IA32_UINTR_TT.1
— An execution of XRSTORS causes a #GP if it would load any of the following MSRs with a non-canonical
address: IA32_PL0_SSP, IA32_PL1_SSP, IA32_PL2_SSP, IA32_PL3_SSP, IA32_RTIT_ADDR0_A,
IA32_RTIT_ADDR0_B, IA32_RTIT_ADDR1_A, IA32_RTIT_ADDR1_B, IA32_RTIT_ADDR2_A,
IA32_RTIT_ADDR2_B, IA32_RTIT_ADDR3_A, IA32_RTIT_ADDR3_B, IA32_U_CET,
IA32_UINTR_HANDLER, IA32_UINTR_PD, IA32_UINTR_STACKADJUST, or IA32_UINTR_TT.
With a small number of exceptions, this enforcement checks for CPU canonicality and is thus independent of the
current paging mode. Thus, a processor that supports 5-level paging will allow the instructions mentioned
above to load these registers with addresses that are 57-bit canonical but not 48-bit canonical, even if 4-level
paging is active. (As a result, instructions that store these values — SGDT, SIDT, SLDT, STR, RDFSBASE,
RDGSBASE, RDMSR, XSAVE, XSAVEC, XSAVEOPT, and XSAVES — may save addresses that are 57-bit canonical
but not 48-bit canonical, even if 4-level paging is active.)
The WRFSBASE and WRGSBASE instructions, which load the base address of FS and GS, respectively, operate
differently. An execution of either of these instructions causes a #GP if it would load a base address with an
address that is not paging canonical. Thus, if 4-level paging is active, these instructions do not allow loading of
addresses that are 57-bit canonical but not 48-bit canonical.
• The FXRSTOR, XRSTOR, and XRSTORS instructions ignore attempts to load some of these registers with non-
canonical addresses:
— Loads of FIP ignore any bits in the memory image beyond the enumerated maximum linear-address width.
The processor sign-extends the most significant bit (e.g., bit 56 on processors that support 5-level paging)
to ensure that FIP is always CPU canonical.
— Loads of BNDCFGU (by XRSTOR or XRSTORS) ignore any bits in the memory image beyond the enumerated
maximum linear-address width. The processor sign-extends the most significant bit to ensure that
BNDCFGU is always CPU canonical.
• Every non-control x87 instruction loads FIP. The value loaded is always paging canonical.
• CR2 can be loaded with the MOV to CR instruction. The instruction allows that register to be loaded with a non-
canonical address. The MOV from CR instruction will return for CR2 the value last loaded into that register by a
page fault or with the MOV to CR instruction, even if (for the latter case) the address is not canonical. Page
faults load CR2 only with linear addresses that are paging canonical.
• DR0 through DR3 can be loaded with the MOV to DR instruction. The instruction allows those registers to be
loaded with non-canonical addresses. The MOV from DR instruction will return for a debug register the value
1. Such canonicality checking may apply also when the WRMSR instruction is used to load some non-architec-
tural MSRs (not listed here) that hold a linear address.
last loaded into that register with the MOV to DR instruction, even if the address is not canonical. Breakpoint
address matching is supported only for linear addresses that are paging canonical.
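As a reference for the distinction drawn above between CPU canonicality (based on the enumerated maximum linear-address width) and paging canonicality (based on the active paging mode), the following sketch checks whether an address is canonical for a given width. It is illustrative only, not SDM pseudocode.

    #include <stdbool.h>
    #include <stdint.h>

    /* va_bits is 48 for 48-bit canonicality (4-level paging), 57 for 57-bit
       canonicality (5-level paging), or the CPU's enumerated maximum
       linear-address width for CPU canonicality. */
    static inline bool is_canonical(uint64_t addr, unsigned va_bits)
    {
        int64_t sign_extended =
            (int64_t)(addr << (64 - va_bits)) >> (64 - va_bits);
        return (uint64_t)sign_extended == addr;
    }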
1. INVVPID is a VMX instruction. In response to certain conditions, execution of a VMX instruction may fail, meaning that it does not
complete its normal operation. When a VMX instruction fails, control passes to the next instruction (rather than to a fault handler)
and a flag is set to report the failure.
14.Updates to Chapter 5, Volume 3A
Change bars and violet text show changes to Chapter 5 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.
------------------------------------------------------------------------------------------
Changes to this chapter:
• This is the paging chapter. It contains a few references to the new Chapter 4 (especially about LAM).
Chapter 3 explains how segmentation converts logical addresses to linear addresses. Paging (or linear-address
translation) is the process of translating linear addresses so that they can be used to access memory or I/O
devices. Paging translates each linear address to a physical address and determines, for each translation, what
accesses to the linear address are allowed (the address’s access rights) and the type of caching used for such
accesses (the address’s memory type).
Intel-64 processors support four different paging modes. These modes are identified and defined in Section 5.1.
Section 5.2 gives an overview of the translation mechanism that is used in all modes. Section 5.3, Section 5.4, and
Section 5.5 discuss the four paging modes in detail.
Section 5.6 details how paging determines and uses access rights. Section 5.7 discusses exceptions that may be
generated by paging (page-fault exceptions). Section 5.8 considers data which the processor writes in response to
linear-address accesses (accessed and dirty flags).
Section 5.9 describes how paging determines the memory types used for accesses to linear addresses. Section
5.10 provides details of how a processor may cache information about linear-address translation. Section 5.11
outlines interactions between paging and certain VMX features. Section 5.12 gives an overview of how paging can
be used to implement virtual memory.
• If CR4.PAE = 0, 32-bit paging is used. 32-bit paging is detailed in Section 5.3. 32-bit paging uses CR0.WP,
CR4.PSE, CR4.PGE, CR4.SMEP, CR4.SMAP, and CR4.CET as described in Section 5.1.3 and Section 5.6.
• If CR4.PAE = 1 and IA32_EFER.LME = 0, PAE paging is used. PAE paging is detailed in Section 5.4. PAE paging
uses CR0.WP, CR4.PGE, CR4.SMEP, CR4.SMAP, CR4.CET, and IA32_EFER.NXE as described in Section 5.1.3 and
Section 5.6.
• If CR4.PAE = 1, IA32_EFER.LME = 1, and CR4.LA57 = 0, 4-level paging1 is used.2 4-level paging is detailed
in Section 5.5 (along with 5-level paging). 4-level paging uses CR0.WP, CR4.PGE, CR4.PCIDE, CR4.SMEP,
CR4.SMAP, CR4.PKE, CR4.CET, CR4.PKS, and IA32_EFER.NXE as described in Section 5.1.3 and Section 5.6.
• If CR4.PAE = 1, IA32_EFER.LME = 1, and CR4.LA57 = 1, 5-level paging is used. 5-level paging is detailed in
Section 5.5 (along with 4-level paging). 5-level paging uses CR0.WP, CR4.PGE, CR4.PCIDE, CR4.SMEP,
CR4.SMAP, CR4.PKE, CR4.CET, CR4.PKS, and IA32_EFER.NXE as described in Section 5.1.3 and Section 5.6.
NOTE
32-bit paging and PAE paging can be used only in legacy protected mode (IA32_EFER.LME = 0). In
contrast, 4-level paging and 5-level paging can be used only in IA-32e mode (IA32_EFER.LME = 1).
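The mode selection just described reduces to a small decision tree. The sketch below is illustrative only (not SDM pseudocode), assumes CR0.PG = 1, and uses hypothetical type and function names.

    typedef enum { PAGING_32BIT, PAGING_PAE, PAGING_4LEVEL, PAGING_5LEVEL } paging_mode;

    /* Assumes paging is enabled (CR0.PG = 1). */
    static paging_mode current_paging_mode(int cr4_pae, int efer_lme, int cr4_la57)
    {
        if (!cr4_pae)
            return PAGING_32BIT;
        if (!efer_lme)
            return PAGING_PAE;
        return cr4_la57 ? PAGING_5LEVEL : PAGING_4LEVEL;
    }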
The four paging modes differ with regard to the following details:
• Linear-address width. The size of the linear addresses that can be translated.
• Physical-address width. The size of the physical addresses produced by paging.
• Page size. The granularity at which linear addresses are translated. Linear addresses on the same page are
translated to corresponding physical addresses on the same page.
• Support for execute-disable access rights. In some paging modes, software can be prevented from fetching
instructions from pages that are otherwise readable.
• Support for PCIDs. With 4-level paging and 5-level paging, software can enable a facility by which a logical
processor caches information for multiple linear-address spaces. The processor may retain cached information
when software switches between different linear-address spaces.
• Support for protection keys. With 4-level paging and 5-level paging, each linear address is associated with a
protection key. Software can use the protection-key rights registers to disable, for each protection key, certain
accesses to linear addresses associated with that protection key.
Table 5-1 illustrates the principal differences between the four paging modes.
Paging Mode | PG in CR0 | PAE in CR4 | LME in IA32_EFER | LA57 in CR4 | Lin.-Addr. Width | Phys.-Addr. Width [1] | Page Sizes | Supports Execute-Disable? | Supports PCIDs and protection keys?
32-bit  | 1 | 0 | 0 [2] | N/A | 32 | Up to 40 [3] | 4 KB, 4 MB [4]       | No      | No
PAE     | 1 | 1 | 0     | N/A | 32 | Up to 52     | 4 KB, 2 MB           | Yes [5] | No
4-level | 1 | 1 | 1     | 0   | 48 | Up to 52     | 4 KB, 2 MB, 1 GB [6] | Yes [5] | Yes [7]
1. Earlier versions of this manual used the term “IA-32e paging” to identify 4-level paging.
2. The LMA flag in the IA32_EFER MSR (bit 10) is a status bit that indicates whether the logical processor is in IA-32e mode (and thus
uses either 4-level paging or 5-level paging). The processor always sets IA32_EFER.LMA to CR0.PG & IA32_EFER.LME. Software can-
not directly modify IA32_EFER.LMA; an execution of WRMSR to the IA32_EFER MSR ignores bit 10 of its source operand.
Paging Mode | PG in CR0 | PAE in CR4 | LME in IA32_EFER | LA57 in CR4 | Lin.-Addr. Width | Phys.-Addr. Width [1] | Page Sizes | Supports Execute-Disable? | Supports PCIDs and protection keys?
5-level | 1 | 1 | 1     | 1   | 57 | Up to 52     | 4 KB, 2 MB, 1 GB [6] | Yes [5] | Yes [7]
NOTES:
1. The physical-address width is always bounded by MAXPHYADDR; see Section 5.1.4.
2. The processor ensures that IA32_EFER.LME must be 0 if CR0.PG = 1 and CR4.PAE = 0.
3. 32-bit paging supports physical-address widths of more than 32 bits only for 4-MByte pages and only if the PSE-36 mechanism is
supported; see Section 5.1.4 and Section 5.3.
4. 32-bit paging uses 4-MByte pages only if CR4.PSE = 1; see Section 5.3.
5. Execute-disable access rights are applied only if IA32_EFER.NXE = 1; see Section 5.6.
6. Processors that support 4-level paging or 5-level paging do not necessarily support 1-GByte pages; see Section 5.1.4.
7. PCIDs are used only if CR4.PCIDE = 1; see Section 5.10.1. Protection keys are used only if certain conditions hold; see Section 5.6.2.
Because 32-bit paging and PAE paging are used only in legacy protected mode and because legacy protected mode
cannot produce linear addresses larger than 32 bits, 32-bit paging and PAE paging translate 32-bit linear
addresses.
4-level paging and 5-level paging are used only in IA-32e mode. IA-32e mode has two sub-modes:
• Compatibility mode. This sub-mode uses only 32-bit linear addresses. In this sub-mode, 4-level paging and 5-
level paging treat bits 63:32 of such an address as all 0. These addresses are subject to linear-address pre-
processing, specifically linear-address-space separation (Section 4.3).
• 64-bit mode. This sub-mode produces 64-bit linear addresses. These addresses are then subject to linear-
address pre-processing (Chapter 4). As part of this, the processor enforces canonicality (Section 4.5),
ensuring that the upper bits of such an address are identical: bits 63:47 for 4-level paging and bits 63:56 for
5-level paging. 4-level paging (respectively, 5-level paging) does not use bits 63:48 (respectively, bits 63:57)
of such addresses.
1. If the logical processor is in 64-bit mode or if CR4.PCIDE = 1, an attempt to clear CR0.PG causes a general-protection exception
(#GP). Software should transition to compatibility mode and clear CR4.PCIDE before attempting to disable paging.
[Figure: enabling and changing paging modes. Transitions that set or clear CR0.PG and IA32_EFER.LME are shown between paging modes; disallowed transitions cause #GP. One of the states shown is no paging with PG = 0, PAE = 1, and LME = 1.]
• Software can transition between 32-bit paging and PAE paging by changing the value of CR4.PAE with MOV to
CR4.
• Software cannot transition directly between 4-level paging (or 5-level paging) and any other paging mode.
It must first disable paging (by clearing CR0.PG with MOV to CR0), then set CR4.PAE, IA32_EFER.LME, and
CR4.LA57 to the desired values (with MOV to CR4 and WRMSR), and then re-enable paging (by setting CR0.PG
with MOV to CR0). As noted earlier, an attempt to modify CR4.PAE, IA32_EFER.LME, or CR4.LA57 while 4-level
paging or 5-level paging is enabled causes a general-protection exception (#GP(0)).
• VMX transitions allow transitions between paging modes that are not possible using MOV to CR or WRMSR. This
is because VMX transitions can load CR0, CR4, and IA32_EFER in one operation. See Section 5.11.1.
CR0.WP allows pages to be protected from supervisor-mode writes. If CR0.WP = 0, supervisor-mode write
accesses are allowed to linear addresses with read-only access rights; if CR0.WP = 1, they are not. (User-mode
write accesses are never allowed to linear addresses with read-only access rights, regardless of the value of
CR0.WP.) Section 5.6 explains how access rights are determined, including the definition of supervisor-mode and
user-mode accesses.
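The effect of CR0.WP on writes to a read-only translation can be captured as follows. This sketch is illustrative only and ignores the other access-rights checks (SMAP, protection keys, shadow-stack pages) described in Section 5.6; the function name is hypothetical.

    #include <stdbool.h>

    /* Returns true if a write is permitted, considering only the R/W access
       right and CR0.WP. */
    static bool write_allowed(bool translation_writable, bool user_mode_access,
                              bool cr0_wp)
    {
        if (translation_writable)
            return true;
        /* Read-only translation: user-mode writes are never allowed;
           supervisor-mode writes are allowed only if CR0.WP = 0. */
        return !user_mode_access && !cr0_wp;
    }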
CR4.PSE enables 4-MByte pages for 32-bit paging. If CR4.PSE = 0, 32-bit paging can use only 4-KByte pages; if
CR4.PSE = 1, 32-bit paging can use both 4-KByte pages and 4-MByte pages. See Section 5.3 for more information.
(PAE paging, 4-level paging, and 5-level paging can use multiple page sizes regardless of the value of CR4.PSE.)
CR4.PGE enables global pages. If CR4.PGE = 0, no translations are shared across address spaces; if CR4.PGE = 1,
specified translations may be shared across address spaces. See Section 5.10.2.4 for more information.
CR4.PCIDE enables process-context identifiers (PCIDs) for 4-level paging and 5-level paging. PCIDs allow a logical
processor to cache information for multiple linear-address spaces. See Section 5.10.1 for more information.
CR4.SMEP allows pages to be protected from supervisor-mode instruction fetches. If CR4.SMEP = 1, software
operating in supervisor mode cannot fetch instructions from linear addresses that are accessible in user mode.
Section 5.6 explains how access rights are determined, including the definition of supervisor-mode accesses and
user-mode accessibility.
CR4.SMAP allows pages to be protected from supervisor-mode data accesses. If CR4.SMAP = 1, software oper-
ating in supervisor mode cannot access data at linear addresses that are accessible in user mode. Software can
override this protection by setting EFLAGS.AC. Section 5.6 explains how access rights are determined, including
the definition of supervisor-mode accesses and user-mode accessibility.
CR4.PKE and CR4.PKS enable specification of access rights based on protection keys. 4-level paging and 5-level
paging associate each linear address with a protection key. When CR4.PKE = 1, the PKRU register specifies, for
each protection key, whether user-mode linear addresses with that protection key can be read or written. When
CR4.PKS = 1, the IA32_PKRS MSR does the same for supervisor-mode linear addresses. See Section 5.6 for more
information.
CR4.CET enables control-flow enforcement technology, including the shadow-stack feature. If CR4.CET = 1,
certain memory accesses are identified as shadow-stack accesses and certain linear addresses translate to
shadow-stack pages. Section 5.6 explains how access rights are determined for these accesses and pages. (The
processor allows CR4.CET to be set only if CR0.WP is also set.)
IA32_EFER.NXE enables execute-disable access rights for PAE paging, 4-level paging, and 5-level paging. If
IA32_EFER.NXE = 1, instruction fetches can be prevented from specified linear addresses (even if data reads from
the addresses are allowed). Section 5.6 explains how access rights are determined. (IA32_EFER.NXE has no effect
with 32-bit paging. Software that wants to use this feature to limit instruction fetches from readable pages must
use PAE paging, 4-level paging, or 5-level paging.)
The “enable HLAT” VM-execution control enables HLAT paging for 4-level paging and 5-level paging. HLAT paging
does not use control register CR3 to identify the address of the first paging structure used for linear-address trans-
lation; instead, that structure is located using a field in the virtual-machine control structure (VMCS). In addition,
HLAT paging interprets certain bits in paging-structure entries differently than ordinary paging. See Section 5.5 for
details.
1. If HLAT paging is in use, a different mechanism is used to identify the first paging structure. See Section 5.5 for more information.
• If only 12 bits remain in the linear address, the current paging-structure entry always maps a page (bit 7 is
used for other purposes).
If a paging-structure entry maps a page when more than 12 bits remain in the linear address, the entry identifies
a page frame larger than 4 KBytes. For example, 32-bit paging uses the upper 10 bits of a linear address to locate
the first paging-structure entry; 22 bits remain. If that entry maps a page, the page frame is 2²² Bytes = 4 MBytes.
32-bit paging can use 4-MByte pages if CR4.PSE = 1. The other paging modes can use 2-MByte pages (regardless
of the value of CR4.PSE). 4-level paging and 5-level paging can use 1-GByte pages if the processor supports them
(see Section 5.1.4).
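As an example of a large-page translation, the following sketch forms the physical address of a byte on a 4-MByte page with 32-bit paging, using the PDE fields described in Table 5-4. It is illustrative only, assumes MAXPHYADDR is at least 40 when the PSE-36 mechanism is supported, and uses a hypothetical function name.

    #include <stdint.h>

    /* pde is a 32-bit PDE with PS = 1; la is the 32-bit linear address. */
    static uint64_t phys_addr_4mb_page(uint32_t pde, uint32_t la, int pse36)
    {
        uint64_t phys = (uint64_t)(pde & 0xFFC00000u)  /* bits 31:22 from PDE bits 31:22 */
                      | (la & 0x003FFFFFu);            /* bits 21:0 from the linear address */
        if (pse36)
            phys |= (uint64_t)((pde >> 13) & 0xFFu) << 32;  /* bits 39:32 from PDE bits 20:13 */
        return phys;
    }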
Paging structures are given different names based on their uses in the translation process. Table 5-2 gives the
names of the different paging structures. It also provides, for each structure, the source of the physical address
used to locate it (CR3 or a different paging-structure entry); the bits in the linear address used to select an entry
from the structure; and details of whether and how such an entry can map a page.
Paging Structure | Paging Mode | Entry Name | Physical Address of Structure | Bits Selecting Entry | Page Mapping
Page-directory-pointer table | 32-bit                | N/A   | N/A   | N/A   | N/A
Page-directory-pointer table | PAE                   | PDPTE | CR3   | 31:30 | N/A (PS must be 0)
Page-directory-pointer table | 4-level, 5-level      | PDPTE | PML4E | 38:30 | 1-GByte page if PS=1 [2]
Page table                   | 32-bit                | PTE   | PDE   | 21:12 | 4-KByte page
Page table                   | PAE, 4-level, 5-level | PTE   | PDE   | 20:12 | 4-KByte page
NOTES:
1. If HLAT paging is in use, a different mechanism is used to identify the first paging structure. See Section 5.5 for more information.
2. Not all processors support 1-GByte pages; see Section 5.1.4.
3. 32-bit paging ignores the PS flag in a PDE (and uses the entry to reference a page table) unless CR4.PSE = 1. Not all processors sup-
port 4-MByte pages with 32-bit paging; see Section 5.1.4.
1. Bits in the range 39:32 are 0 in any physical address used by 32-bit paging except those used to map 4-MByte pages. If the proces-
sor does not support the PSE-36 mechanism, this is true also for physical addresses used to map 4-MByte pages. If the processor
does support the PSE-36 mechanism and MAXPHYADDR < 40, bits in the range 39:MAXPHYADDR are 0 in any physical address used
to map a 4-MByte page. (The corresponding bits are reserved in PDEs.) See Section 5.1.4 for how to determine MAXPHYADDR and
whether the PSE-36 mechanism is supported.
2. The upper bits in the final physical address do not all come from corresponding positions in the PDE; the physical-address bits in the
PDE are not all contiguous.
[Figure: linear-address translation to a 4-KByte page with 32-bit paging. CR3 locates the page directory; bits 31:22 (Directory) select a PDE with PS=0, which locates the page table; bits 21:12 (Table) select a PTE, which locates the 4-KByte page; bits 11:0 (Offset) select the byte within the page.]
1. See Section 5.1.4 for how to determine MAXPHYADDR and whether the PSE-36 mechanism is supported.
2. See Section 5.1.4 for how to determine whether the PAT is supported.
[Figure: linear-address translation to a 4-MByte page with 32-bit paging. CR3 locates the page directory; bits 31:22 (Directory) select a PDE with PS=1, which provides the physical address of the 4-MByte page; bits 21:0 (Offset) select the byte within the page.]
Figure 5-4 gives a summary of the formats of CR3 and the paging-structure entries with 32-bit paging. For the
paging structure entries, it identifies separately the format of entries that map pages, those that reference other
paging structures, and those that do neither because they are “not present”; bit 0 (P) and bit 7 (PS) are high-
lighted because they determine how such an entry is used.
[Bit-field diagram: formats of CR3 and of 32-bit paging-structure entries (PDEs and PTEs that are present or not present); bit 0 (P) and bit 7 (PS) determine how each entry is used.]
Figure 5-4. Formats of CR3 and Paging-Structure Entries with 32-Bit Paging
NOTES:
1. CR3 has 64 bits on processors supporting the Intel-64 architecture. These bits are ignored with 32-bit paging.
2. This example illustrates a processor in which MAXPHYADDR is 36. If this value is larger or smaller, the number of bits reserved in
positions 20:13 of a PDE mapping a 4-MByte page will change.
Bit Position(s)    Contents
2:0 Ignored
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page directory during linear-
address translation (see Section 5.9)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page directory during linear-
address translation (see Section 5.9)
11:5 Ignored
31:12 Physical address of the 4-KByte aligned page directory used for linear-address translation
63:32 Ignored (these bits exist only on processors supporting the Intel-64 architecture)
Table 5-4. Format of a 32-Bit Page-Directory Entry that Maps a 4-MByte Page
Bit Position(s)    Contents
1 (R/W) Read/write; if 0, writes may not be allowed to the 4-MByte page referenced by this entry (see Section 5.6)
2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 4-MByte page referenced by this entry (see Section
5.6)
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 4-MByte page referenced by
this entry (see Section 5.9)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 4-MByte page referenced by
this entry (see Section 5.9)
5 (A) Accessed; indicates whether software has accessed the 4-MByte page referenced by this entry (see Section 5.8)
6 (D) Dirty; indicates whether software has written to the 4-MByte page referenced by this entry (see Section 5.8)
7 (PS) Page size; must be 1 (otherwise, this entry references a page table; see Table 5-5)
8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise
11:9 Ignored
12 (PAT) If the PAT is supported, indirectly determines the memory type used to access the 4-MByte page referenced by this
entry (see Section 5.9.2); otherwise, reserved (must be 0)1
(M–20):13 Bits (M–1):32 of physical address of the 4-MByte page referenced by this entry2
31:22 Bits 31:22 of physical address of the 4-MByte page referenced by this entry
NOTES:
1. See Section 5.1.4 for how to determine whether the PAT is supported.
2. If the PSE-36 mechanism is not supported, M is 32, and this row does not apply. If the PSE-36 mechanism is supported, M is the min-
imum of 40 and MAXPHYADDR (this row does not apply if MAXPHYADDR = 32). See Section 5.1.4 for how to determine MAXPHY-
ADDR and whether the PSE-36 mechanism is supported.
Table 5-5. Format of a 32-Bit Page-Directory Entry that References a Page Table
Bit Position(s)    Contents
1 (R/W) Read/write; if 0, writes may not be allowed to the 4-MByte region controlled by this entry (see Section 5.6)
2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 4-MByte region controlled by this entry (see Section
5.6)
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page table referenced by this
entry (see Section 5.9)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page table referenced by this
entry (see Section 5.9)
5 (A) Accessed; indicates whether this entry has been used for linear-address translation (see Section 5.8)
6 Ignored
7 (PS) If CR4.PSE = 1, must be 0 (otherwise, this entry maps a 4-MByte page; see Table 5-4); otherwise, ignored
11:8 Ignored
31:12 Physical address of 4-KByte aligned page table referenced by this entry
Table 5-6. Format of a 32-Bit Page-Table Entry that Maps a 4-KByte Page
Bit Position(s)    Contents
1 (R/W) Read/write; if 0, writes may not be allowed to the 4-KByte page referenced by this entry (see Section 5.6)
2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 4-KByte page referenced by this entry (see Section
5.6)
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 4-KByte page referenced by this
entry (see Section 5.9)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 4-KByte page referenced by this
entry (see Section 5.9)
5 (A) Accessed; indicates whether software has accessed the 4-KByte page referenced by this entry (see Section 5.8)
6 (D) Dirty; indicates whether software has written to the 4-KByte page referenced by this entry (see Section 5.8)
7 (PAT) If the PAT is supported, indirectly determines the memory type used to access the 4-KByte page referenced by this
entry (see Section 5.9.2); otherwise, reserved (must be 0)1
8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise
11:9 Ignored
NOTES:
1. See Section 5.1.4 for how to determine whether the PAT is supported.
Bit Position(s)    Contents
4:0 Ignored
31:5 Physical address of the 32-Byte aligned page-directory-pointer table used for linear-address translation
63:32 Ignored (these bits exist only on processors supporting the Intel-64 architecture)
The page-directory-pointer-table comprises four (4) 64-bit entries called PDPTEs. Each PDPTE controls access to a
1-GByte region of the linear-address space. Corresponding to the PDPTEs, the logical processor maintains a set of
four (4) internal, non-architectural PDPTE registers, called PDPTE0, PDPTE1, PDPTE2, and PDPTE3. The logical
processor loads these registers from the PDPTEs in memory as part of certain operations:
• If PAE paging would be in use following an execution of MOV to CR0 or MOV to CR4 (see Section 5.1.1) and the
instruction is modifying any of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then the
PDPTEs are loaded from the address in CR3.
• If MOV to CR3 is executed while the logical processor is using PAE paging, the PDPTEs are loaded from the
address being loaded into CR3.
• If PAE paging is in use and a task switch changes the value of CR3, the PDPTEs are loaded from the address in
the new CR3 value.
• Certain VMX transitions load the PDPTE registers. See Section 5.11.1.
Table 5-8 gives the format of a PDPTE. If any of the PDPTEs sets both the P flag (bit 0) and any reserved bit, the
MOV to CR instruction causes a general-protection exception (#GP(0)) and the PDPTEs are not loaded.2 As shown
in Table 5-8, bits 2:1, 8:5, and 63:MAXPHYADDR are reserved in the PDPTEs.
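The reserved-bit check on the PDPTEs can be expressed as follows. This is an illustrative sketch only (not SDM pseudocode); the function name is hypothetical, MAXPHYADDR is assumed to have been determined as described in Section 5.1.4, and, as the footnote below notes, some processors apply the check even when the P flag is 0.

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true if MOV to CR would fault (#GP(0)) because this PDPTE has
       P = 1 and at least one reserved bit (2:1, 8:5, or 63:MAXPHYADDR) set. */
    static bool pae_pdpte_faults(uint64_t pdpte, unsigned maxphyaddr)
    {
        uint64_t reserved = (0x3ull << 1) | (0xFull << 5);   /* bits 2:1 and 8:5 */
        reserved |= ~((1ull << maxphyaddr) - 1);             /* bits 63:MAXPHYADDR */
        return (pdpte & 1) && (pdpte & reserved);
    }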
1. If MAXPHYADDR < 52, bits in the range 51:MAXPHYADDR will be 0 in any physical address used by PAE paging. (The corresponding
bits are reserved in the paging-structure entries.) See Section 5.1.4 for how to determine MAXPHYADDR.
2. On some processors, reserved bits are checked even in PDPTEs in which the P flag (bit 0) is 0.
Bit Position(s)    Contents
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page directory referenced by
this entry (see Section 5.9)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page directory referenced by
this entry (see Section 5.9)
11:9 Ignored
(M–1):12 Physical address of 4-KByte aligned page directory referenced by this entry1
NOTES:
1. M is an abbreviation for MAXPHYADDR, which is at most 52; see Section 5.1.4.
1. With PAE paging, the processor does not use CR3 when translating a linear address (as it does in the other paging modes). It does
not access the PDPTEs in the page-directory-pointer table during linear-address translation.
[Figure: linear-address translation to a 4-KByte page with PAE paging. Bits 31:30 (Directory Pointer) select one of the four PDPTE registers; bits 29:21 (Directory) select a PDE with PS=0; bits 20:12 (Table) select a PTE; bits 11:0 (Offset) select the byte within the page.]
1. See Section 5.1.4 for how to determine whether the PAT is supported.
[Figure: linear-address translation to a 2-MByte page with PAE paging. Bits 31:30 (Directory Pointer) select a PDPTE register; bits 29:21 (Directory) select a PDE with PS=1; bits 20:0 (Offset) select the byte within the 2-MByte page.]
Table 5-9. Format of a PAE Page-Directory Entry that Maps a 2-MByte Page
Bit Position(s)    Contents
1 (R/W) Read/write; if 0, writes may not be allowed to the 2-MByte page referenced by this entry (see Section 5.6)
2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte page referenced by this entry (see Section
5.6)
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 2-MByte page referenced by
this entry (see Section 5.9)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 2-MByte page referenced by this
entry (see Section 5.9)
5 (A) Accessed; indicates whether software has accessed the 2-MByte page referenced by this entry (see Section 5.8)
6 (D) Dirty; indicates whether software has written to the 2-MByte page referenced by this entry (see Section 5.8)
7 (PS) Page size; must be 1 (otherwise, this entry references a page table; see Table 5-10)
8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise
11:9 Ignored
12 (PAT) If the PAT is supported, indirectly determines the memory type used to access the 2-MByte page referenced by this
entry (see Section 5.9.2); otherwise, reserved (must be 0)1
63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte page controlled by
this entry; see Section 5.6); otherwise, reserved (must be 0)
NOTES:
1. See Section 5.1.4 for how to determine whether the PAT is supported.
Table 5-10. Format of a PAE Page-Directory Entry that References a Page Table
Bit Position(s)    Contents
1 (R/W) Read/write; if 0, writes may not be allowed to the 2-MByte region controlled by this entry (see Section 5.6)
2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte region controlled by this entry (see
Section 5.6)
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page table referenced by this
entry (see Section 5.9)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page table referenced by this
entry (see Section 5.9)
5 (A) Accessed; indicates whether this entry has been used for linear-address translation (see Section 5.8)
6 Ignored
7 (PS) Page size; must be 0 (otherwise, this entry maps a 2-MByte page; see Table 5-9)
11:8 Ignored
(M–1):12 Physical address of 4-KByte aligned page table referenced by this entry
63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte region controlled
by this entry; see Section 5.6); otherwise, reserved (must be 0)
Table 5-11. Format of a PAE Page-Table Entry that Maps a 4-KByte Page
Bit Position(s)    Contents
1 (R/W) Read/write; if 0, writes may not be allowed to the 4-KByte page referenced by this entry (see Section 5.6)
2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 4-KByte page referenced by this entry (see Section
5.6)
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 4-KByte page referenced by
this entry (see Section 5.9)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 4-KByte page referenced by this
entry (see Section 5.9)
5 (A) Accessed; indicates whether software has accessed the 4-KByte page referenced by this entry (see Section 5.8)
6 (D) Dirty; indicates whether software has written to the 4-KByte page referenced by this entry (see Section 5.8)
7 (PAT) If the PAT is supported, indirectly determines the memory type used to access the 4-KByte page referenced by this
entry (see Section 5.9.2); otherwise, reserved (must be 0)1
8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise
Table 5-11. Format of a PAE Page-Table Entry that Maps a 4-KByte Page (Contd.)
Bit Position(s)    Contents
11:9 Ignored
63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 4-KByte page controlled by
this entry; see Section 5.6); otherwise, reserved (must be 0)
NOTES:
1. See Section 5.1.4 for how to determine whether the PAT is supported.
Figure 5-7 gives a summary of the formats of CR3 and the paging-structure entries with PAE paging. For the paging
structure entries, it identifies separately the format of entries that map pages, those that reference other paging
structures, and those that do neither because they are “not present”; bit 0 (P) and bit 7 (PS) are highlighted
because they determine how a paging-structure entry is used.
[Bit-field diagram: formats of CR3 and of PAE paging-structure entries (PDPTEs, PDEs, and PTEs that are present or not present); bit 0 (P) and bit 7 (PS) determine how each entry is used.]
Figure 5-7. Formats of CR3 and Paging-Structure Entries with PAE Paging
NOTES:
1. M is an abbreviation for MAXPHYADDR.
2. CR3 has 64 bits only on processors supporting the Intel-64 architecture. These bits are ignored with PAE paging.
3. Reserved fields must be 0.
4. If IA32_EFER.NXE = 0 and the P flag of a PDE or a PTE is 1, the XD flag (bit 63) is reserved.
5.5.2 Use of CR3 with Ordinary 4-Level Paging and 5-Level Paging
Ordinary 4-level paging and 5-level paging each translate linear addresses using a hierarchy of in-memory paging
structures. The contents of CR3 locate the first paging structure: for 4-level paging, this is the PML4 table; for
5-level paging, it is the PML5 table. Use of CR3 with 4-level paging and 5-level paging depends on whether
process-context identifiers (PCIDs) have been enabled by setting CR4.PCIDE:
• Table 5-12 illustrates how CR3 is used with 4-level paging and 5-level paging if CR4.PCIDE = 0.
Table 5-12. Use of CR3 with 4-Level Paging and 5-level Paging and CR4.PCIDE = 0
Bit Position(s)    Contents
2:0 Ignored
1. If MAXPHYADDR < 52, bits in the range 51:MAXPHYADDR will be 0 in any physical address used by 4-level paging. (The correspond-
ing bits are reserved in the paging-structure entries.) See Section 5.1.4 for how to determine MAXPHYADDR.
2. HLAT paging is used only with 4-level paging and 5-level paging. It is never used with 32-bit paging or PAE paging, regardless of the
value of the “enable HLAT” VM-execution control.
3. This behavior applies if the CPU enumerates a maximum HLAT prefix size of 1 in IA32_VMX_EPT_VPID_CAP[53:48] (see Appendix
A.10). Behavior when a different value is enumerated is not currently defined.
Table 5-12. Use of CR3 with 4-Level Paging and 5-level Paging and CR4.PCIDE = 0 (Contd.)
Bit Position(s)    Contents
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the PML4 table or PML5 table
during linear-address translation (see Section 5.9.2)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the PML4 table or PML5 table during
linear-address translation (see Section 5.9.2)
11:5 Ignored
M–1:12 Physical address of the 4-KByte aligned PML4 table or PML5 table used for linear-address translation1
NOTES:
1. M is an abbreviation for MAXPHYADDR, which is at most 52; see Section 5.1.4.
2. LAM is not a paging feature.
3. See Section 5.10.4.1 for use of bit 63 of the source operand of the MOV to CR3 instruction.
• Table 5-13 illustrates how CR3 is used with 4-level paging and 5-level paging if CR4.PCIDE = 1.
Table 5-13. Use of CR3 with 4-Level Paging and 5-Level Paging and CR4.PCIDE = 1
Bit Position(s)    Contents
M–1:12 Physical address of the 4-KByte aligned PML4 table used for linear-address translation2
NOTES:
1. Section 5.9.2 explains how the processor determines the memory type used to access the PML4 table during linear-address transla-
tion with CR4.PCIDE = 1.
2. M is an abbreviation for MAXPHYADDR, which is at most 52; see Section 5.1.4.
3. LAM is not a paging feature.
4. See Section 5.10.4.1 for use of bit 63 of the source operand of the MOV to CR3 instruction.
After software modifies the value of CR4.PCIDE, the logical processor immediately begins using CR3 as specified
for the new value. For example, if software changes CR4.PCIDE from 1 to 0, the current PCID immediately changes
from CR3[11:0] to 000H (see also Section 5.10.4.1). In addition, the logical processor subsequently determines
the memory type used to access the PML4 table using CR3.PWT and CR3.PCD, which had been bits 4:3 of the PCID.
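The two interpretations of CR3 can be summarized as follows. This sketch is illustrative only; the function names are hypothetical.

    #include <stdint.h>

    /* Physical address of the PML4/PML5 table: bits M-1:12 of CR3. */
    static uint64_t cr3_table_base(uint64_t cr3, unsigned maxphyaddr)
    {
        return cr3 & (((1ull << maxphyaddr) - 1) & ~0xFFFull);
    }

    /* Current PCID: CR3[11:0] if CR4.PCIDE = 1, otherwise 000H. */
    static unsigned current_pcid(uint64_t cr3, int cr4_pcide)
    {
        return cr4_pcide ? (unsigned)(cr3 & 0xFFF) : 0;
    }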
5.5.3 Use of HLATP with HLAT 4-Level Paging and 5-Level Paging
With HLAT paging, 4-level paging and 5-level paging each translate linear addresses using a hierarchy of in-
memory paging structures located using the value of HLATP (a VM-execution control field in the VMCS), which is
used to locate the first paging structure. For 4-level paging, this is the PML4 table, and for 5-level paging it is the
PML5 table.
HLATP has the same format as that given for CR3 in Table 5-12, with the exception that bits 2:0 and bits 11:5 are
reserved and must be zero (these bits are checked by VM entry). HLATP does not contain a PCID value. HLAT
paging with CR4.PCIDE = 1 uses the PCID value in CR3[11:0].
[Figure: linear-address translation to a 4-KByte page with 4-level paging. CR3 or HLATP locates the PML4 table; bits 47:39 (PML4) select a PML4E; bits 38:30 (Directory Ptr) select a PDPTE; bits 29:21 (Directory) select a PDE with PS=0; bits 20:12 (Table) select a PTE; bits 11:0 (Offset) select the byte within the page.]
[Figure: linear-address translation to a 2-MByte page with 4-level paging. CR3 or HLATP locates the PML4 table; bits 47:39 select a PML4E; bits 38:30 select a PDPTE; bits 29:21 select a PDE with PS=1; bits 20:0 (Offset) select the byte within the 2-MByte page.]
[Figure: linear-address translation to a 1-GByte page with 4-level paging. CR3 or HLATP locates the PML4 table; bits 47:39 select a PML4E; bits 38:30 select a PDPTE with PS=1; bits 29:0 (Offset) select the byte within the 1-GByte page.]
4-level paging and 5-level paging associate with each linear address a protection key. Section 5.6 explains how
the processor uses the protection key in its determination of the access rights of each linear address.
The remainder of this section describes the translation process used by 4-level paging and 5-level paging in more
detail, as well as how the page size and protection key are determined. Because the process used by the two
paging modes is similar, they are described together, with any differences identified, in the following items (an
illustrative sketch of the index computation follows the list):
• With 5-level paging, a 4-KByte naturally aligned PML5 table is located at the physical address specified in
bits 51:12 of CR3 (see Table 5-12). (4-level paging does not use a PML5 table and omits this step.) A PML5
table comprises 512 64-bit entries (PML5Es). A PML5E is selected using the physical address defined as follows:
— Bits 51:12 are from CR3 or HLATP.
— Bits 11:3 are bits 56:48 of the linear address.
— Bits 2:0 are all 0.
Because a PML5E is identified using bits 56:48 of the linear address, it controls access to a 256-TByte region of
the linear-address space.
With HLAT paging, if bit 11 of the PML5E is 1, translation is restarted with ordinary paging with a maximum
page size of 256-TBytes (see Section 5.5.5). Otherwise, the translation process continues as described in the
next item.
• A 4-KByte naturally aligned PML4 table is located at the physical address specified in bits 51:12 of CR3 (for 4-
level paging; see Table 5-12) or in bits 51:12 of the PML5E (for 5-level paging; see Table 5-14). A PML4 table
comprises 512 64-bit entries (PML4Es). A PML4E is selected using the physical address defined as follows:
— Bits 51:12 are from CR3 or HLATP (for 4-level paging) or from the PML5E (for 5-level paging).
— Bits 11:3 are bits 47:39 of the linear address.
— Bits 2:0 are all 0.
Because a PML4E is identified using bits 47:39 of the linear address, it controls access to a 512-GByte region of
the linear-address space.
With HLAT paging, if bit 11 of the PML4E is 1, translation is restarted with ordinary paging with a maximum
page size of 512-GBytes (see Section 5.5.5). Otherwise, the translation process continues as described in the
next item.
• A 4-KByte naturally aligned page-directory-pointer table is located at the physical address specified in
bits 51:12 of the PML4E (see Table 5-15). A page-directory-pointer table comprises 512 64-bit entries
(PDPTEs). A PDPTE is selected using the physical address defined as follows:
— Bits 51:12 are from the PML4E.
— Bits 11:3 are bits 38:30 of the linear address.
— Bits 2:0 are all 0.
Because a PDPTE is identified using bits 47:30 of the linear address, it controls access to a 1-GByte region of the
linear-address space.
With HLAT paging, if bit 11 of the PDPTE is 1, translation is restarted with ordinary paging with a maximum page
size of 1-GByte (see Section 5.5.5). Otherwise, the translation process continues as described below.
Use of the PDPTE depends on its PS flag (bit 7):1
• If the PDPTE’s PS flag is 1, the PDPTE maps a 1-GByte page (see Table 5-16). The final physical address is
computed as follows:
— Bits 51:30 are from the PDPTE.
— Bits 29:0 are from the original linear address.
The linear address’s protection key is the value of bits 62:59 of the PDPTE (see Section 5.6.2).
1. The PS flag of a PDPTE is reserved and must be 0 (if the P flag is 1) if 1-GByte pages are not supported. See Section 5.1.4 for how
to determine whether 1-GByte pages are supported.
• If the PDPTE’s PS flag is 0, a 4-KByte naturally aligned page directory is located at the physical address
specified in bits 51:12 of the PDPTE (see Table 5-17). A page directory comprises 512 64-bit entries (PDEs). A
PDE is selected using the physical address defined as follows:
— Bits 51:12 are from the PDPTE.
— Bits 11:3 are bits 29:21 of the linear address.
— Bits 2:0 are all 0.
Because a PDE is identified using bits 47:21 of the linear address, it controls access to a 2-MByte region of the
linear-address space.
With HLAT paging, if bit 11 of the PDE is 1, translation is restarted with ordinary paging with a maximum page size
of 2-MBytes (see Section 5.5.5). Otherwise, the translation process continues as described below.
Use of the PDE depends on its PS flag:
• If the PDE's PS flag is 1, the PDE maps a 2-MByte page (see Table 5-18). The final physical address is computed
as follows:
— Bits 51:21 are from the PDE.
— Bits 20:0 are from the original linear address.
The linear address’s protection key is the value of bits 62:59 of the PDE (see Section 5.6.2).
• If the PDE’s PS flag is 0, a 4-KByte naturally aligned page table is located at the physical address specified in
bits 51:12 of the PDE (see Table 5-19). A page table comprises 512 64-bit entries (PTEs). A PTE is selected
using the physical address defined as follows:
— Bits 51:12 are from the PDE.
— Bits 11:3 are bits 20:12 of the linear address.
— Bits 2:0 are all 0.
Because a PTE is identified using bits 47:12 of the linear address, every PTE maps a 4-KByte page (see
Table 5-20).
With HLAT paging, if bit 11 of the PTE is 1, translation is restarted with ordinary paging with a maximum page
size of 4-KBytes (see Section 5.5.5). Otherwise, the final physical address is computed as follows:
— Bits 51:12 are from the PTE.
— Bits 11:0 are from the original linear address.
The linear address’s protection key is the value of bits 62:59 of the PTE (see Section 5.6.2).
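The index computation used in the items above can be sketched as follows for 4-level paging; with 5-level paging, bits 56:48 additionally select a PML5E. This is an illustrative sketch only, not SDM pseudocode, and the names are hypothetical.

    #include <stdint.h>

    struct walk_indices {
        unsigned pml4;    /* bits 47:39 of the linear address */
        unsigned pdpt;    /* bits 38:30 */
        unsigned pd;      /* bits 29:21 */
        unsigned pt;      /* bits 20:12 */
        unsigned offset;  /* bits 11:0  */
    };

    static struct walk_indices split_4level(uint64_t la)
    {
        struct walk_indices w;
        w.pml4   = (unsigned)((la >> 39) & 0x1FF);
        w.pdpt   = (unsigned)((la >> 30) & 0x1FF);
        w.pd     = (unsigned)((la >> 21) & 0x1FF);
        w.pt     = (unsigned)((la >> 12) & 0x1FF);
        w.offset = (unsigned)( la        & 0xFFF);
        return w;
    }

    /* Physical address of a selected entry: bits 51:12 from CR3/HLATP or the
       previous entry, bits 11:3 from the index, bits 2:0 zero. */
    static uint64_t entry_phys_addr(uint64_t table_bits_51_12, unsigned index)
    {
        return (table_bits_51_12 & 0x000FFFFFFFFFF000ull) | ((uint64_t)index << 3);
    }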
If a paging-structure entry’s P flag (bit 0) is 0 or if the entry sets any reserved bit, the entry is used neither to refer-
ence another paging-structure entry nor to map a page. There is no translation for a linear address whose transla-
tion would use such a paging-structure entry; a reference to such a linear address causes a page-fault exception
(see Section 5.7).
The following bits in a paging-structure entry are reserved with 4-level paging and 5-level paging (assuming that
the entry’s P flag is 1):
• Bits 51:MAXPHYADDR are reserved in every paging-structure entry.
• The PS flag is reserved in a PML5E or a PML4E.
• If 1-GByte pages are not supported, the PS flag is reserved in a PDPTE.1
• If the PS flag in a PDPTE is 1, bits 29:13 of the entry are reserved.
• If the PS flag in a PDE is 1, bits 20:13 of the entry are reserved.
• If IA32_EFER.NXE = 0, the XD flag (bit 63) is reserved in every paging-structure entry.
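These reserved-bit rules can be collected into a mask. The sketch below is illustrative only (not SDM pseudocode); `level` is 1 for a PTE up to 5 for a PML5E, and the names are hypothetical.

    #include <stdbool.h>
    #include <stdint.h>

    /* Reserved bits in a present 4-level/5-level paging-structure entry that
       references another structure or maps a 4-KByte page. */
    static uint64_t reserved_mask(int level, unsigned maxphyaddr,
                                  bool gbyte_pages, bool nxe)
    {
        uint64_t m = ((1ull << 52) - 1) & ~((1ull << maxphyaddr) - 1); /* 51:MAXPHYADDR */
        if (level >= 4)
            m |= 1ull << 7;              /* PS reserved in PML4Es and PML5Es */
        if (level == 3 && !gbyte_pages)
            m |= 1ull << 7;              /* PS reserved if 1-GByte pages unsupported */
        if (!nxe)
            m |= 1ull << 63;             /* XD reserved if IA32_EFER.NXE = 0 */
        return m;
    }

    /* Additional reserved bits when the entry maps a large page (PS = 1). */
    static uint64_t large_page_reserved_mask(int level)
    {
        if (level == 3) return 0x3FFFEull << 12;   /* PDPTE: bits 29:13 */
        if (level == 2) return 0xFFull   << 13;    /* PDE:   bits 20:13 */
        return 0;
    }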
A reference using a linear address that is successfully translated to a physical address is performed only if allowed
by the access rights of the translation; see Section 5.6.
1. See Section 5.1.4 for how to determine whether 1-GByte pages are supported.
Figure 5-11 gives a summary of the formats of CR3 and the 4-level and 5-level paging-structure entries. For the
paging structure entries, it identifies separately the format of entries that map pages, those that reference other
paging structures, and those that do neither because they are “not present”; bit 0 (P) and bit 7 (PS) are highlighted
because they determine how a paging-structure entry is used.
Table 5-14. Format of a PML5 Entry (PML5E) that References a PML4 Table
Bit Position(s)    Contents
1 (R/W) Read/write; if 0, writes may not be allowed to the 256-TByte region controlled by this entry (see Section 5.6)
2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 256-TByte region controlled by this entry (see
Section 5.6)
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the PML4 table referenced by this
entry (see Section 5.9.2)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the PML4 table referenced by this
entry (see Section 5.9.2)
5 (A) Accessed; indicates whether this entry has been used for linear-address translation (see Section 5.8)
6 Ignored
10:8 Ignored
11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)
M–1:12 Physical address of 4-KByte aligned PML4 table referenced by this entry
62:52 Ignored
63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 256-TByte region
controlled by this entry; see Section 5.6); otherwise, reserved (must be 0)
Table 5-15. Format of a PML4 Entry (PML4E) that References a Page-Directory-Pointer Table
Bit Position(s)    Contents
1 (R/W) Read/write; if 0, writes may not be allowed to the 512-GByte region controlled by this entry (see Section 5.6)
2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 512-GByte region controlled by this entry (see
Section 5.6)
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page-directory-pointer table
referenced by this entry (see Section 5.9.2)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page-directory-pointer table
referenced by this entry (see Section 5.9.2)
5 (A) Accessed; indicates whether this entry has been used for linear-address translation (see Section 5.8)
6 Ignored
Table 5-15. Format of a PML4 Entry (PML4E) that References a Page-Directory-Pointer Table (Contd.)
Bit Position(s)    Contents
10:8 Ignored
11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)
M–1:12 Physical address of 4-KByte aligned page-directory-pointer table referenced by this entry
62:52 Ignored
63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 512-GByte region
controlled by this entry; see Section 5.6); otherwise, reserved (must be 0)
Table 5-16. Format of a Page-Directory-Pointer-Table Entry (PDPTE) that Maps a 1-GByte Page
Bit Position(s)    Contents
1 (R/W) Read/write; if 0, writes may not be allowed to the 1-GByte page referenced by this entry (see Section 5.6)
2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 1-GByte page referenced by this entry (see Section
5.6)
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 1-GByte page referenced by this
entry (see Section 5.9.2)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 1-GByte page referenced by this
entry (see Section 5.9.2)
5 (A) Accessed; indicates whether software has accessed the 1-GByte page referenced by this entry (see Section 5.8)
6 (D) Dirty; indicates whether software has written to the 1-GByte page referenced by this entry (see Section 5.8)
7 (PS) Page size; must be 1 (otherwise, this entry references a page directory; see Table 5-17)
8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise
10:9 Ignored
11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)
12 (PAT) Indirectly determines the memory type used to access the 1-GByte page referenced by this entry (see Section
5.9.2)1
Table 5-16. Format of a Page-Directory-Pointer-Table Entry (PDPTE) that Maps a 1-GByte Page (Contd.)
Bit Position(s)    Contents
58:52 Ignored
62:59 Protection key; if CR4.PKE = 1 or CR4.PKS = 1, this may control the page’s access rights (see Section 5.6.2); otherwise,
it is ignored and not used to control access rights.
63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 1-GByte page controlled by
this entry; see Section 5.6); otherwise, reserved (must be 0)
NOTES:
1. The PAT is supported on all processors that support 4-level paging.
Table 5-17. Format of a Page-Directory-Pointer-Table Entry (PDPTE) that References a Page Directory
Bit Position(s)    Contents
1 (R/W) Read/write; if 0, writes may not be allowed to the 1-GByte region controlled by this entry (see Section 5.6)
2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 1-GByte region controlled by this entry (see Section
5.6)
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page directory referenced by
this entry (see Section 5.9.2)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page directory referenced by
this entry (see Section 5.9.2)
5 (A) Accessed; indicates whether this entry has been used for linear-address translation (see Section 5.8)
6 Ignored
7 (PS) Page size; must be 0 (otherwise, this entry maps a 1-GByte page; see Table 5-16)
10:8 Ignored
11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)
(M–1):12 Physical address of 4-KByte aligned page directory referenced by this entry
62:52 Ignored
63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 1-GByte region controlled
by this entry; see Section 5.6); otherwise, reserved (must be 0)
Bit Position(s)    Contents
1 (R/W) Read/write; if 0, writes may not be allowed to the 2-MByte page referenced by this entry (see Section 5.6)
2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte page referenced by this entry (see Section
5.6)
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 2-MByte page referenced by
this entry (see Section 5.9.2)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 2-MByte page referenced by
this entry (see Section 5.9.2)
5 (A) Accessed; indicates whether software has accessed the 2-MByte page referenced by this entry (see Section 5.8)
6 (D) Dirty; indicates whether software has written to the 2-MByte page referenced by this entry (see Section 5.8)
7 (PS) Page size; must be 1 (otherwise, this entry references a page table; see Table 5-19)
8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise
10:9 Ignored
11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)
12 (PAT) Indirectly determines the memory type used to access the 2-MByte page referenced by this entry (see Section
5.9.2)
58:52 Ignored
62:59 Protection key; if CR4.PKE = 1 or CR4.PKS = 1, this may control the page’s access rights (see Section 5.6.2);
otherwise, it is ignored and not used to control access rights.
63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte page controlled by
this entry; see Section 5.6); otherwise, reserved (must be 0)
Table 5-19. Format of a Page-Directory Entry that References a Page Table (Contd.)
Bit Position(s)        Contents
1 (R/W) Read/write; if 0, writes may not be allowed to the 2-MByte region controlled by this entry (see Section 5.6)
2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 2-MByte region controlled by this entry (see Section
5.6)
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the page table referenced by this
entry (see Section 5.9.2)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the page table referenced by this
entry (see Section 5.9.2)
5 (A) Accessed; indicates whether this entry has been used for linear-address translation (see Section 5.8)
6 Ignored
7 (PS) Page size; must be 0 (otherwise, this entry maps a 2-MByte page; see Table 5-18)
10:8 Ignored
11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)
(M–1):12 Physical address of 4-KByte aligned page table referenced by this entry
62:52 Ignored
63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 2-MByte region controlled
by this entry; see Section 5.6); otherwise, reserved (must be 0)
Table 5-20. Format of a Page-Table Entry that Maps a 4-KByte Page (Contd.)
Bit Position(s)        Contents
1 (R/W) Read/write; if 0, writes may not be allowed to the 4-KByte page referenced by this entry (see Section 5.6)
2 (U/S) User/supervisor; if 0, user-mode accesses are not allowed to the 4-KByte page referenced by this entry (see Section
5.6)
3 (PWT) Page-level write-through; indirectly determines the memory type used to access the 4-KByte page referenced by
this entry (see Section 5.9.2)
4 (PCD) Page-level cache disable; indirectly determines the memory type used to access the 4-KByte page referenced by this
entry (see Section 5.9.2)
5 (A) Accessed; indicates whether software has accessed the 4-KByte page referenced by this entry (see Section 5.8)
6 (D) Dirty; indicates whether software has written to the 4-KByte page referenced by this entry (see Section 5.8)
7 (PAT) Indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 5.9.2)
8 (G) Global; if CR4.PGE = 1, determines whether the translation is global (see Section 5.10); ignored otherwise
10:9 Ignored
11 (R) For ordinary paging, ignored; for HLAT paging, restart (if 1, linear-address translation is restarted with ordinary
paging)
58:52 Ignored
62:59 Protection key; if CR4.PKE = 1 or CR4.PKS = 1, this may control the page’s access rights (see Section 5.6.2);
otherwise, it is ignored and not used to control access rights.
63 (XD) If IA32_EFER.NXE = 1, execute-disable (if 1, instruction fetches are not allowed from the 4-KByte page controlled by
this entry; see Section 5.6); otherwise, reserved (must be 0)
[Figure not reproduced: it shows, bit by bit, the formats of CR3 and of the PML5E, PML4E, PDPTE, PDE, and PTE with 4-level paging and 5-level paging, including the layouts of both present and not-present (P = 0) entries, the address fields up to bit M–1 (M = MAXPHYADDR), and the Ignored, Reserved, XD, restart, PCD, PWT, U/S, R/W, accessed, and dirty bit positions summarized in the tables earlier in this section.]
Figure 5-11. Formats of CR3 and Paging-Structure Entries with 4-Level Paging and 5-Level Paging
NOTES:
1. M is an abbreviation for MAXPHYADDR.
2. Reserved fields must be 0. On processors that support linear-address masking (see Section 4.4), bits 62:61 configure that feature and
may be set to 1. Because linear-address masking is not a paging feature, those bits are not illustrated here.
3. If IA32_EFER.NXE = 0 and the P flag of a paging-structure entry is 1, the XD flag (bit 63) is reserved.
4. Bit 11 is R (restart) only for HLAT paging; it is ignored for ordinary paging.
5. The protection key is used only if software has enabled the appropriate feature; see Section 5.6.2. It is ignored otherwise.
NOTE
If HLAT paging is restarted, permissions are determined only by the access rights specified by the
paging-structure entries that the subsequent ordinary paging used to translate the linear address.
The access rights specified by the entries used earlier by HLAT paging do not apply.
Shadow-stack accesses are allowed only to shadow-stack addresses. A linear address is a shadow-stack
address if the following are true of the translation of the linear address: (1) the R/W flag (bit 1) is 0 and the dirty
flag (bit 6) is 1 in the paging-structure entry that maps the page containing the linear address; and (2) the R/W
flag is 1 in every other paging-structure entry controlling the translation of the linear address.
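As an illustration, the following is a minimal C sketch of this definition, assuming a hypothetical page walk has already collected the paging-structure entries used by the translation (leaf entry last). It is not an architectural algorithm, only a restatement of the two conditions above.

    #include <stdbool.h>
    #include <stdint.h>

    #define PG_RW    (1ULL << 1)   /* R/W flag (bit 1) */
    #define PG_DIRTY (1ULL << 6)   /* dirty flag (bit 6) in the entry that maps the page */

    /* entries[0..n-2]: paging-structure entries controlling the translation;
       entries[n-1]:    the entry that maps the page containing the address.  */
    static bool is_shadow_stack_address(const uint64_t *entries, int n)
    {
        uint64_t leaf = entries[n - 1];
        if ((leaf & PG_RW) != 0 || (leaf & PG_DIRTY) == 0)
            return false;                  /* condition (1): R/W = 0 and dirty = 1 */
        for (int i = 0; i < n - 1; i++)
            if ((entries[i] & PG_RW) == 0)
                return false;              /* condition (2): R/W = 1 in every other entry */
        return true;
    }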
The following items detail how paging determines access rights (only the items noted explicitly apply to shadow-
stack accesses):
NOTE
Many of the items below refer to an address with a protection key for which read (or write) access
is permitted. Section 5.6.2 provides details on when a protection key will permit (or not permit) a
data access (read or write) to a linear address using that protection key.
• If EFLAGS.AC = 1 and the access is explicit, data may be written to any user-mode address
with a translation for which the R/W flag is 1 in every paging-structure entry controlling the
translation and with a protection key for which write access is permitted; data may not be
written to any user-mode address with a translation for which the R/W flag is 0 in any paging-
structure entry controlling the translation.
• If EFLAGS.AC = 0 or the access is implicit, data may not be written to any user-mode address.
— Instruction fetches from supervisor-mode addresses.
• For 32-bit paging or if IA32_EFER.NXE = 0, instructions may be fetched from any supervisor-mode
address.
• For other paging modes with IA32_EFER.NXE = 1, instructions may be fetched from any supervisor-
mode address with a translation for which the XD flag (bit 63) is 0 in every paging-structure entry
controlling the translation; instructions may not be fetched from any supervisor-mode address with a
translation for which the XD flag is 1 in any paging-structure entry controlling the translation.
— Instruction fetches from user-mode addresses.
Access rights depend on the values of CR4.SMEP:
• If CR4.SMEP = 0, access rights depend on the paging mode and the value of IA32_EFER.NXE:
— For 32-bit paging or if IA32_EFER.NXE = 0, instructions may be fetched from any user-mode
address.
— For other paging modes with IA32_EFER.NXE = 1, instructions may be fetched from any user-
mode address with a translation for which the XD flag is 0 in every paging-structure entry
controlling the translation; instructions may not be fetched from any user-mode address with a
translation for which the XD flag is 1 in any paging-structure entry controlling the translation.
• If CR4.SMEP = 1, instructions may not be fetched from any user-mode address.
— Supervisor-mode shadow-stack accesses are allowed only to supervisor-mode shadow-stack addresses
(see above).
• For user-mode accesses:
— Data reads.
Access rights depend on the mode of the linear address:
• Data may be read from any user-mode address with a protection key for which read access is
permitted.
• Data may not be read from any supervisor-mode address.
— Data writes.
Access rights depend on the mode of the linear address:
• Data may be written to any user-mode address with a translation for which the R/W flag is 1 in every
paging-structure entry controlling the translation and with a protection key for which write access is
permitted.
• Data may not be written to any supervisor-mode address.
— Instruction fetches.
Access rights depend on the mode of the linear address, the paging mode, and the value of
IA32_EFER.NXE:
• For 32-bit paging or if IA32_EFER.NXE = 0, instructions may be fetched from any user-mode address.
• For other paging modes with IA32_EFER.NXE = 1, instructions may be fetched from any user-mode
address with a translation for which the XD flag is 0 in every paging-structure entry controlling the
translation.
• Instructions may not be fetched from any supervisor-mode address.
— User-mode shadow-stack accesses made outside enclave mode are allowed only to user-mode shadow-
stack addresses (see above). User-mode shadow-stack accesses made in enclave mode are treated like
ordinary data accesses (see above).
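The following is a hedged C sketch of the user-mode rules in the list above, assuming a paging mode other than 32-bit paging and omitting protection keys, SMEP/SMAP, and shadow-stack accesses. The combined flags are the logical-AND of the R/W and U/S flags and the logical-OR of the XD flags over the entries controlling the translation.

    #include <stdbool.h>
    #include <stdint.h>

    #define PG_RW (1ULL << 1)
    #define PG_US (1ULL << 2)
    #define PG_XD (1ULL << 63)

    struct rights { bool rw; bool us; bool xd; };

    /* Combine the flags of every paging-structure entry controlling one translation. */
    static struct rights combine(const uint64_t *e, int n)
    {
        struct rights r = { true, true, false };
        for (int i = 0; i < n; i++) {
            r.rw = r.rw && (e[i] & PG_RW);   /* logical-AND of the R/W flags */
            r.us = r.us && (e[i] & PG_US);   /* logical-AND of the U/S flags */
            r.xd = r.xd || (e[i] & PG_XD);   /* logical-OR of the XD flags   */
        }
        return r;
    }

    /* r.us = 1 means the linear address is a user-mode address. */
    static bool user_read_ok (struct rights r)           { return r.us; }
    static bool user_write_ok(struct rights r)           { return r.us && r.rw; }
    static bool user_fetch_ok(struct rights r, bool nxe) { return r.us && (!nxe || !r.xd); }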
A processor may cache information from the paging-structure entries in TLBs and paging-structure caches (see
Section 5.10). These structures may include information about access rights. The processor may enforce access
rights based on the TLBs and paging-structure caches instead of on the paging structures in memory.
This fact implies that, if software modifies a paging-structure entry to change access rights, the processor might
not use that change for a subsequent access to an affected linear address (see Section 5.10.4.3). See Section
5.10.4.2 for how software can ensure that the processor uses the modified access rights.
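A minimal sketch of that sequence follows, assuming a hypothetical pte pointer into the paging structures and GCC-style inline assembly; Section 5.10.4.2 describes the architectural requirements in full.

    #include <stdint.h>

    static inline void invlpg(void *linear_addr)
    {
        __asm__ __volatile__("invlpg (%0)" : : "r"(linear_addr) : "memory");
    }

    /* Make the 4-KByte page mapped by *pte read-only, then invalidate so the
       processor does not keep enforcing the old access rights from its TLBs
       or paging-structure caches. */
    static void make_page_readonly(volatile uint64_t *pte, void *linear_addr)
    {
        *pte &= ~(1ULL << 1);   /* clear the R/W flag (bit 1) */
        invlpg(linear_addr);    /* invalidate cached information for the page */
    }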
[Figure not reproduced: it shows the page-fault error code, whose defined flags are P (bit 0), W/R (bit 1), U/S (bit 2), RSVD (bit 3), I/D (bit 4), PK (bit 5), SS (bit 6), HLAT (bit 7), and SGX (bit 15); the remaining bits are reserved.]
HLAT   0   The fault occurred during ordinary paging or due to access rights.
       1   The fault occurred during HLAT paging.
1. If HLAT paging encounters a paging-structure entry that sets a reserved bit, there is no translation even if bit 11 of the entry
indicates a restart. In this case, there is a page fault and the translation is not restarted.
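For illustration, a small C decoder for the page-fault error code bits listed above follows (the field names and handler context are hypothetical).

    #include <stdbool.h>
    #include <stdint.h>

    struct pf_error {
        bool p;     /* bit 0:  0 = no translation (not-present), 1 = protection violation */
        bool wr;    /* bit 1:  0 = read, 1 = write */
        bool us;    /* bit 2:  1 = user-mode access */
        bool rsvd;  /* bit 3:  reserved bit set in a paging-structure entry */
        bool id;    /* bit 4:  instruction fetch */
        bool pk;    /* bit 5:  protection-key violation */
        bool ss;    /* bit 6:  shadow-stack access */
        bool hlat;  /* bit 7:  fault occurred during HLAT paging */
        bool sgx;   /* bit 15: SGX-specific access-control violation */
    };

    static struct pf_error decode_pf_error(uint64_t ec)
    {
        struct pf_error e = {
            .p    = (ec >> 0) & 1, .wr   = (ec >> 1) & 1, .us  = (ec >> 2)  & 1,
            .rsvd = (ec >> 3) & 1, .id   = (ec >> 4) & 1, .pk  = (ec >> 5)  & 1,
            .ss   = (ec >> 6) & 1, .hlat = (ec >> 7) & 1, .sgx = (ec >> 15) & 1,
        };
        return e;
    }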
1. Some past processors had errata for some page faults that occur when there is no translation for the linear address because the P
flag was 0 in one of the paging-structure entries used to translate that address. Due to these errata, some such page faults pro-
duced error codes that cleared bit 0 (P flag) and set bit 3 (RSVD flag).
NOTE
If software on one logical processor writes to a page while software on another logical processor
concurrently clears the R/W flag in the paging-structure entry that maps the page, execution on
some processors may result in the entry’s dirty flag being set (due to the write on the first logical
processor) and the entry’s R/W flag being clear (due to the update to the entry on the second
logical processor). This will never occur on a processor that supports control-flow enforcement
technology (CET). Specifically, a processor that supports CET will never set the dirty flag in a
paging-structure entry in which the R/W flag is clear.
Memory-management software may clear these flags when a page or a paging structure is initially loaded into
physical memory. These flags are “sticky,” meaning that, once set, the processor does not clear them; only soft-
ware can clear them.
A processor may cache information from the paging-structure entries in TLBs and paging-structure caches (see
Section 5.10). This fact implies that, if software changes an accessed flag or a dirty flag from 1 to 0, the processor
might not set the corresponding bit in memory on a subsequent access using an affected linear address (see
Section 5.10.4.3). See Section 5.10.4.2 for how software can ensure that these bits are updated as desired.
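As one possible use of these flags, the following is a hedged sketch of software harvesting dirty flags from a 512-entry page table, clearing them, and invalidating the affected translations so that subsequent writes set the flags again. The table layout and the bitmap are illustrative, not prescribed by the architecture.

    #include <stdint.h>

    #define PTE_PRESENT (1ULL << 0)
    #define PTE_DIRTY   (1ULL << 6)

    static inline void invlpg(void *linear_addr)
    {
        __asm__ __volatile__("invlpg (%0)" : : "r"(linear_addr) : "memory");
    }

    /* Record and clear the dirty flags of a 512-entry page table that maps the
       4-KByte pages starting at linear address 'base'. */
    static void harvest_dirty(volatile uint64_t *pt, uint8_t *dirty_bitmap, uint8_t *base)
    {
        for (int i = 0; i < 512; i++) {
            uint64_t e = pt[i];
            if ((e & (PTE_PRESENT | PTE_DIRTY)) == (PTE_PRESENT | PTE_DIRTY)) {
                dirty_bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
                pt[i] = e & ~PTE_DIRTY;             /* only software clears the sticky flag */
                invlpg(base + (uint64_t)i * 4096);  /* so the processor sets it again on the next write */
            }
        }
    }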
NOTE
The accesses used by the processor to set these flags may or may not be exposed to the
processor’s self-modifying code detection logic. If the processor is executing code from the same
memory area that is being used for the paging structures, the setting of these flags may or may not
result in an immediate change to the executing code stream.
1. With PAE paging, the PDPTEs are not used during linear-address translation but only to load the PDPTE registers for some execu-
tions of the MOV CR instruction (see Section 5.4.1). For this reason, the PDPTEs do not contain accessed flags with PAE paging.
2. The PAT is supported on Pentium III and more recent processor families. See Section 5.1.4 for how to determine whether the PAT is
supported.
5.9.1 Paging and Memory Typing When the PAT is Not Supported (Pentium Pro and Pentium
II Processors)
NOTE
The PAT is supported on all processors that support 4-level paging or 5-level paging. Thus, this
section applies only to 32-bit paging and PAE paging.
If the PAT is not supported, paging contributes to memory typing in conjunction with the memory-type range regis-
ters (MTRRs) as specified in Table 13-6 in Section 13.5.2.1.
For any access to a physical address, the table combines the memory type specified for that physical address by the
MTRRs with a PCD value and a PWT value. The latter two values are determined as follows:
• For an access to a PDE with 32-bit paging, the PCD and PWT values come from CR3.
• For an access to a PDE with PAE paging, the PCD and PWT values come from the relevant PDPTE register.
• For an access to a PTE, the PCD and PWT values come from the relevant PDE.
• For an access to the physical address that is the translation of a linear address, the PCD and PWT values come
from the relevant PTE (if the translation uses a 4-KByte page) or the relevant PDE (otherwise).
• With PAE paging, the UC memory type is used when loading the PDPTEs (see Section 5.4.1).
5.9.2 Paging and Memory Typing When the PAT is Supported (Pentium III and More Recent
Processor Families)
If the PAT is supported, paging contributes to memory typing in conjunction with the PAT and the memory-type
range registers (MTRRs) as specified in Table 13-7 in Section 13.5.2.2.
The PAT is a 64-bit MSR (IA32_PAT; MSR index 277H) comprising eight (8) 8-bit entries (entry i comprises
bits 8i+7:8i of the MSR).
For any access to a physical address, the table combines the memory type specified for that physical address by the
MTRRs with a memory type selected from the PAT. Table 13-11 in Section 13.12.3 specifies how a memory type is
selected from the PAT. Specifically, it comes from entry i of the PAT, where i is defined as follows:
• For an access to an entry in a paging structure whose address is in CR3 (e.g., the PML4 table with 4-level
paging):
— For 4-level paging or 5-level paging with CR4.PCIDE = 1, i = 0.
— Otherwise, i = 2*PCD+PWT, where the PCD and PWT values come from CR3.
• For an access to a PDE with PAE paging, i = 2*PCD+PWT, where the PCD and PWT values come from the
relevant PDPTE register.
• For an access to a paging-structure entry X whose address is in another paging-structure entry Y, i =
2*PCD+PWT, where the PCD and PWT values come from Y.
• For an access to the physical address that is the translation of a linear address, i = 4*PAT+2*PCD+PWT, where
the PAT, PCD, and PWT values come from the relevant PTE (if the translation uses a 4-KByte page), the relevant
PDE (if the translation uses a 2-MByte page or a 4-MByte page), or the relevant PDPTE (if the translation uses
a 1-GByte page).
• With PAE paging, the WB memory type is used when loading the PDPTEs (see Section 5.4.1).1
1. Some older IA-32 processors used the UC memory type when loading the PDPTEs. Some processors may use the UC memory type if
CR0.CD = 1 or if the MTRRs are disabled. These behaviors are model-specific and not architectural.
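The following is a hedged sketch of the index computation i = 4*PAT + 2*PCD + PWT for the final translation, reading the selected entry from the IA32_PAT MSR. rdmsr() is a hypothetical MSR-read helper, and the memory type is encoded in bits 2:0 of each 8-bit PAT entry.

    #include <stdint.h>

    #define IA32_PAT 0x277   /* MSR index 277H */

    extern uint64_t rdmsr(uint32_t msr);   /* hypothetical MSR-read helper */

    /* Memory type selected from the PAT for the translation of a linear address,
       given the PAT, PCD, and PWT bits of the paging-structure entry that maps
       the page (PTE, PDE, or PDPTE, depending on the page size). */
    static uint8_t pat_memory_type(unsigned pat_bit, unsigned pcd, unsigned pwt)
    {
        unsigned i = 4 * pat_bit + 2 * pcd + pwt;                 /* PAT entry index, 0..7 */
        return (uint8_t)((rdmsr(IA32_PAT) >> (8 * i)) & 0x07);    /* entry i = bits 8i+7:8i */
    }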
1. Note that, while HLAT paging (Section 5.5.3) does not use CR3 to locate the first paging structure, it does use the PCID in CR3[11:0]
when CR4.PCIDE = 1.
NOTE
In revisions of this manual that were produced when no processors allowed CR4.PCIDE to be set to
1, Section 5.10, “Caching Translation Information,” discussed the caching of translation information
without any reference to PCIDs. While the section now refers to PCIDs in its specification of this
caching, this documentation change is not intended to imply any change to the behavior of
processors that do not allow CR4.PCIDE to be set to 1.
If software modifies the paging structures so that the page size used for a 4-KByte range of linear addresses
changes, the TLBs may subsequently contain multiple translations for the address range (one for each page size).
A reference to a linear address in the address range may use any of these translations. Which translation is used
may vary from one execution to another, and the choice may be implementation-specific.
— The physical address from the PML4E (the address of the page-directory-pointer table).
— The logical-AND of the R/W flags in the PML5E and the PML4E.
— The logical-AND of the U/S flags in the PML5E and the PML4E.
— The logical-OR of the XD flags in the PML5E and the PML4E.
— The values of the PCD and PWT flags of the PML4E.
The following items detail how a processor may use the PML4E cache:
— If the processor has a PML4E-cache entry for a linear address, it may use that entry when translating the
linear address (instead of the PML5E and PML4E in memory).
— The processor does not create a PML4E-cache entry unless the P flags are 1 and all reserved bits are 0 in
the PML5E and the PML4E in memory.
— The processor does not create a PML4E-cache entry unless the accessed flags are 1 in the PML5E and the
PML4E in memory; before caching a translation, the processor sets any accessed flags that are not already
1.
— The processor may create a PML4E-cache entry even if there are no translations for any linear address that
might use that entry (e.g., because the P flags are 0 in all entries in the referenced page-directory-pointer
table).
— If the processor creates a PML4E-cache entry, the processor may retain it unmodified even if software
subsequently modifies the corresponding PML4E in memory.
• PDPTE cache (4-level paging and 5-level paging only).1 The use of the PDPTE cache depends on the paging
mode:
— For 4-level paging, each PDPTE-cache entry is referenced by an 18-bit value and is used for linear
addresses for which bits 47:30 have that value.
— For 5-level paging, each PDPTE-cache entry is referenced by a 27-bit value and is used for linear addresses
for which bits 56:30 have that value.
A PDPTE-cache entry contains information from the PML5E, PML4E, and PDPTE used to translate the relevant linear
addresses (for 4-level paging, the PML5E does not apply):
— The physical address from the PDPTE (the address of the page directory). (No PDPTE-cache entry is created
for a PDPTE that maps a 1-GByte page.)
— The logical-AND of the R/W flags in the PML5E, PML4E, and PDPTE.
— The logical-AND of the U/S flags in the PML5E, PML4E, and PDPTE.
— The logical-OR of the XD flags in the PML5E, PML4E, and PDPTE.
— The values of the PCD and PWT flags of the PDPTE.
The following items detail how a processor may use the PDPTE cache:
— If the processor has a PDPTE-cache entry for a linear address, it may use that entry when translating the
linear address (instead of the PML5E, PML4E, and PDPTE in memory).
— The processor does not create a PDPTE-cache entry unless the P flags are 1, the PS flags are 0, and the
reserved bits are 0 in the PML5E, PML4E, and PDPTE in memory.
— The processor does not create a PDPTE-cache entry unless the accessed flags are 1 in the PML5E, PML4E,
and PDPTE in memory; before caching a translation, the processor sets any accessed flags that are not
already 1.
— The processor may create a PDPTE-cache entry even if there are no translations for any linear address that
might use that entry.
— If the processor creates a PDPTE-cache entry, the processor may retain it unmodified even if software
subsequently modifies the corresponding PML5E, PML4E, or PDPTE in memory.
1. With PAE paging, the PDPTEs are stored in internal, non-architectural registers. The operation of these registers is described in Sec-
tion 5.4.1 and differs from that described here.
• PDE cache. The use of the PDE cache depends on the paging mode:
— For 32-bit paging, each PDE-cache entry is referenced by a 10-bit value and is used for linear addresses for
which bits 31:22 have that value.
— For PAE paging, each PDE-cache entry is referenced by an 11-bit value and is used for linear addresses for
which bits 31:21 have that value.
— For 4-level paging, each PDE-cache entry is referenced by a 27-bit value and is used for linear addresses for
which bits 47:21 have that value.
— For 5-level paging, each PDE-cache entry is referenced by a 36-bit value and is used for linear addresses for
which bits 56:21 have that value.
A PDE-cache entry contains information from the PML5E, PML4E, PDPTE, and PDE used to translate the relevant
linear addresses (for 32-bit paging and PAE paging, only the PDE applies; for 4-level paging, the PML5E does
not apply):
— The physical address from the PDE (the address of the page table). (No PDE-cache entry is created for a
PDE that maps a page.)
— The logical-AND of the R/W flags in the PML5E, PML4E, PDPTE, and PDE.
— The logical-AND of the U/S flags in the PML5E, PML4E, PDPTE, and PDE.
— The logical-OR of the XD flags in the PML5E, PML4E, PDPTE, and PDE.
— The values of the PCD and PWT flags of the PDE.
The following items detail how a processor may use the PDE cache (references below to PML5Es, PML4Es, and
PDPTEs apply only to 4-level paging and to 5-level paging, as appropriate):
— If the processor has a PDE-cache entry for a linear address, it may use that entry when translating the
linear address (instead of the PML5E, PML4E, PDPTE, and PDE in memory).
— The processor does not create a PDE-cache entry unless the P flags are 1, the PS flags are 0, and the
reserved bits are 0 in the PML5E, PML4E, PDPTE, and PDE in memory.
— The processor does not create a PDE-cache entry unless the accessed flag is 1 in the PML5E, PML4E, PDPTE,
and PDE in memory; before caching a translation, the processor sets any accessed flags that are not
already 1.
— The processor may create a PDE-cache entry even if there are no translations for any linear address that
might use that entry.
— If the processor creates a PDE-cache entry, the processor may retain it unmodified even if software subse-
quently modifies the corresponding PML5E, PML4E, PDPTE, or PDE in memory.
Information from a paging-structure entry can be included in entries in the paging-structure caches for other
paging-structure entries referenced by the original entry. For example, if the R/W flag is 0 in a PML4E, then the R/W
flag will be 0 in any PDPTE-cache entry for a PDPTE from the page-directory-pointer table referenced by that
PML4E. This is because the R/W flag of each such PDPTE-cache entry is the logical-AND of the R/W flags in the
appropriate PML4E and PDPTE.
On processors that support HLAT paging (see Section 5.5.1), each entry in a paging-structure cache indicates
whether the entry was cached during ordinary paging or HLAT paging. When the processor commences linear-
address translation using ordinary paging (respectively, HLAT paging), it will use only entries that indicate that they
were cached during ordinary paging (respectively, HLAT paging).
Entries that were cached during HLAT paging also include the restart flag (bit 11) of the original paging-structure
entry. When the processor commences HLAT paging using such an entry, it immediately restarts (using ordinary
paging) if this cached restart flag is 1.
The paging-structure caches contain information only from paging-structure entries that reference other paging
structures (and not those that map pages). Because the G flag is not used in such paging-structure entries, the
global-page feature does not affect the behavior of the paging-structure caches.
The processor may create entries in paging-structure caches for translations required for prefetches and for
accesses that are a result of speculative execution that would never actually occur in the executed code path.
As noted in Section 5.10.1, any entries created in paging-structure caches by a logical processor are associated
with the current PCID.
A processor may or may not implement any of the paging-structure caches. Software should rely on neither their
presence nor their absence. The processor may invalidate entries in these caches at any time. Because the
processor may create the cache entries at the time of translation and not update them following subsequent modi-
fications to the paging structures in memory, software should take care to invalidate the cache entries appropri-
ately when causing such modifications. The invalidation of TLBs and the paging-structure caches is described in
Section 5.10.4.
— Any PML4E-cache entry associated with linear addresses with 0 in bits 47:39 contains address X.
— Any PDPTE-cache entry associated with linear addresses with 0 in bits 47:30 contains address X. This is
because the translation for a linear address for which the value of bits 47:30 is 0 uses the value of
bits 47:39 (0) to locate a page-directory-pointer table at address X (the address of the PML4 table). It then
uses the value of bits 38:30 (also 0) to find address X again and to store that address in the PDPTE-cache
entry.
— Any PDE-cache entry associated with linear addresses with 0 in bits 47:21 contains address X for similar
reasons.
— Any TLB entry for page number 0 (associated with linear addresses with 0 in bits 47:12) translates to page
frame X >> 12 for similar reasons.
The same PML4E contributes its address X to all these cache entries because the self-referencing nature of the
entry causes it to be used as a PML4E, a PDPTE, a PDE, and a PTE.
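This self-referencing arrangement is the basis of the commonly used "recursive" page-table mapping. The following is a hedged sketch, assuming 4-level paging with the self-referencing PML4E at index 0 exactly as in the example above, of how software could compute the linear address at which the PTE mapping a given linear address is itself accessible.

    #include <stdint.h>

    /* With the self-referencing PML4E at index 0 described above, shifting a
       linear address right by 9 turns its PML4/PDPT/PD/PT indices into the
       PDPT/PD/PT indices and 8-byte-aligned offset of its PTE, while the
       recursive slot (index 0) supplies the new PML4 index. */
    static uint64_t pte_linear_address(uint64_t la)
    {
        return (la >> 9) & 0x0000007FFFFFFFF8ULL;   /* keep bits 38:3; bits 47:39 become 0 */
    }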
1. If the paging structures map the linear address using a page larger than 4 KBytes and there are multiple TLB entries for that page
(see Section 5.10.2.3), the instruction invalidates all of them.
2. If the paging structures map the linear address using a page larger than 4 KBytes and there are multiple TLB entries for that page
(see Section 5.10.2.3), the instruction invalidates all of them.
— If CR4.PCIDE = 0, the instruction invalidates all TLB entries associated with PCID 000H except those for
global pages. It also invalidates all entries in all paging-structure caches associated with PCID 000H.
— If CR4.PCIDE = 1 and bit 63 of the instruction’s source operand is 0, the instruction invalidates all TLB
entries associated with the PCID specified in bits 11:0 of the instruction’s source operand except those for
global pages. It also invalidates all entries in all paging-structure caches associated with that PCID. It is not
required to invalidate entries in the TLBs and paging-structure caches that are associated with other PCIDs.
— If CR4.PCIDE = 1 and bit 63 of the instruction’s source operand is 1, the instruction is not required to
invalidate any TLB entries or entries in paging-structure caches.
• MOV to CR4. The behavior of the instruction depends on the bits being modified:
— The instruction invalidates all TLB entries (including global entries) and all entries in all paging-structure
caches (for all PCIDs) if (1) it changes the value of CR4.PGE;1 or (2) it changes the value of the CR4.PCIDE
from 1 to 0.
— The instruction invalidates all TLB entries and all entries in all paging-structure caches for the current PCID
if (1) it changes the value of CR4.PAE; or (2) it changes the value of CR4.SMEP from 0 to 1.
• Task switch. If a task switch changes the value of CR3, it invalidates all TLB entries associated with PCID 000H
except those for global pages. It also invalidates all entries in all paging-structure caches associated with PCID
000H.2
• VMX transitions. See Section 5.11.1.
The processor is always free to invalidate additional entries in the TLBs and paging-structure caches. The following
are some examples:
• INVLPG may invalidate TLB entries for pages other than the one corresponding to its linear-address operand. It
may invalidate TLB entries and paging-structure-cache entries associated with PCIDs other than the current
PCID.
• INVPCID may invalidate TLB entries for pages other than the one corresponding to the specified linear address.
It may invalidate TLB entries and paging-structure-cache entries associated with PCIDs other than the
specified PCID.
• MOV to CR0 may invalidate TLB entries even if CR0.PG is not changing. For example, this may occur if either
CR0.CD or CR0.NW is modified.
• MOV to CR3 may invalidate TLB entries for global pages. If CR4.PCIDE = 1 and bit 63 of the instruction’s source
operand is 0, it may invalidate TLB entries and entries in the paging-structure caches associated with PCIDs
other than the PCID it is establishing. It may invalidate entries if CR4.PCIDE = 1 and bit 63 of the instruction’s
source operand is 1.
• MOV to CR4 may invalidate TLB entries when changing CR4.PSE or when changing CR4.SMEP from 1 to 0.
• On a processor supporting Hyper-Threading Technology, invalidations performed on one logical processor may
invalidate entries in the TLBs and paging-structure caches used by other logical processors.
(Other instructions and operations may invalidate entries in the TLBs and the paging-structure caches, but the
instructions identified above are recommended.)
In addition to the instructions identified above, page faults invalidate entries in the TLBs and paging-structure
caches. In particular, a page-fault exception resulting from an attempt to use a linear address will invalidate any
TLB entries that are for a page number corresponding to that linear address and that are associated with the
current PCID. It also invalidates all entries in the paging-structure caches that would be used for that linear address
and that are associated with the current PCID.3 These invalidations ensure that the page-fault exception will not
recur (if the faulting instruction is re-executed) if it would not be caused by the contents of the paging structures
1. If CR4.PGE is changing from 0 to 1, there were no global TLB entries before the execution; if CR4.PGE is changing from 1 to 0, there
will be no global TLB entries after the execution.
2. Task switches do not occur in IA-32e mode and thus cannot occur with 4-level paging or 5-level paging. Since CR4.PCIDE can be
set only with 4-level paging or 5-level paging, task switches occur only with CR4.PCIDE = 0.
3. Unlike INVLPG, page faults need not invalidate all entries in the paging-structure caches, only those that would be used to translate
the faulting linear address.
in memory (and if, therefore, it resulted from cached entries that were not invalidated after the paging structures
were modified in memory).
As noted in Section 5.10.2, some processors may choose to cache multiple smaller-page TLB entries for a transla-
tion specified by the paging structures to use a page larger than 4 KBytes. There is no way for software to be aware
that multiple translations for smaller pages have been used for a large page. The INVLPG instruction and page
faults provide the same assurances that they provide when a single TLB entry is used: they invalidate all TLB
entries corresponding to the translation specified by the paging structures.
1. One execution of INVLPG is sufficient even for a page with size greater than 4 KBytes.
PDE); then invalidate any translations for the affected linear addresses (see above); and then modify the
relevant paging-structure entry to set the P flag and establish modified translation(s) for the new page size.
• Software should clear bit 63 of the source operand to a MOV to CR3 instruction that establishes a PCID that had
been used earlier for a different linear-address space (e.g., with a different value in bits 51:12 of CR3). This
ensures invalidation of any information that may have been cached for the previous linear-address space.
This assumes that both linear-address spaces use the same global pages and that it is thus not necessary to
invalidate any global TLB entries. If that is not the case, software should invalidate those entries by executing
MOV to CR4 to modify CR4.PGE.
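The following is a hedged sketch of composing the source operand for such a MOV to CR3 when CR4.PCIDE = 1: bits 11:0 hold the PCID, bits 51:12 hold the address of the top-level paging structure, and bit 63, when set, requests that cached information for the PCID be preserved. The function name is illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define CR3_NOFLUSH   (1ULL << 63)
    #define CR3_ADDR_MASK 0x000FFFFFFFFFF000ULL   /* bits 51:12 */
    #define CR3_PCID_MASK 0xFFFULL                /* bits 11:0  */

    /* Build a CR3 value for MOV to CR3 with CR4.PCIDE = 1. Pass preserve_tlb = false
       (bit 63 clear) when the PCID was last used for a different linear-address space. */
    static uint64_t make_cr3(uint64_t top_table_phys, uint16_t pcid, bool preserve_tlb)
    {
        uint64_t cr3 = (top_table_phys & CR3_ADDR_MASK) | (pcid & CR3_PCID_MASK);
        if (preserve_tlb)
            cr3 |= CR3_NOFLUSH;
        return cr3;
    }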
1. If it is also the case that no invalidation was performed the last time the P flag was changed from 1 to 0, the processor may use a
TLB entry or paging-structure cache entry that was created when the P flag had earlier been 1.
1. If the accesses are to different pages, this may occur even if invalidation has not been delayed.
1. Begin barrier: Stop all but one logical processor; that is, cause all but one to execute the HLT instruction or to
enter a spin loop.
2. Allow the active logical processor to change the necessary paging-structure entries.
3. Allow all logical processors to perform invalidations appropriate to the modifications to the paging-structure
entries.
4. Allow all logical processors to resume normal operation.
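A hedged sketch of this four-step sequence follows. The rendezvous and cross-processor helpers are hypothetical (an operating system would typically implement them with inter-processor interrupts), and invlpg() is the INVLPG wrapper shown earlier.

    #include <stdint.h>

    extern void stop_other_logical_processors(void);     /* hypothetical: step 1 barrier */
    extern void resume_other_logical_processors(void);   /* hypothetical: step 4 */
    extern void run_on_all_logical_processors(void (*fn)(void *), void *arg);   /* hypothetical */
    extern void invlpg(void *linear_addr);                /* INVLPG wrapper shown earlier */

    static void invalidate_one(void *linear_addr) { invlpg(linear_addr); }

    static void update_mapping(volatile uint64_t *pte, uint64_t new_entry, void *linear_addr)
    {
        stop_other_logical_processors();                             /* 1. barrier */
        *pte = new_entry;                                            /* 2. modify the entry */
        run_on_all_logical_processors(invalidate_one, linear_addr);  /* 3. invalidate everywhere */
        resume_other_logical_processors();                           /* 4. resume normal operation */
    }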
Alternative, performance-optimized, TLB shootdown algorithms may be developed; however, software developers
must take care to ensure that the following conditions are met:
• All logical processors that are using the paging structures that are being modified must participate and perform
appropriate invalidations after the modifications are made.
• If the modifications to the paging-structure entries are made before the barrier or if there is no barrier, the
operating system must ensure one of the following: (1) that the affected linear-address range is not used
between the time of modification and the time of invalidation; or (2) that it is prepared to deal with the conse-
quences of the affected linear-address range being used during that period. For example, if the operating
system does not allow pages being freed to be reallocated for another purpose until after the required invalida-
tions, writes to those pages by errant software will not unexpectedly modify memory that is in use.
• Software must be prepared to deal with reads, instruction fetches, and prefetch requests to the affected linear-
address range that are a result of speculative execution that would never actually occur in the executed code
path.
When multiple logical processors are using the same linear-address space at the same time, they must coordinate
before any request to modify the paging-structure entries that control that linear-address space. In these cases,
the barrier in the TLB shootdown routine may not be required. For example, when freeing a range of linear
addresses, some other mechanism can assure no logical processor is using that range before the request to free it
is made. In this case, a logical processor freeing the range can clear the P flags in the PTEs associated with the
range, free the physical page frames associated with the range, and then signal the other logical processors using
that linear-address space to perform the necessary invalidations. All the affected logical processors must complete
their invalidations before the linear-address range and the physical page frames previously associated with that
range can be reallocated.
• VMX transitions invalidate the TLBs and paging-structure caches based on certain control settings. See Section
28.3.2.5 and Section 29.5.5 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C.
The Intel-64 and IA-32 architectures do not enforce correspondence between the boundaries of pages and
segments. A page can contain the end of one segment and the beginning of another. Similarly, a segment can
contain the end of one page and the beginning of another.
Memory-management software may be simpler and more efficient if it enforces some alignment between page and
segment boundaries. For example, if a segment which can fit in one page is placed in two pages, there may be
twice as much paging overhead to support access to that segment.
One approach to combining paging and segmentation that simplifies memory-management software is to give
each segment its own page table, as shown in Figure 5-13. This convention gives the segment a single entry in the
page directory, and this entry provides the access control information for paging the entire segment.
Page Frames
PTE
PTE
PTE
Seg. Descript. PDE
Seg. Descript. PDE
PTE
PTE
Figure 5-13. Memory Management Convention That Assigns a Page Table to Each Segment
15.Updates to Chapter 7, Volume 3A
Change bars and violet text show changes to Chapter 7 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Minor updates, mainly regarding how page faults save CR2.
This chapter describes the interrupt and exception-handling mechanism when operating in protected mode on an
Intel 64 or IA-32 processor. Most of the information provided here also applies to interrupt and exception mecha-
nisms used in real-address mode, virtual-8086 mode, and 64-bit mode.
Chapter 22, “8086 Emulation,” describes information specific to interrupt and exception mechanisms in real-
address and virtual-8086 mode. Section 7.14, “Exception and Interrupt Handling in 64-bit Mode,” describes infor-
mation specific to interrupt and exception mechanisms in IA-32e mode and 64-bit sub-mode.
Table 7-1 shows vector number assignments for architecturally defined exceptions and for the NMI interrupt. This
table gives the exception type (see Section 7.5, “Exception Classifications”) and indicates whether an error code is
saved on the stack for the exception. The source of each predefined exception and the NMI interrupt is also given.
The processor’s local APIC is normally connected to a system-based I/O APIC. Here, external interrupts received at
the I/O APIC’s pins can be directed to the local APIC through the system bus (Pentium 4, Intel Core Duo, Intel Core
2, Intel Atom, and Intel Xeon processors) or the APIC serial bus (P6 family and Pentium processors). The I/O APIC
determines the vector number of the interrupt and sends this number to the local APIC. When a system contains
multiple processors, processors can also send interrupts to one another by means of the system bus (Pentium 4,
Intel Core Duo, Intel Core 2, Intel Atom, and Intel Xeon processors) or the APIC serial bus (P6 family and Pentium
processors).
The LINT[1:0] pins are not available on the Intel486 processor and earlier Pentium processors that do not contain
an on-chip local APIC. These processors have dedicated NMI and INTR pins. With these processors, external inter-
rupts are typically generated by a system-based interrupt controller (8259A), with the interrupts being signaled
through the INTR pin.
Note that several other pins on the processor can cause a processor interrupt to occur. However, these interrupts
are not handled by the interrupt and exception mechanism described in this chapter. These pins include the
RESET#, FLUSH#, STPCLK#, SMI#, R/S#, and INIT# pins. Whether they are included on a particular processor is
implementation dependent. Pin functions are described in the data books for the individual processors. The SMI#
pin is described in Chapter 33, “System Management Mode.”
all IA-32 architecture defined interrupt vectors from 0 through 255; those that can be delivered through the local
APIC include interrupt vectors 16 through 255.
The IF flag in the EFLAGS register permits all maskable hardware interrupts to be masked as a group (see Section
7.8.1, “Masking Maskable Hardware Interrupts”). Note that when interrupts 0 through 15 are delivered through the
local APIC, the APIC indicates the receipt of an illegal vector.
1. The INT n instruction has opcode CD followed by an immediate byte encoding the value of n. In contrast, INT1 has opcode F1 and
INT3 has opcode CC.
dent. When a machine-check error is detected, the processor signals a machine-check exception (vector 18) and
returns an error code.
See Chapter 7, “Interrupt 18—Machine-Check Exception (#MC),” and Chapter 17, “Machine-Check Architecture,”
for more information about the machine-check mechanism.
NOTE
One exception subset normally reported as a fault is not restartable. Such exceptions result in loss
of some processor state. For example, executing a POPAD instruction where the stack frame
crosses over the end of the stack segment causes a fault to be reported. In this situation, the
exception handler sees that the instruction pointer (CS:EIP) has been restored as if the POPAD
instruction had not been executed. However, internal processor state (the general-purpose
registers) will have been modified. Such cases are considered programming errors. An application
causing this class of exceptions should be terminated by the operating system.
The abort-class exceptions do not support reliable restarting of the program or task. Abort handlers are designed
to collect diagnostic information about the state of the processor when the abort exception occurred and then shut
down the application and system as gracefully as possible.
Interrupts rigorously support restarting of interrupted programs and tasks without loss of continuity. The return
instruction pointer saved for an interrupt points to the next instruction to be executed at the instruction boundary
where the processor took the interrupt. If the instruction just executed has a repeat prefix, the interrupt is taken
at the end of the current iteration with the registers set to execute the next iteration.
The ability of a P6 family processor to speculatively execute instructions does not affect the taking of interrupts by
the processor. Interrupts are taken at instruction boundaries located during the retirement phase of instruction
execution; so they are always taken in the “in-order” instruction stream. See Chapter 2, “Intel® 64 and IA-32
Architectures,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for more informa-
tion about the P6 family processors’ microarchitecture and its support for out-of-order instruction execution.
Note that the Pentium processor and earlier IA-32 processors also perform varying amounts of prefetching and
preliminary decoding. With these processors as well, exceptions and interrupts are not signaled until actual “in-
order” execution of the instructions. For a given code sample, the signaling of exceptions occurs uniformly when
the code is executed on any family of IA-32 processors (except where new exceptions or new opcodes have been
defined).
1. The effect of the IOPL on these instructions is modified slightly when the virtual mode extension is enabled by setting the VME flag
in control register CR4: see Section 22.3, “Interrupt and Exception Handling in Virtual-8086 Mode.” Behavior is also impacted by the
PVI flag: see Section 22.4, “Protected-Mode Virtual Interrupts.”
2. Nonmaskable interrupts and system-management interrupts may also be inhibited on the instruction boundary following such an
execution of STI.
The processor first services a pending event from the class which has the highest priority, transferring execution to
the first instruction of the handler. Lower priority exceptions are discarded; lower priority interrupts are held
pending. Discarded exceptions may be re-generated when the event handler returns execution to the point in the
program or task where the original event occurred. While the priority among the classes listed in Table 7-2 is
consistent across processor implementations, the priority of events within a class is implementation-dependent
and may vary from processor to processor.
Table 7-2 specifies the prioritization of events that may be pending at an instruction boundary. It does not specify
the prioritization of faults that arise during instruction execution or event delivery (these include #BR, #TS, #NP,
#SS, #GP, #PF, #AC, #MF, #XM, #VE, or #CP). It also does not apply to the events generated by the “Call to Inter-
rupt Procedure” instructions (INT n, INTO, INT3, and INT1), as these events are integral to the execution of those
instructions and do not occur between instructions.
NOTE
Because interrupts are delivered to the processor core only once, an incorrectly configured IDT
could result in incomplete interrupt handling and/or the blocking of interrupt delivery.
IA-32 architecture rules need to be followed for setting up IDTR base/limit/access fields and each
field in the gate descriptors. The same rules apply for the Intel 64 architecture. This includes implicit
referencing of the destination code segment through the GDT or LDT and accessing the stack.
[Figure not reproduced: Relationship of the IDTR and IDT. The IDTR register holds the 16-bit IDT limit in bits 15:0 and the IDT base address in bits 47:16; the IDT itself is an array of 8-byte gate descriptors, with the gate for a given interrupt located at an offset of 8 times its vector from the base (offsets 0, 8, 16, ... in the figure).]
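A hedged sketch of the IDTR pseudo-descriptor and of loading it with LIDT follows, using GCC-style inline assembly and the 32-bit (protected-mode) base width from the figure; in 64-bit mode the base field is 64 bits wide.

    #include <stdint.h>

    /* IDTR pseudo-descriptor: limit in bits 15:0, base address above it. */
    struct __attribute__((packed)) idtr {
        uint16_t limit;   /* size of the IDT in bytes, minus 1 */
        uint32_t base;    /* linear base address of the IDT    */
    };

    static inline void load_idt(const struct idtr *d)
    {
        __asm__ __volatile__("lidt %0" : : "m"(*d));
    }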
[Figure not reproduced: IDT Gate Descriptors. A task gate holds a TSS segment selector in its low doubleword and the P flag, DPL, and type field 00101B in its high doubleword. Interrupt and trap gates hold a code-segment selector and offset bits 15:0 in the low doubleword, and offset bits 31:16, the P flag, DPL, the D (size of gate) bit, and the type field (D110B for an interrupt gate, D111B for a trap gate) in the high doubleword.]
[Figure not reproduced: Interrupt Procedure Call. The interrupt vector indexes an interrupt or trap gate in the IDT; the gate's segment selector indexes a segment descriptor in the GDT or LDT, which supplies the base address of the destination code segment; the offset from the gate is added to that base to locate the entry point of the interrupt or exception handler.]
[Figure not reproduced: Stack Usage on Transfers to Interrupt and Exception-Handling Routines. When no stack switch occurs, the processor pushes EFLAGS, CS, EIP, and (for some exceptions) an error code on the current stack, and ESP after the transfer points to the error code. When a stack switch occurs, the processor pushes SS, ESP, EFLAGS, CS, EIP, and the error code on the handler's stack.]
To return from an exception- or interrupt-handler procedure, the handler must use the IRET (or IRETD) instruction.
The IRET instruction is similar to the RET instruction except that it restores the saved flags into the EFLAGS
register. The IOPL field of the EFLAGS register is restored only if the CPL is 0. The IF flag is changed only if the CPL
is less than or equal to the IOPL. See Chapter 3, “Instruction Set Reference, A-L,” of the Intel® 64 and IA-32 Archi-
tectures Software Developer’s Manual, Volume 2A, for a description of the complete operation performed by the
IRET instruction.
If a stack switch occurred when calling the handler procedure, the IRET instruction switches back to the interrupted
procedure’s stack on the return.
7.12.1.1 Shadow Stack Usage on Transfers to Interrupt and Exception Handling Routines
When the processor performs a call to the exception- or interrupt-handler procedure:
• If the handler procedure is going to be executed at a numerically lower privilege level, a shadow stack switch
occurs. When the shadow stack switch occurs:
a. On a transfer from privilege level 3, if shadow stacks are enabled at privilege level 3 then the SSP is saved
to the IA32_PL3_SSP MSR.
b. If shadow stacks are enabled at the privilege level where the handler will execute then the shadow stack for
the handler is obtained from one of the following MSRs based on the privilege level at which the handler
executes.
• IA32_PL2_SSP if handler executes at privilege level 2.
• IA32_PL1_SSP if handler executes at privilege level 1.
• IA32_PL0_SSP if handler executes at privilege level 0.
c. The SSP obtained is then verified to ensure it points to a valid supervisory shadow stack that is not currently
active by verifying a supervisor shadow stack token at the address pointed to by the SSP. The operations
performed to verify and acquire the supervisor shadow stack token by making it busy are as described in
Section 18.2.3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
d. On this new shadow stack, the processor pushes the CS, LIP (CS.base + EIP), and SSP of the interrupted
procedure if the interrupted procedure was executing at privilege level less than 3; see Figure 7-5.1
• If the handler procedure is going to be executed at the same privilege level as the interrupted procedure and
shadow stacks are enabled at current privilege level:
a. The processor saves the current state of the CS, LIP (CS.base + EIP), and SSP registers on the current
shadow stack; see Figure 7-5.
1. If any of these pushes leads to an exception or a VM exit, the supervisor shadow-stack token remains busy.
[Figure not reproduced: with no shadow-stack switch, the processor pushes CS, LIP, and SSP on the current shadow stack. With a shadow-stack switch, the handler's supervisor shadow stack holds the supervisor shadow-stack token, followed by the pushed CS, LIP, and SSP of the interrupted procedure when that procedure was executing at a privilege level less than 3; on a transfer from privilege level 3, nothing is pushed on the new shadow stack.]
Figure 7-5. Shadow Stack Usage on Transfers to Interrupt and Exception-Handling Routines
To return from an exception- or interrupt-handler procedure, the handler must use the IRET (or IRETD) instruction.
When executing a return from an interrupt or exception handler from the same privilege level as the interrupted
procedure, the processor performs these actions to enforce return address protection:
• Restores the CS and EIP registers to their values prior to the interrupt or exception.
1. This check is not performed by execution of the INT1 instruction (opcode F1); it would be performed by execution of INT 1 (opcode
CD 01).
NOTE
Because IA-32 architecture tasks are not re-entrant, an interrupt-handler task must disable
interrupts between the time it completes handling the interrupt and the time it executes the IRET
instruction. This action prevents another interrupt from occurring while the interrupt task’s TSS is
still marked busy, which would cause a general-protection (#GP) exception.
[Figure not reproduced: Interrupt Task Switch. The interrupt vector selects a task gate in the IDT; the TSS segment selector in the task gate selects a TSS descriptor in the GDT, which provides the base address of the TSS for the interrupt-handling task.]
1. The bit is also set if the exception occurred during delivery of INT1.
[Figure not reproduced: error-code format. Bit 0 is the EXT flag, bit 1 is the IDT flag, bit 2 is the TI flag, bits 15:3 hold the segment selector index, and the upper bits are reserved.]
The segment selector index field provides an index into the IDT, GDT, or current LDT to the segment or gate
selector being referenced by the error code. In some cases the error code is null (all bits are clear except possibly
EXT). A null error code indicates that the error was not caused by a reference to a specific segment or that a null
segment selector was referenced in an operation.
The format of the error code is different for page-fault exceptions (#PF). See the “Interrupt 14—Page-Fault Excep-
tion (#PF)” section in this chapter.
The format of the error code is different for control protection exceptions (#CP). See the “Interrupt 21—Control
Protection Exception (#CP)” section in this chapter.
The error code is pushed on the stack as a doubleword or word (depending on the default interrupt, trap, or task
gate size). To keep the stack aligned for doubleword pushes, the upper half of the error code is reserved. Note that
the error code is not popped when the IRET instruction is executed to return from an exception handler, so the
handler must remove the error code before executing a return.
Error codes are not pushed on the stack for exceptions that are generated externally (with the INTR or LINT[1:0]
pins) or the INT n instruction, even if an error code is normally produced for those exceptions.
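For illustration, a small C decoder for this error-code format follows (the field names are illustrative).

    #include <stdbool.h>
    #include <stdint.h>

    struct selector_error_code {
        bool     ext;     /* bit 0:    event originated externally to the program */
        bool     idt;     /* bit 1:    index refers to a gate in the IDT */
        bool     ti;      /* bit 2:    if IDT = 0, 0 = GDT, 1 = current LDT */
        uint16_t index;   /* bits 15:3: segment or gate selector index */
    };

    static struct selector_error_code decode_selector_error(uint32_t ec)
    {
        struct selector_error_code e = {
            .ext   = ec & 1,
            .idt   = (ec >> 1) & 1,
            .ti    = (ec >> 2) & 1,
            .index = (uint16_t)((ec >> 3) & 0x1FFF),
        };
        return e;
    }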
[Figure not reproduced: 64-Bit IDT Gate Descriptors. Bytes 3:0 hold the segment selector and offset bits 15:0; bytes 7:4 hold the 3-bit IST index (bits 2:0), the 4-bit type (bits 11:8), the DPL, the P flag, and offset bits 31:16; bytes 11:8 hold offset bits 63:32; bytes 15:12 are reserved.]
In 64-bit mode, the IDT index is formed by scaling the interrupt vector by 16. The first eight bytes (bytes 7:0) of a
64-bit mode interrupt gate are similar but not identical to legacy 32-bit interrupt gates. The type field (bits 11:8 in
bytes 7:4) is described in Table 3-2. The Interrupt Stack Table (IST) field (bits 2:0 in bytes 7:4) is used by the stack
switching mechanisms described in Section 7.14.5, “Interrupt Stack Table.” Bytes 11:8 hold the upper 32 bits of
the target RIP (interrupt segment offset) in canonical form. A general-protection exception (#GP) is generated if
software attempts to reference an interrupt gate with a target RIP that is not in canonical form.
The target code segment referenced by the interrupt gate must be a 64-bit code segment (CS.L = 1, CS.D = 0). If
the target is not a 64-bit code segment, a general-protection exception (#GP) is generated with the IDT vector
number reported as the error code.
Only 64-bit interrupt and trap gates can be referenced in IA-32e mode (64-bit mode and compatibility mode).
Legacy 32-bit interrupt or trap gate types (0EH or 0FH) are redefined in IA-32e mode as 64-bit interrupt and trap
gate types. No 32-bit interrupt or trap gate type exists in IA-32e mode. If a reference is made to a 16-bit interrupt
or trap gate (06H or 07H), a general-protection exception (#GP(0)) is generated.
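A hedged C sketch of the 16-byte 64-bit gate layout described above follows; the field names are illustrative, and the bit-field packing shown assumes the LSB-first allocation used by GCC and Clang on x86.

    #include <stdint.h>

    struct __attribute__((packed)) idt_gate64 {
        uint16_t offset_15_0;     /* target RIP bits 15:0 */
        uint16_t selector;        /* must select a 64-bit code segment */
        uint16_t ist        : 3;  /* interrupt stack table index (0 = none) */
        uint16_t zero       : 5;
        uint16_t type       : 4;  /* 0EH = interrupt gate, 0FH = trap gate */
        uint16_t zero2      : 1;
        uint16_t dpl        : 2;
        uint16_t p          : 1;  /* present */
        uint16_t offset_31_16;    /* target RIP bits 31:16 */
        uint32_t offset_63_32;    /* target RIP bits 63:32 */
        uint32_t reserved;
    };

    _Static_assert(sizeof(struct idt_gate64) == 16, "64-bit IDT gates are 16 bytes");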
Aligning the stack permits exception and interrupt frames to be aligned on a 16-byte boundary before interrupts
are re-enabled. This allows the stack to be formatted for optimal storage of 16-byte XMM registers, which enables
the interrupt handler to use faster 16-byte aligned loads and stores (MOVAPS rather than MOVUPS) to save and
restore XMM registers.
Although the RSP alignment is always performed when LMA = 1, it is only of consequence for the kernel-mode case
where there is no stack switch or IST used. For a stack switch or IST, the OS would have presumably put suitably
aligned RSP values in the TSS.
[Figure not reproduced: after a privilege-level change, the 32-bit handler stack holds SS (+20), ESP (+16), EFLAGS (+12), CS (+8), EIP (+4), and the error code (+0), while the IA-32e handler stack holds SS (+40), RSP (+32), RFLAGS (+24), CS (+16), RIP (+8), and the error code (+0); in both cases the stack pointer after the transfer points to the error code.]
Figure 7-9. IA-32e Mode Stack Usage After Privilege Level Change
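A hedged sketch of the IA-32e frame from Figure 7-9 as a C structure follows; RSP after the transfer points at the error code, i.e., at offset 0 of this layout (for events that push no error code, RIP is at offset 0 instead).

    #include <stdint.h>

    /* IA-32e interrupt/exception stack frame for an exception that pushes an
       error code; offsets are from RSP after the transfer, as in Figure 7-9. */
    struct interrupt_frame64 {
        uint64_t error_code;   /* +0  */
        uint64_t rip;          /* +8  */
        uint64_t cs;           /* +16 */
        uint64_t rflags;       /* +24 */
        uint64_t rsp;          /* +32 */
        uint64_t ss;           /* +40 */
    };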
[Figure not reproduced: the interrupt shadow-stack table located by the IA32_INTERRUPT_SSP_TABLE MSR contains eight 8-byte entries; the entry at offset 0 is not used (available to software), and the entries at offsets 8 through 56 hold the shadow-stack pointers selected by IST1 through IST7.]
Description
Indicates the divisor operand for a DIV or IDIV instruction is 0 or that the result cannot be represented in the
number of bits specified for the destination operand.
Exception Class Trap or Fault. The exception handler can distinguish between traps and faults by exam-
ining the contents of DR6 and the other debug registers.
Description
Indicates that one or more of several debug-exception conditions has been detected. Whether the exception is a
fault or a trap depends on the condition (see Table 7-3). See Chapter 19, “Debug, Branch Profile, TSC, and Intel®
Resource Director Technology (Intel® RDT) Features,” for detailed information about the debug exceptions.
and then delivers a #DB. See Section 17.3.7, “RTM-Enabled Debugger Support,” of Intel® 64 and IA-32 Architec-
tures Software Developer’s Manual, Volume 1.
Description
The nonmaskable interrupt (NMI) is generated externally by asserting the processor’s NMI pin or through an NMI
request set by the I/O APIC to the local APIC. This interrupt causes the NMI interrupt handler to be called.
Description
Indicates that a breakpoint instruction (INT3, opcode CC) was executed, causing a breakpoint trap to be gener-
ated. Typically, a debugger sets a breakpoint by replacing the first opcode byte of an instruction with the opcode for
the INT3 instruction. (The INT3 instruction is one byte long, which makes it easy to replace an opcode in a code
segment in RAM with the breakpoint opcode.) The operating system or a debugging tool can use a data segment
mapped to the same physical address space as the code segment to place an INT3 instruction in places where it is
desired to call the debugger.
With the P6 family, Pentium, Intel486, and Intel386 processors, it is more convenient to set breakpoints with the
debug registers. (See Section 19.3.2, “Breakpoint Exception (#BP)—Interrupt Vector 3,” for information about the
breakpoint exception.) If more breakpoints are needed beyond what the debug registers allow, the INT3 instruction
can be used.
Any breakpoint exception inside an RTM region causes a transactional abort and, by default, redirects control flow
to the fallback instruction address. If advanced debugging of RTM transactional regions has been enabled, any
transactional abort due to a breakpoint exception instead causes execution to roll back to just before the XBEGIN
instruction and then delivers a debug exception (#DB) — not a breakpoint exception. See Section 17.3.7, “RTM-
Enabled Debugger Support,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.
A breakpoint exception can also be generated by executing the INT n instruction with an operand of 3. The action
of this instruction (INT 3) is slightly different than that of the INT3 instruction (see “INT n/INTO/INT3/INT1—Call to
Interrupt Procedure” in Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume
2A).
Description
Indicates that an overflow trap occurred when an INTO instruction was executed. The INTO instruction checks the
state of the OF flag in the EFLAGS register. If the OF flag is set, an overflow trap is generated.
Some arithmetic instructions (such as the ADD and SUB) perform both signed and unsigned arithmetic. These
instructions set the OF and CF flags in the EFLAGS register to indicate signed overflow and unsigned overflow,
respectively. When performing arithmetic on signed operands, the OF flag can be tested directly or the INTO
instruction can be used. The benefit of using the INTO instruction is that if the overflow exception is detected, an
exception handler can be called automatically to handle the overflow condition.
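A hedged C sketch of the "test the overflow condition directly" alternative: __builtin_add_overflow (a GCC/Clang builtin) reports the same condition that ADD records in the OF flag. An assembly-language program could instead follow the ADD with INTO so the #OF handler is invoked automatically (INTO is not available in 64-bit mode).

    #include <stdio.h>

    /* Signed addition with an explicit overflow check; the builtin's result
       corresponds to the OF flag that an ADD would set. */
    static int checked_add(int a, int b, int *sum)
    {
        if (__builtin_add_overflow(a, b, sum)) {
            /* An overflow handler would run here, analogous to the #OF handler. */
            fprintf(stderr, "signed overflow: %d + %d\n", a, b);
            return -1;
        }
        return 0;
    }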
Description
Indicates that a BOUND-range-exceeded fault occurred when a BOUND instruction was executed. The BOUND
instruction checks that a signed array index is within the upper and lower bounds of an array located in memory. If
the array index is not within the bounds of the array, a BOUND-range-exceeded fault is generated.
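The following C sketch mirrors the check that BOUND performs: the bounds pair resides in memory as a signed (lower, upper) pair and the signed index must fall within the closed range; abort() stands in for the #BR fault. This is an illustrative software equivalent, not the instruction itself.

    #include <stdint.h>
    #include <stdlib.h>

    /* Bounds pair as BOUND expects it in memory: lower limit, then upper limit. */
    typedef struct { int32_t lower; int32_t upper; } bounds32_t;

    static void bound_check(int32_t index, const bounds32_t *b)
    {
        if (index < b->lower || index > b->upper)
            abort();   /* stands in for the BOUND-range-exceeded fault (#BR) */
    }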
Description
Indicates that the processor did one of the following things:
• Attempted to execute an invalid or reserved opcode.
• Attempted to execute an instruction with an operand type that is invalid for its accompanying opcode; for
example, the source operand for a LES instruction is not a memory location.
• Attempted to execute an MMX or SSE/SSE2/SSE3 instruction on an Intel 64 or IA-32 processor that does not
support the MMX technology or SSE/SSE2/SSE3/SSSE3 extensions, respectively. CPUID feature flags MMX (bit
23), SSE (bit 25), SSE2 (bit 26), SSE3 (ECX, bit 0), SSSE3 (ECX, bit 9) indicate support for these extensions.
• Attempted to execute an MMX instruction or SSE/SSE2/SSE3/SSSE3 SIMD instruction (with the exception of
the MOVNTI, PAUSE, PREFETCHh, SFENCE, LFENCE, MFENCE, CLFLUSH, MONITOR, and MWAIT instructions)
when the EM flag in control register CR0 is set (1).
• Attempted to execute an SSE/SSE2/SSE3/SSSE3 instruction when the OSFXSR bit in control register CR4 is
clear (0). Note this does not include the following SSE/SSE2/SSE3 instructions: MASKMOVQ, MOVNTQ,
MOVNTI, PREFETCHh, SFENCE, LFENCE, MFENCE, and CLFLUSH; or the 64-bit versions of the PAVGB, PAVGW,
PEXTRW, PINSRW, PMAXSW, PMAXUB, PMINSW, PMINUB, PMOVMSKB, PMULHUW, PSADBW, PSHUFW, PADDQ,
PSUBQ, PALIGNR, PABSB, PABSD, PABSW, PHADDD, PHADDSW, PHADDW, PHSUBD, PHSUBSW, PHSUBW,
PMADDUBSW, PMULHRSW, PSHUFB, PSIGNB, PSIGND, and PSIGNW.
• Attempted to execute an SSE/SSE2/SSE3/SSSE3 instruction on an Intel 64 or IA-32 processor that caused a
SIMD floating-point exception when the OSXMMEXCPT bit in control register CR4 is clear (0).
• Executed a UD0, UD1 or UD2 instruction. Note that even though it is the execution of the UD0, UD1 or UD2
instruction that causes the invalid opcode exception, the saved instruction pointer still points to the UD0,
UD1 or UD2 instruction.
• Detected a LOCK prefix that precedes an instruction that may not be locked or one that may be locked but the
destination operand is not a memory location.
• Attempted to execute an LLDT, SLDT, LTR, STR, LSL, LAR, VERR, VERW, or ARPL instruction while in real-
address or virtual-8086 mode.
• Attempted to execute the RSM instruction when not in SMM mode.
In Intel 64 and IA-32 processors that implement out-of-order execution microarchitectures, this exception is not
generated until an attempt is made to retire the result of executing an invalid instruction; that is, decoding and
speculatively attempting to execute an invalid opcode does not generate this exception. Likewise, in the Pentium
processor and earlier IA-32 processors, this exception is not generated as the result of prefetching and preliminary
decoding of an invalid instruction. (See Section 7.5, “Exception Classifications,” for general rules for taking of inter-
rupts and exceptions.)
The opcodes D6 and F1 are undefined opcodes reserved by the Intel 64 and IA-32 architectures. These opcodes,
even though undefined, do not generate an invalid opcode exception.
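Because the feature flags named above are reported by CPUID leaf 01H, software can avoid #UD by testing them before executing the corresponding instructions. A minimal sketch using the GCC/Clang <cpuid.h> helper, shown for SSSE3 (ECX bit 9); the other flags (MMX = EDX[23], SSE = EDX[25], SSE2 = EDX[26], SSE3 = ECX[0]) are tested the same way:

    #include <cpuid.h>      /* GCC/Clang wrapper for the CPUID instruction */
    #include <stdbool.h>

    static bool cpu_has_ssse3(void)
    {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return false;               /* leaf 1 not supported */
        return (ecx & (1u << 9)) != 0;  /* SSSE3 feature flag */
    }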
Description
Indicates that the device-not-available exception was generated by one of the following three conditions:
• The processor executed an x87 FPU floating-point instruction while the EM flag in control register CR0 was set
(1). See the paragraph below for the special case of the WAIT/FWAIT instruction.
• The processor executed a WAIT/FWAIT instruction while the MP and TS flags of register CR0 were set,
regardless of the setting of the EM flag.
• The processor executed an x87 FPU, MMX, or SSE/SSE2/SSE3 instruction (with the exception of MOVNTI,
PAUSE, PREFETCHh, SFENCE, LFENCE, MFENCE, and CLFLUSH) while the TS flag in control register CR0 was set
and the EM flag is clear.
The EM flag is set when the processor does not have an internal x87 FPU floating-point unit. A device-not-available
exception is then generated each time an x87 FPU floating-point instruction is encountered, allowing an exception
handler to call floating-point instruction emulation routines.
The TS flag indicates that a context switch (task switch) has occurred since the last time an x87 floating-point,
MMX, or SSE/SSE2/SSE3 instruction was executed; but that the context of the x87 FPU, XMM, and MXCSR registers
was not saved. When the TS flag is set and the EM flag is clear, the processor generates a device-not-available
exception each time an x87 floating-point, MMX, or SSE/SSE2/SSE3 instruction is encountered (with the exception
of the instructions listed above). The exception handler can then save the context of the x87 FPU, XMM, and MXCSR
registers before it executes the instruction. See Section 2.5, “Control Registers,” for more information about the TS
flag.
The MP flag in control register CR0 is used along with the TS flag to determine if WAIT or FWAIT instructions should
generate a device-not-available exception. It extends the function of the TS flag to the WAIT and FWAIT instruc-
tions, giving the exception handler an opportunity to save the context of the x87 FPU before the WAIT or FWAIT
instruction is executed. The MP flag is provided primarily for use with the Intel 286 and Intel386 DX processors. For
programs running on the Pentium 4, Intel Xeon, P6 family, Pentium, or Intel486 DX processors, or the Intel 487 SX
coprocessors, the MP flag should always be set; for programs running on the Intel486 SX processor, the MP flag
should be clear.
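A minimal sketch of the lazy save/restore scheme that the TS flag enables, assuming hypothetical kernel variables current_task and fpu_owner: the kernel sets CR0.TS at each task switch, and the #NM handler clears TS with CLTS, saves the previous owner's state with FXSAVE, and restores the new owner's state with FXRSTOR before returning so the faulting instruction re-executes (GCC-style inline assembly, ring 0 only).

    #include <stdint.h>

    struct task {
        uint8_t fpu_state[512] __attribute__((aligned(16)));   /* FXSAVE area */
    };

    extern struct task *current_task;   /* assumption: task now running */
    extern struct task *fpu_owner;      /* assumption: task whose FPU state is live */

    void handle_device_not_available(void)
    {
        __asm__ volatile("clts");       /* clear CR0.TS so FPU instructions proceed */

        if (fpu_owner && fpu_owner != current_task)
            __asm__ volatile("fxsave %0" : "=m"(fpu_owner->fpu_state));
        if (fpu_owner != current_task)
            __asm__ volatile("fxrstor %0" :: "m"(current_task->fpu_state));

        fpu_owner = current_task;
        /* Return from the fault; the interrupted instruction re-executes. */
    }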
Description
Indicates that the processor detected a second exception while calling an exception handler for a prior exception.
Normally, when the processor detects another exception while trying to call an exception handler, the two excep-
tions can be handled serially. If, however, the processor cannot handle them serially, it signals the double-fault
exception. To determine when two faults need to be signalled as a double fault, the processor divides the excep-
tions into three classes: benign exceptions, contributory exceptions, and page faults (see Table 7-4).
Table 7-5 shows the various combinations of exception classes that cause a double fault to be generated. A double-
fault exception falls in the abort class of exceptions. The program or task cannot be restarted or resumed. The
double-fault handler can be used to collect diagnostic information about the state of the machine and/or, when
possible, to shut the application and/or system down gracefully or restart the system.
A segment or page fault may be encountered while prefetching instructions; however, this behavior is outside the
domain of Table 7-5. Any further faults generated while the processor is attempting to transfer control to the appro-
priate fault handler could still lead to a double-fault sequence.
                        Second Exception: Benign        Second Exception: Contributory   Second Exception: Page Fault
First Exception
Contributory            Handle Exceptions Serially      Generate a Double Fault          Handle Exceptions Serially
Page Fault              Handle Exceptions Serially      Generate a Double Fault          Generate a Double Fault
Double Fault            Handle Exceptions Serially      Enter Shutdown Mode              Enter Shutdown Mode
If another contributory or page fault exception occurs while attempting to call the double-fault handler, the
processor enters shutdown mode. This mode is similar to the state following execution of an HLT instruction. In this
mode, the processor stops executing instructions until an NMI interrupt, SMI interrupt, hardware reset, or INIT# is
received. The processor generates a special bus cycle to indicate that it has entered shutdown mode. Software
designers may need to be aware of the response of hardware when it goes into shutdown mode. For example, hard-
ware may turn on an indicator light on the front panel, generate an NMI interrupt to record diagnostic information,
invoke reset initialization, generate an INIT initialization, or generate an SMI. If any events are pending during
shutdown, they will be handled after a wake event from shutdown is processed (for example, A20M# interrupts).
If a shutdown occurs while the processor is executing an NMI interrupt handler, then only a hardware reset can
restart the processor. Likewise, if the shutdown occurs while executing in SMM, a hardware reset must be used to
restart the processor.
Exception Class Abort. (Intel reserved; do not use. Recent IA-32 processors do not generate this
exception.)
Description
Indicates that an Intel386 CPU-based system with an Intel 387 math coprocessor detected a page or segment
violation while transferring the middle portion of an Intel 387 math coprocessor operand. The P6 family, Pentium,
and Intel486 processors do not generate this exception; instead, this condition is detected with a general protec-
tion exception (#GP), interrupt 13.
Description
Indicates that there was an error related to a TSS. Such an error might be detected during a task switch or during
the execution of instructions that use information from a TSS. Table 7-6 shows the conditions that cause an invalid
TSS exception to be generated.
This exception can be generated either in the context of the original task or in the context of the new task (see Section
9.3, “Task Switching”). Until the processor has completely verified the presence of the new TSS, the exception is
generated in the context of the original task. Once the existence of the new TSS is verified, the task switch is
considered complete. Any invalid-TSS conditions detected after this point are handled in the context of the new
task. (A task switch is considered complete when the task register is loaded with the segment selector for the new
TSS and, if the switch is due to a procedure call or interrupt, the previous task link field of the new TSS references
the old TSS.)
The invalid-TSS handler must be a task called using a task gate. Handling this exception inside the faulting TSS
context is not recommended because the processor state may not be consistent.
Description
Indicates that the present flag of a segment or gate descriptor is clear. The processor can generate this exception
during any of the following operations:
• While attempting to load CS, DS, ES, FS, or GS registers. [Detection of a not-present segment while loading the
SS register causes a stack fault exception (#SS) to be generated.] This situation can occur while performing a
task switch.
• While attempting to load the LDTR using an LLDT instruction. Detection of a not-present LDT while loading the
LDTR during a task switch operation causes an invalid-TSS exception (#TS) to be generated.
• When executing the LTR instruction and the TSS is marked not present.
• While attempting to use a gate descriptor or TSS that is marked segment-not-present, but is otherwise valid.
An operating system typically uses the segment-not-present exception to implement virtual memory at the
segment level. If the exception handler loads the segment and returns, the interrupted program or task resumes
execution.
A not-present indication in a gate descriptor, however, does not indicate that a segment is not present (because
gates do not correspond to segments). The operating system may use the present flag for gate descriptors to
trigger exceptions of special significance to the operating system.
A contributory exception or page fault that subsequently referenced a not-present segment would cause a double
fault (#DF) to be generated instead of #NP.
occurs. If it occurs after the commit point, the processor will load all the state information from the new TSS
(without performing any additional limit, present, or type checks) before it generates the exception. The segment-
not-present exception handler should not rely on being able to use the segment selectors found in the CS, SS, DS,
ES, FS, and GS registers without causing another exception. (See the Program State Change description for “Inter-
rupt 10—Invalid TSS Exception (#TS)” in this chapter for additional information on how to handle this situation.)
Description
Indicates that one of the following stack related conditions was detected:
• A limit violation is detected during an operation that refers to the SS register. Operations that can cause a limit
violation include stack-oriented instructions such as POP, PUSH, CALL, RET, IRET, ENTER, and LEAVE, as well as
other memory references which implicitly or explicitly use the SS register (for example, MOV AX, [BP+6] or
MOV AX, SS:[EAX+6]). The ENTER instruction generates this exception when there is not enough stack space
for allocating local variables.
• A not-present stack segment is detected when attempting to load the SS register. This violation can occur
during the execution of a task switch, a CALL instruction to a different privilege level, a return to a different
privilege level, an LSS instruction, or a MOV or POP instruction to the SS register.
• A canonical violation is detected in 64-bit mode during an operation that references memory using the stack
pointer register containing a non-canonical memory address.
Recovery from this fault is possible by either extending the limit of the stack segment (in the case of a limit viola-
tion) or loading the missing stack segment into memory (in the case of a not-present violation).
In the case of a canonical violation that was caused intentionally by software, recovery is possible by loading the
correct canonical value into RSP. Otherwise, a canonical violation of the address in RSP likely reflects some register
corruption in the software.
Description
Indicates that the processor detected one of a class of protection violations called “general-protection violations.”
The conditions that cause this exception to be generated comprise all the protection violations that do not cause
other exceptions to be generated (such as, invalid-TSS, segment-not-present, stack-fault, or page-fault excep-
tions). The following conditions cause general-protection exceptions to be generated:
• Exceeding the segment limit when accessing the CS, DS, ES, FS, or GS segments.
• Exceeding the segment limit when referencing a descriptor table (except during a task switch or a stack
switch).
• Transferring execution to a segment that is not executable.
• Writing to a code segment or a read-only data segment.
• Reading from an execute-only code segment.
• Loading the SS register with a segment selector for a read-only segment (unless the selector comes from a TSS
during a task switch, in which case an invalid-TSS exception occurs).
• Loading the SS, DS, ES, FS, or GS register with a segment selector for a system segment.
• Loading the DS, ES, FS, or GS register with a segment selector for an execute-only code segment.
• Loading the SS register with the segment selector of an executable segment or a null segment selector.
• Loading the CS register with a segment selector for a data segment or a null segment selector.
• Accessing memory using the DS, ES, FS, or GS register when it contains a null segment selector.
• Switching to a busy task during a call or jump to a TSS.
• Using a segment selector on a non-IRET task switch that points to a TSS descriptor in the current LDT. TSS
descriptors can only reside in the GDT. This condition causes a #TS exception during an IRET task switch.
• Violating any of the privilege rules described in Chapter 6, “Protection.”
• Exceeding the instruction length limit of 15 bytes (this only can occur when redundant prefixes are placed
before an instruction).
• Loading the CR0 register with a set PG flag (paging enabled) and a clear PE flag (protection disabled).
• Loading the CR0 register with a set NW flag and a clear CD flag.
• Referencing an entry in the IDT (following an interrupt or exception) that is not an interrupt, trap, or task gate.
• Attempting to access an interrupt or exception handler through an interrupt or trap gate from virtual-8086
mode when the handler’s code segment DPL is greater than 0.
• Attempting to write a 1 into a reserved bit of CR4.
• Attempting to execute a privileged instruction when the CPL is not equal to 0 (see Section 6.9, “Privileged
Instructions,” for a list of privileged instructions).
• Attempting to execute SGDT, SIDT, SLDT, SMSW, or STR when CR4.UMIP = 1 and the CPL is not equal to 0.
• Writing to a reserved bit in an MSR.
• Accessing a gate that contains a null segment selector.
• Executing the INT n instruction when the CPL is greater than the DPL of the referenced interrupt, trap, or task
gate.
• The segment selector in a call, interrupt, or trap gate does not point to a code segment.
• The segment selector operand in the LLDT instruction is a local type (TI flag is set) or does not point to a
segment descriptor of the LDT type.
• The segment selector operand in the LTR instruction is local or points to a TSS that is not available.
• The target code-segment selector for a call, jump, or return is null.
• If the PAE and/or PSE flag in control register CR4 is set and the processor detects any reserved bits in a page-
directory-pointer-table entry set to 1. These bits are checked during a write to control registers CR0, CR3, or
CR4 that causes a reloading of the page-directory-pointer-table entry.
• Attempting to write a non-zero value into the reserved bits of the MXCSR register.
• Executing an SSE/SSE2/SSE3 instruction that attempts to access a 128-bit memory location that is not aligned
on a 16-byte boundary when the instruction requires 16-byte alignment. This condition also applies to the stack
segment.
A program or task can be restarted following any general-protection exception. If the exception occurs while
attempting to call an interrupt handler, the interrupted program can be restartable, but the interrupt may be lost.
• If the segment descriptor pointed to by the segment selector in the destination operand is a code segment and
it has both the D-bit and the L-bit set.
• If the segment descriptor from a 64-bit call gate is in non-canonical space.
• If the DPL from a 64-bit call-gate is less than the CPL or than the RPL of the 64-bit call-gate.
• If the type field of the upper 64 bits of a 64-bit call gate is not 0.
• If an attempt is made to load a null selector in the SS register in compatibility mode.
• If an attempt is made to load a null selector in the SS register in CPL3 and 64-bit mode.
• If an attempt is made to load a null selector in the SS register in non-CPL3 and 64-bit mode where RPL is not
equal to CPL.
• If an attempt is made to clear CR0.PG while IA-32e mode is enabled.
• If an attempt is made to set a reserved bit in CR3, CR4 or CR8.
Description
Indicates that, with paging enabled (the PG flag in the CR0 register is set), the processor detected one of the
following conditions while using the page-translation mechanism to translate a linear address to a physical
address:
• The P (present) flag in a page-directory or page-table entry needed for the address translation is clear,
indicating that a page table or the page containing the operand is not present in physical memory.
• The procedure does not have sufficient privilege to access the indicated page (that is, a procedure running in
user mode attempts to access a supervisor-mode page). If the SMAP flag is set in CR4, a page fault may also
be triggered by code running in supervisor mode that tries to access data at a user-mode address. If either the
PKE flag or the PKS flag is set in CR4, the protection-key rights registers may cause page faults on data
accesses to linear addresses with certain protection keys.
• Code running in user mode attempts to write to a read-only page. If the WP flag is set in CR0, the page fault
will also be triggered by code running in supervisor mode that tries to write to a read-only page.
• An instruction fetch to a linear address that translates to a physical address in a memory page with the
execute-disable bit set (for information about the execute-disable bit, see Chapter 5, “Paging”). If the SMEP
flag is set in CR4, a page fault will also be triggered by code running in supervisor mode that tries to fetch an
instruction from a user-mode address.
• One or more reserved bits in a paging-structure entry are set to 1. See the description below of the RSVD error code flag.
• A shadow-stack access is made to a page that is not a shadow-stack page. See Section 18.2, “Shadow Stacks,”
in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, and Section 5.6, “Access
Rights.”
• An enclave access violates one of the specified access-control requirements. See Section 36.3, “Access-control
Requirements,” and Section 36.20, “Enclave Page Cache Map (EPCM),” in Chapter 36, “Enclave Access Control
and Data Structures.” In this case, the exception is called an SGX-induced page fault. The processor uses the
error code (below) to distinguish SGX-induced page faults from ordinary page faults.
The exception handler can recover from page-not-present conditions and restart the program or task without any
loss of program continuity. It can also restart the program or task after a privilege violation, but the problem that
caused the privilege violation may be uncorrectable.
See also: Section 5.7, “Page-Fault Exceptions.”
Page-fault error code fields (bits not listed are reserved):
    P (bit 0), W/R (bit 1), U/S (bit 2), RSVD (bit 3), I/D (bit 4),
    PK (bit 5), SS (bit 6), HLAT (bit 7), SGX (bit 15)
HLAT  0  The fault occurred during ordinary paging or due to access rights.
      1  The fault occurred during HLAT paging.
• The contents of the CR2 register. The processor loads the CR2 register with the linear address that generated
the exception. If linear-address masking had been in effect (Section 4.4), the address recorded reflects the
result of that masking and does not contain any masked metadata. If the page-fault exception occurred during
execution of an instruction in enclave mode (and not during delivery of an event incident to enclave mode), bits
11:0 of the address are cleared.
The page-fault handler can use this address to locate the corresponding paging-structure entries. Another page
fault can potentially occur during execution of the page-fault handler; the handler should save the contents of
the CR2 register before a second page fault can occur.1 If a page fault is caused by a page-level protection
violation, the accessed flags in paging-structure entries may be set when the fault occurs (behavior is model-
specific and not architecturally defined).
1. Processors update CR2 whenever a page fault is detected. If a second page fault occurs while an earlier page fault is being deliv-
ered, the faulting linear address of the second fault will overwrite the contents of CR2 (replacing the previous address). These
updates to CR2 occur even if the page fault results in a double fault or occurs during the delivery of a double fault.
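A sketch of how a page-fault handler might decode the error code whose fields are listed above (bit positions: P = 0, W/R = 1, U/S = 2, RSVD = 3, I/D = 4, PK = 5, SS = 6, HLAT = 7, SGX = 15); the struct and function names are illustrative only.

    #include <stdint.h>
    #include <stdbool.h>

    struct pf_info {
        bool present;       /* 0 = non-present page, 1 = page-level protection violation */
        bool write;         /* access was a write */
        bool user;          /* access originated at CPL 3 */
        bool rsvd;          /* reserved bit set in a paging-structure entry */
        bool instr_fetch;   /* fault on an instruction fetch */
        bool prot_key;      /* protection-key violation */
        bool shadow_stack;  /* shadow-stack access */
        bool hlat;          /* fault occurred during HLAT paging */
        bool sgx;           /* SGX-induced page fault */
    };

    static struct pf_info decode_pf_error_code(uint64_t ec)
    {
        return (struct pf_info){
            .present      = ec & (1u << 0),
            .write        = ec & (1u << 1),
            .user         = ec & (1u << 2),
            .rsvd         = ec & (1u << 3),
            .instr_fetch  = ec & (1u << 4),
            .prot_key     = ec & (1u << 5),
            .shadow_stack = ec & (1u << 6),
            .hlat         = ec & (1u << 7),
            .sgx          = ec & (1u << 15),
        };
    }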
Description
Indicates that the x87 FPU has detected a floating-point error. The NE flag in the register CR0 must be set for an
interrupt 16 (floating-point error exception) to be generated. (See Section 2.5, “Control Registers,” for a detailed
description of the NE flag.)
NOTE
SIMD floating-point exceptions (#XM) are signaled through interrupt 19.
While executing x87 FPU instructions, the x87 FPU detects and reports six types of floating-point error conditions:
• Invalid operation (#I)
— Stack overflow or underflow (#IS)
— Invalid arithmetic operation (#IA)
• Divide-by-zero (#Z)
• Denormalized operand (#D)
• Numeric overflow (#O)
• Numeric underflow (#U)
• Inexact result (precision) (#P)
Each of these error conditions represents an x87 FPU exception type, and for each exception type, the x87 FPU
provides a flag in the x87 FPU status register and a mask bit in the x87 FPU control register. If the x87 FPU detects
a floating-point error and the mask bit for the exception type is set, the x87 FPU handles the exception automati-
cally by generating a predefined (default) response and continuing program execution. The default responses have
been designed to provide a reasonable result for most floating-point applications.
If the mask for the exception is clear and the NE flag in register CR0 is set, the x87 FPU does the following:
1. Sets the necessary flag in the FPU status register.
2. Waits until the next “waiting” x87 FPU instruction or WAIT/FWAIT instruction is encountered in the program’s
instruction stream.
3. Generates an internal error signal that causes the processor to generate a floating-point exception (#MF).
Prior to executing a waiting x87 FPU instruction or the WAIT/FWAIT instruction, the x87 FPU checks for pending x87
FPU floating-point exceptions (as described in step 2 above). Pending x87 FPU floating-point exceptions are
ignored for “non-waiting” x87 FPU instructions, which include the FNINIT, FNCLEX, FNSTSW, FNSTSW AX, FNSTCW,
FNSTENV, and FNSAVE instructions. Pending x87 FPU exceptions are also ignored when executing the state
management instructions FXSAVE and FXRSTOR.
All of the x87 FPU floating-point error conditions can be recovered from. The x87 FPU floating-point-error exception
handler can determine the error condition that caused the exception from the settings of the flags in the x87 FPU
status word. See “Software Exception Handling” in Chapter 8 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 1, for more information on handling x87 FPU floating-point exceptions.
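For perspective, a portable C analogue of inspecting the sticky exception flags uses the standard <fenv.h> interface rather than reading the x87 status word directly with FNSTSW; the mapping of the C macros to the error conditions named above is noted in the comments. This is a hedged illustration of flag inspection, not an #MF handler.

    #include <fenv.h>
    #include <stdio.h>

    static void report_fp_errors(void)
    {
        int raised = fetestexcept(FE_ALL_EXCEPT);
        if (raised & FE_INVALID)   puts("invalid operation (#I)");
        if (raised & FE_DIVBYZERO) puts("divide-by-zero (#Z)");
        if (raised & FE_OVERFLOW)  puts("numeric overflow (#O)");
        if (raised & FE_UNDERFLOW) puts("numeric underflow (#U)");
        if (raised & FE_INEXACT)   puts("inexact result (#P)");
        feclearexcept(FE_ALL_EXCEPT);   /* clear the sticky flags */
    }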
register. See Section 8.1.8, “x87 FPU Instruction and Data (Operand) Pointers,” in Chapter 8 of the Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volume 1, for more information about information the FPU saves
for use in handling floating-point-error exceptions.
Description
Indicates that the processor detected an unaligned memory operand when alignment checking was enabled. Align-
ment checks are only carried out in data (or stack) accesses (not in code fetches or system segment accesses). An
example of an alignment-check violation is a word stored at an odd byte address, or a doubleword stored at an
address that is not an integer multiple of 4. Table 7-7 lists the alignment requirements of various data types recog-
nized by the processor.
Note that the alignment check exception (#AC) is generated only for data types that must be aligned on word,
doubleword, and quadword boundaries. A general-protection exception (#GP) is generated for 128-bit data types
that are not aligned on a 16-byte boundary.
To enable alignment checking, the following conditions must be true:
• AM flag in CR0 register is set.
• AC flag in the EFLAGS register is set.
• The CPL is 3 (including virtual-8086 mode).
Alignment-check exceptions (#AC) are generated only when operating at privilege level 3 (user mode). Memory
references that default to privilege level 0, such as segment descriptor loads, do not generate alignment-check
exceptions, even when caused by a memory reference made from privilege level 3.
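A hedged sketch of setting or clearing EFLAGS.AC (bit 18) from user code with GCC/Clang inline assembly on x86-64; misaligned data accesses then raise #AC only if the operating system has also set CR0.AM and the code is running at CPL 3, as listed above.

    #include <stdint.h>

    static void set_eflags_ac(int enable)
    {
        uint64_t flags;
        __asm__ volatile("pushfq\n\tpopq %0" : "=r"(flags));  /* read RFLAGS */
        if (enable)
            flags |=  (1ull << 18);                           /* set AC */
        else
            flags &= ~(1ull << 18);                           /* clear AC */
        __asm__ volatile("pushq %0\n\tpopfq" :: "r"(flags) : "cc");
    }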
Storing the contents of the GDTR, IDTR, LDTR, or task register in memory while at privilege level 3 can generate
an alignment-check exception. Although application programs do not normally store these registers, the fault can
be avoided by aligning the information stored on an even word-address.
The FXSAVE/XSAVE and FXRSTOR/XRSTOR instructions save and restore a 512-byte data structure, the first byte
of which must be aligned on a 16-byte boundary. If the alignment-check exception (#AC) is enabled when
executing these instructions (and CPL is 3), a misaligned memory operand can cause either an alignment-check
exception or a general-protection exception (#GP) depending on the processor implementation (see “FXSAVE-Save
x87 FPU, MMX, SSE, and SSE2 State” and “FXRSTOR-Restore x87 FPU, MMX, SSE, and SSE2 State” in Chapter 3
of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A; see “XSAVE—Save Processor
Extended States” and “XRSTOR—Restore Processor Extended States” in Chapter 6 of the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 2D).
The MOVDQU, MOVUPS, and MOVUPD instructions perform 128-bit unaligned loads or stores. The LDDQU instruc-
tion loads 128-bit unaligned data. They do not generate general-protection exceptions (#GP) when operands are
not aligned on a 16-byte boundary. If alignment checking is enabled, alignment-check exceptions (#AC) may or
may not be generated depending on processor implementation when data addresses are not aligned on an 8-byte
boundary.
FSAVE and FRSTOR instructions can generate unaligned references, which can cause alignment-check faults.
These instructions are rarely needed by application programs.
Description
Indicates that the processor detected an internal machine error or a bus error, or that an external agent detected
a bus error. The machine-check exception is model-specific, available on the Pentium and later generations of
processors. The implementation of the machine-check exception is different between different processor families,
and these implementations may not be compatible with future Intel 64 or IA-32 processors. (Use the CPUID
instruction to determine whether this feature is present.)
Bus errors detected by external agents are signaled to the processor on dedicated pins: the BINIT# and MCERR#
pins on the Pentium 4, Intel Xeon, and P6 family processors and the BUSCHK# pin on the Pentium processor. When
one of these pins is enabled, asserting the pin causes error information to be loaded into machine-check registers
and a machine-check exception is generated.
The machine-check exception and machine-check architecture are discussed in detail in Chapter 17, “Machine-
Check Architecture.” Also, see the data books for the individual processors for processor-specific hardware infor-
mation.
Description
Indicates the processor has detected an SSE/SSE2/SSE3 SIMD floating-point exception. The appropriate status
flag in the MXCSR register must be set and the particular exception unmasked for this interrupt to be generated.
There are six classes of numeric exception conditions that can occur while executing an SSE/SSE2/SSE3 SIMD
floating-point instruction:
• Invalid operation (#I)
• Divide-by-zero (#Z)
• Denormal operand (#D)
• Numeric overflow (#O)
• Numeric underflow (#U)
• Inexact result (Precision) (#P)
The invalid operation, divide-by-zero, and denormal-operand exceptions are pre-computation exceptions; that is,
they are detected before any arithmetic operation occurs. The numeric underflow, numeric overflow, and inexact
result exceptions are post-computational exceptions.
See “SIMD Floating-Point Exceptions” in Chapter 11 of the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 1, for additional information about the SIMD floating-point exception classes.
When a SIMD floating-point exception occurs, the processor does either of the following things:
• It handles the exception automatically by producing the most reasonable result and allowing program
execution to continue undisturbed. This is the response to masked exceptions.
• It generates a SIMD floating-point exception, which in turn invokes a software exception handler. This is the
response to unmasked exceptions.
Each of the six SIMD floating-point exception conditions has a corresponding flag bit and mask bit in the MXCSR
register. If an exception is masked (the corresponding mask bit in the MXCSR register is set), the processor takes
an appropriate automatic default action and continues with the computation. If the exception is unmasked (the
corresponding mask bit is clear) and the operating system supports SIMD floating-point exceptions (the OSXM-
MEXCPT flag in control register CR4 is set), a software exception handler is invoked through a SIMD floating-point
exception. If the exception is unmasked and the OSXMMEXCPT bit is clear (indicating that the operating system
does not support unmasked SIMD floating-point exceptions), an invalid opcode exception (#UD) is signaled instead
of a SIMD floating-point exception.
Note that because SIMD floating-point exceptions are precise and occur immediately, the situation does not arise
where an x87 FPU instruction, a WAIT/FWAIT instruction, or another SSE/SSE2/SSE3 instruction will catch a
pending unmasked SIMD floating-point exception.
If a SIMD floating-point exception occurred while SIMD floating-point exceptions were masked (causing the corre-
sponding exception flag to be set) and the exception was subsequently unmasked, no exception is generated when
the exception is unmasked.
When SSE/SSE2/SSE3 SIMD floating-point instructions operate on packed operands (made up of two or four sub-
operands), multiple SIMD floating-point exception conditions may be detected. If no more than one exception
condition is detected for one or more sets of sub-operands, the exception flags are set for each exception condition
detected. For example, an invalid exception detected for one sub-operand will not prevent the reporting of a divide-
by-zero exception for another sub-operand. However, when two or more exception conditions are generated for
one sub-operand, only one exception condition is reported, according to the precedences shown in Table 7-8. This
exception precedence sometimes results in the higher priority exception condition being reported and the lower
priority exception conditions being ignored.
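A small sketch of unmasking one of these conditions with the SSE control/status intrinsics from <xmmintrin.h>; with ZM clear (and CR4.OSXMMEXCPT set by the operating system), a zero divisor in an SSE divide raises #XM, typically surfaced as SIGFPE on Unix-like systems, instead of producing the masked default result.

    #include <xmmintrin.h>

    static void unmask_simd_divide_by_zero(void)
    {
        unsigned int csr = _mm_getcsr();   /* read MXCSR */
        csr &= ~_MM_MASK_DIV_ZERO;         /* clear ZM (MXCSR bit 9) */
        _mm_setcsr(csr);                   /* write MXCSR back */
    }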
Description
Indicates that the processor detected an EPT violation in VMX non-root operation. Not all EPT violations cause
virtualization exceptions. See Section 27.5.7.2 for details.
The exception handler can recover from EPT violations and restart the program or task without any loss of program
continuity. In some cases, however, the problem that caused the EPT violation may be uncorrectable.
Description
Indicates a control flow transfer attempt violated the control flow enforcement technology constraints.
Control-protection error code format: CPEC in bits 14:0, ENCL in bit 15, bits 31:16 reserved.
In these cases the exception occurs in the context of the new task. The instruction pointer refers to the first instruc-
tion of the new task, not to the instruction which caused the task switch (or the last instruction to be executed, in
the case of an interrupt). If the design of the operating system permits control protection faults to occur during
task-switches, the control protection fault handler should be called through a task gate.
Description
Indicates that the processor did one of the following things:
• Executed an INT n instruction where the instruction operand is one of the vector numbers from 32 through 255.
• Responded to an interrupt request at the INTR pin or from the local APIC when the interrupt vector number
associated with the request is from 32 through 255.
16.Updates to Chapter 8, Volume 3A
Change bars and violet text show changes to Chapter 8 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Added the new Section 8.7, “Flexible Updates of UIF.”
8.1 INTRODUCTION
This chapter provides details of an architectural feature called user interrupts.
This feature defines user interrupts as new events in the architecture. User interrupts are delivered to software
operating in 64-bit mode with CPL = 3 without any change to segmentation state. An individual user interrupt is
identified by a 6-bit user-interrupt vector, which is pushed on the stack as part of user-interrupt delivery. The
UIRET (user-interrupt return) instruction reverses user-interrupt delivery.
System software configures the user-interrupt architecture with MSRs. An operating system (OS) may update the
content of some of these MSRs when switching between OS-managed threads.
One of these MSRs references a data structure called the user posted-interrupt descriptor (UPID). User inter-
rupts for an OS-managed thread can be posted in the UPID associated with that thread. Such user interrupts will
be delivered after receipt of an ordinary interrupt (identified in the UPID) called a user-interrupt notification.1
System software can define operations to post user interrupts and to send user-interrupt notifications. In addition,
the user-interrupt feature defines the SENDUIPI instruction, by which application software can send interprocessor
user interrupts (user IPIs). An execution of SENDUIPI posts a user interrupt in a UPID and may send a user-inter-
rupt notification.
(Platforms may include mechanisms to process external interrupts as either ordinary interrupts or user interrupts.
Those processed as user interrupts would be posted in UPIDs and may result in user-interrupt notifications.
Specifics of such mechanisms are outside of the scope of this manual.)
Section 8.2 explains how a processor enumerates support for user interrupts and how they are enabled by system
software. Section 8.3 identifies the new processor state defined for user interrupts. Section 8.4 explains how a
processor identifies and delivers user interrupts. Section 8.5 describes how a processor identifies and processes
user-interrupt notifications. Section 8.6 enumerates new instructions that support management of user interrupts.
Section 8.8 defines new support for user inter-processor interrupts (user IPIs).
1. For clarity, this chapter uses the term ordinary interrupts to refer to those events in the existing interrupt architecture, which are
typically delivered to system software operating with CPL = 0.
Bit 0 of this MSR corresponds to UISTACKADJUST[0], which controls how user-interrupt delivery updates the
stack pointer. WRMSR may set it to either 0 or 1.
• IA32_UINTR_MISC MSR (MSR address 988H). This MSR is an interface to the UITTSZ and UINV values. The
MSR has the following format:
— Bits 31:0 are UITTSZ.
— Bits 39:32 are UINV.
— Bits 63:40 are reserved. WRMSR causes a #GP if it would set any of those bits (if
EDX[31:8] ≠ 000000H).
Because this MSR will share an 8-byte portion of the XSAVE area with UIF (see Section 13.5.11 of Intel® 64
and IA-32 Architectures Software Developer’s Manual, Volume 1), bit 63 of the MSR will never be used and
will always be reserved.
• IA32_UINTR_PD MSR (MSR address 989H). This MSR is an interface to the UPIDADDR address. This is a linear
address that must be canonical relative to the maximum linear-address width supported by the processor.
WRMSR to this MSR causes a #GP if its source operand does not meet this requirement.
Bits 5:0 of this MSR are reserved. WRMSR causes a #GP if it would set any of those bits (if
EAX[5:0] ≠ 000000b).
• IA32_UINTR_TT MSR (MSR address 98AH). This MSR is an interface to the UITTADDR address (in addition, bit
0 enables SENDUIPI).
Bits 63:4 of this MSR hold the current value of UITTADDR. This is a linear address that must be canonical relative
to the maximum linear-address width supported by the processor. WRMSR to this MSR causes a #GP if its
source operand does not meet this requirement.
Bits 3:1 of this MSR are reserved. WRMSR causes a #GP if it would set any of those bits (if EAX[3:1] ≠ 000b).
Bit 0 of this MSR determines whether the SENDUIPI instruction is enabled. WRMSR may set it to either 0 or 1.
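A sketch of composing the IA32_UINTR_MISC value from its two fields as laid out above (UITTSZ in bits 31:0, UINV in bits 39:32, bits 63:40 zero); wrmsr() is a hypothetical ring-0 helper, not an architectural interface.

    #include <stdint.h>

    #define IA32_UINTR_MISC 0x988   /* MSR address given in the text */

    extern void wrmsr(uint32_t msr, uint64_t value);   /* hypothetical ring-0 helper */

    static void write_uintr_misc(uint32_t uittsz, uint8_t uinv)
    {
        uint64_t value = (uint64_t)uittsz | ((uint64_t)uinv << 32);
        wrmsr(IA32_UINTR_MISC, value);   /* bits 63:40 are left zero */
    }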
1. Execution of the STI instruction does not block delivery of user interrupts for one instruction as it does ordinary interrupts. If a user
interrupt is delivered immediately following execution of a STI instruction, ordinary interrupts are not blocked after delivery of the
user interrupt.
2. User-interrupt delivery occurs only if CPL = 3. Since the HLT and MWAIT instructions can be executed only if CPL = 0, a user inter-
rupt can never be delivered when a logical processor is an activity state that was entered using one of those instructions.
The stack accesses performed by user-interrupt delivery may incur faults (page faults, or stack faults due to
canonicality violations). Before such a fault is delivered, RSP is restored to its original value (memory locations
above the top of the stack may have been written). If such a fault produces an error code that uses the EXT bit,
that bit will be cleared to 0.
If a fault occurs during user-interrupt delivery, UIRR is not updated and UIF is not cleared and, as a result, the
logical processor continues to recognize that a user interrupt is pending, and user-interrupt delivery will normally
recur after the fault is handled.
If the shadow-stack feature of control-flow enforcement technology (CET) is enabled for CPL = 3, user-interrupt
delivery pushes the return instruction pointer on the shadow stack. If the indirect-branch-tracking feature of CET
is enabled, user-interrupt delivery transitions the indirect branch tracker to the WAIT_FOR_ENDBRANCH state; an
ENDBR64 instruction is expected as the first instruction of the user-interrupt handler.
User-interrupt delivery can be tracked by Architectural Last Branch Records (LBRs), Intel® Processor Trace (Intel®
PT), and Performance Monitoring. For both Intel PT and LBRs, user-interrupt delivery is recorded in precisely the
same manner as ordinary interrupt delivery. Hence for LBRs, user interrupts fall into the OTHER_BRANCH category,
which implies that IA32_LBR_CTL.OTHER_BRANCH[bit 22] must be set to record user-interrupt delivery, and that
the IA32_LBR_x_INFO.BR_TYPE field will indicate OTHER_BRANCH for any recorded user interrupt. For Intel PT,
control flow tracing must be enabled by setting IA32_RTIT_CTL.BranchEn[bit 13].
User-interrupt delivery will also increment performance counters for which counting
BR_INST_RETIRED.FAR_BRANCH is enabled. Some implementations may have dedicated events for counting
user-interrupt delivery; see processor-specific event lists at https://download.01.org/perfmon/index/.
The notation PIR (posted-interrupt requests) refers to the 64 posted-interrupt requests in a UPID.
If an ordinary interrupt arrives while CR4.UINTR = IA32_EFER.LMA = 1, the logical processor determines whether
the interrupt is a user-interrupt notification. This process is called user-interrupt notification identification
and is described in Section 8.5.1.
Once a logical processor has identified a user-interrupt notification, it copies user interrupts in the UPID’s PIR into
UIRR. This process is called user-interrupt notification processing and is described in Section 8.5.2.
A logical processor is not interruptible during either user-interrupt notification identification or user-interrupt noti-
fication processing or between those operations (when they occur in succession).
1. If the interrupt arrives between iterations of a REP-prefixed string instruction, the processor first updates state as follows: RIP is
loaded to reference the string instruction; RCX, RSI, and RDI are updated as appropriate to reflect the iterations completed; and
RFLAGS.RF is set to 1.
page fault occurs and that the linear address in the IA32_UINTR_PD MSR is canonical with respect to the paging
mode in use).
If the user-interrupt notification identification that precedes user-interrupt notification processing occurred due to
an ordinary interrupt that arrived while the logical processor was in the HLT state, the logical processor returns to
the HLT state following user-interrupt notification processing.
UITTSZ+1 16-byte entries (the values UITTADDR and UITTSZ are defined in Section 8.3.1). SENDUIPI uses the
UITT entry (UITTE) indexed by the instruction’s register operand. Each UITTE has the following format:
• Bit 0: V, a valid bit.
• Bits 7:1 are reserved and must be 0.
• Bits 15:8: UV, the user-interrupt vector (in the range 0–63, so bits 15:14 must be 0).
• Bits 63:16 are reserved.
• Bits 127:64: UPIDADDR, the linear address of a UPID (64-byte aligned, so bits 69:64 must be 0).
SENDUIPI sends a user interrupt by posting a user interrupt with vector UV in the UPID referenced by UPIDADDR and
then sending, as an ordinary IPI, any notification interrupt specified in that UPID. Details appear in Intel® 64 and
IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C, & 2D.
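A C sketch of one 16-byte UITT entry with the field layout listed above; the exact bit-field packing is compiler-dependent, so this is illustrative rather than a definitive in-memory definition, and how system software allocates and fills the UITT is left to the OS.

    #include <stdint.h>

    struct uitte {
        uint64_t valid : 1;    /* bit 0: V */
        uint64_t rsvd0 : 7;    /* bits 7:1, must be 0 */
        uint64_t uv    : 8;    /* bits 15:8: user-interrupt vector (0-63) */
        uint64_t rsvd1 : 48;   /* bits 63:16, reserved */
        uint64_t upidaddr;     /* bits 127:64: linear address of the UPID,
                                  64-byte aligned */
    } __attribute__((aligned(16)));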
17.Updates to Chapter 10, Volume 3A
Change bars and violet text show changes to Chapter 10 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3A: System Programming Guide, Part 1.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated Section 10.1.2.3, “Features to Disable Bus Locks,” with additional information on UC-lock disable.
The Intel 64 and IA-32 architectures provide mechanisms for managing and improving the performance of multiple
processors connected to the same system bus. These include:
• Bus locking and/or cache coherency management for performing atomic operations on system memory.
• Serializing instructions.
• An advanced programmable interrupt controller (APIC) located on the processor chip (see Chapter 12,
“Advanced Programmable Interrupt Controller (APIC)”). This feature was introduced by the Pentium processor.
• A second-level cache (level 2, L2). For the Pentium 4, Intel Xeon, and P6 family processors, the L2 cache is
included in the processor package and is tightly coupled to the processor. For the Pentium and Intel486
processors, pins are provided to support an external L2 cache.
• A third-level cache (level 3, L3). For Intel Xeon processors, the L3 cache is included in the processor package
and is tightly coupled to the processor.
• Intel Hyper-Threading Technology. This extension to the Intel 64 and IA-32 architectures enables a single
processor core to execute two or more threads concurrently (see Section 10.5, “Intel® Hyper-Threading
Technology and Intel® Multi-Core Technology”).
These mechanisms are particularly useful in symmetric-multiprocessing (SMP) systems. However, they can also be
used when an Intel 64 or IA-32 processor and a special-purpose processor (such as a communications, graphics,
or video processor) share the system bus.
These multiprocessing mechanisms have the following characteristics:
• To maintain system memory coherency — When two or more processors are attempting simultaneously to
access the same address in system memory, some communication mechanism or memory access protocol
must be available to promote data coherency and, in some instances, to allow one processor to temporarily lock
a memory location.
• To maintain cache consistency — When one processor accesses data cached on another processor, it must not
receive incorrect data. If it modifies data, all other processors that access that data must receive the modified
data.
• To allow predictable ordering of writes to memory — In some circumstances, it is important that memory writes
be observed externally in precisely the same order as programmed.
• To distribute interrupt handling among a group of processors — When several processors are operating in a
system in parallel, it is useful to have a centralized mechanism for receiving interrupts and distributing them to
available processors for servicing.
• To increase system performance by exploiting the multi-threaded and multi-process nature of contemporary
operating systems and applications.
The caching mechanism and cache consistency of Intel 64 and IA-32 processors are discussed in Chapter 13. The
APIC architecture is described in Chapter 12. Bus and memory locking, serializing instructions, memory ordering,
and Intel Hyper-Threading Technology are discussed in the following sections.
• Cache coherency protocols that ensure that atomic operations can be carried out on cached data structures
(cache lock); this mechanism is present in the Pentium 4, Intel Xeon, and P6 family processors.
These mechanisms are interdependent in the following ways. Certain basic memory transactions (such as reading
or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the
processor guarantees that the operation will be completed before another processor or bus agent is allowed access
to the memory location. The processor also supports bus locking for performing selected memory operations (such
as a read-modify-write operation in a shared area of memory) that typically need to be handled atomically, but are
not automatically handled this way. Because frequently used memory locations are often cached in a processor’s L1
or L2 caches, atomic operations can often be carried out inside a processor’s caches without asserting the bus lock.
Here the processor’s cache coherency protocols ensure that other processors that are caching the same memory
locations are managed properly while atomic operations are performed on cached memory locations.
NOTE
Where there are contested lock accesses, software may need to implement algorithms that ensure
fair access to resources in order to prevent lock starvation. The hardware provides no resource that
guarantees fairness to participating agents. It is the responsibility of software to manage the
fairness of semaphores and exclusive locking functions.
The mechanisms for handling locked atomic operations have evolved with the complexity of IA-32 processors. More
recent IA-32 processors (such as the Pentium 4, Intel Xeon, and P6 family processors) and Intel 64 provide a more
refined locking mechanism than earlier processors. These mechanisms are described in the following sections.
Except as noted above, an x87 instruction or an SSE instruction that accesses data larger than a quadword may be
implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may
complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g., due to a
page-table entry that is marked “not present”). In this case, the effects of the completed accesses may be visible
to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section
5.10.4.4), such page faults may occur even if all accesses are to the same page.
1. The term “UC lock” is used because the most common situation regards accesses to UC memory. Despite the name, locked accesses
to WC, WP, and WT memory also cause bus locks.
NOTE
Do not implement semaphores using the WC memory type. Do not perform non-temporal stores to
a cache line containing a location used to implement a semaphore.
The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed
for as many bus cycles as necessary to update the entire operand. However, it is recommended that locked accesses
be aligned on their natural boundaries for better system performance:
• Any boundary for an 8-bit access (locked or otherwise).
• 16-bit boundary for locked word accesses.
• 32-bit boundary for locked doubleword accesses.
• 64-bit boundary for locked quadword accesses.
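In C, natural alignment of lock-prefixed operands follows from using properly aligned atomic objects; a brief sketch (struct and field names are illustrative):

    #include <stdatomic.h>
    #include <stdalign.h>
    #include <stdint.h>

    /* Naturally aligned atomic operands so LOCK-prefixed updates can be
       serviced as cache locks rather than bus locks. _Atomic objects are
       normally aligned by the compiler; alignas() is shown for emphasis. */
    struct counters {
        alignas(8) _Atomic uint64_t events;   /* 64-bit boundary for quadword LOCK ops */
        alignas(4) _Atomic uint32_t flags;    /* 32-bit boundary for doubleword LOCK ops */
    };

    static void bump(struct counters *c)
    {
        atomic_fetch_add(&c->events, 1);      /* typically compiles to LOCK XADD/ADD */
    }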
Locked operations are atomic with respect to all other memory operations and all externally visible events. Only
instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchro-
nize data written by one processor and read by another processor.
For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for
them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load
operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
Locked instructions should not be used to ensure that data written can be fetched as instructions.
NOTE
The locked instructions for the current versions of the Pentium 4, Intel Xeon, P6 family, Pentium,
and Intel486 processors allow data written to be fetched as instructions. However, Intel
recommends that developers who require the use of self-modifying code use a different synchro-
nizing mechanism, described in the following sections.
A processor enumerates support for UC-lock disable either by setting bit 4 of the IA32_CORE_CAPABILITIES MSR
(MSR index CFH) or by enumerating CPUID.(EAX=07H, ECX=2):EDX[bit 6] as 1. The latter form of enumeration
(CPUID) is used beginning with processors based on Sierra Forest microarchitecture; earlier processors may use
the former form (IA32_CORE_CAPABILITIES).
NOTE
No processor will both set IA32_CORE_CAPABILITIES[4] and enumerate
CPUID.(EAX=07H, ECX=2):EDX[bit 6] as 1.
If a processor enumerates support for UC-lock disable (in either way), software can enable UC-lock disable by
setting MSR_MEMORY_CTRL[28]. When this bit is set, a locked access using a memory type other than WB causes
a fault. The locked access does not occur. The specific fault that occurs depends on how UC-lock disable is enumer-
ated:
• If IA32_CORE_CAPABILITIES[4] is read as 1, the UC lock results in a general-protection exception (#GP) with
a zero error code.
• If CPUID.(EAX=07H, ECX=2):EDX[bit 6] is enumerated as 1, the UC lock results in an #AC with an error code
with value 4.
UC-lock disable does not apply to locked accesses to physical addresses specified in a VMCS. Such accesses include
updates to accessed and dirty flags for EPT and those to posted-interrupt descriptors.
UC-lock disable is not enabled if CR0.CD = 1 or if MSR_PRMRR_BASE_0[2:0] ≠ 6 (WB) when PRMRRs are enabled.
If either of those conditions hold, the processor ignores the value of MSR_MEMORY_CTRL[28].
Note that the #AC(0) due to split-lock disable or alignment check is higher priority than a #GP(0) or #AC(4) due
to UC-lock disable. If both features are enabled, a locked access to multiple cache lines causes #AC(0) regardless
of the memory type(s) being accessed.
While MSR_MEMORY_CTRL is not an architectural MSR, the behavior described above is consistent across
processor models that enumerate the support in IA32_CORE_CAPABILITIES or CPUID.
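For illustration, a minimal C sketch of the enumeration and enablement sequence described above; it assumes hypothetical rdmsr()/wrmsr() helpers supplied by the kernel environment, GCC/Clang's <cpuid.h>, and a placeholder for the MSR_MEMORY_CTRL index (which is not given here):

#include <stdbool.h>
#include <stdint.h>
#include <cpuid.h>

#define IA32_CORE_CAPABILITIES_MSR  0xCF          /* MSR index given above */
extern const uint32_t MSR_MEMORY_CTRL_ADDR;       /* placeholder: model-specific index, not given here */

extern uint64_t rdmsr(uint32_t msr);              /* hypothetical kernel helpers */
extern void     wrmsr(uint32_t msr, uint64_t value);

/* Returns true if UC-lock disable was enumerated (in either way) and has been enabled. */
static bool enable_uc_lock_disable(void)
{
    unsigned eax, ebx, ecx, edx;
    bool enumerated = false;

    /* CPUID form: CPUID.(EAX=07H, ECX=2):EDX[bit 6]; UC locks then raise #AC(4). */
    if (__get_cpuid_count(0x07, 2, &eax, &ebx, &ecx, &edx) && (edx & (1u << 6)))
        enumerated = true;
    /* MSR form: IA32_CORE_CAPABILITIES[4]; UC locks then raise #GP(0). */
    else if (rdmsr(IA32_CORE_CAPABILITIES_MSR) & (1u << 4))
        enumerated = true;

    if (!enumerated)
        return false;

    /* Enable UC-lock disable by setting MSR_MEMORY_CTRL[28]. */
    wrmsr(MSR_MEMORY_CTRL_ADDR, rdmsr(MSR_MEMORY_CTRL_ADDR) | (1ull << 28));
    return true;
}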
In addition to these features that disable bus locks, there are features that allow software to detect when a bus lock
has occurred. See Section 19.3.1.6 for information about OS bus-lock detection and Section 27.2 for information
about VMM bus-lock detection.
(* OPTION 1 *)
Store modified code (as data) into code segment;
Jump to new code or an intermediate location;
Execute new code;
(* OPTION 2 *)
Store modified code (as data) into code segment;
Execute a serializing instruction; (* For example, CPUID instruction *)
Execute new code;
1. Other alignment-check exceptions occur only if CR0.AM = 1, EFLAGS.AC = 1, and CPL = 3. The alignment-check exceptions resulting
from split-lock disable may occur even if CR0.AM = 0, EFLAGS.AC = 0, or CPL < 3.
The use of one of these options is not required for programs intended to run on the Pentium or Intel486 processors,
but it is recommended to ensure compatibility with the P6 and more recent processor families.
Self-modifying code will execute at a lower level of performance than non-self-modifying or normal code. The
degree of the performance deterioration will depend upon the frequency of modification and specific characteristics
of the code.
The act of one processor writing data into the currently executing code segment of a second processor with the
intent of having the second processor execute that data as code is called cross-modifying code. As with self-
modifying code, IA-32 processors exhibit model-specific behavior when executing cross-modifying code,
depending upon how far ahead of the executing processor’s current execution pointer the code has been modified.
To write cross-modifying code and ensure that it is compliant with current and future versions of the IA-32 archi-
tecture, the following processor synchronization algorithm must be implemented:
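One possible handshake, sketched in C under the assumption of POSIX mmap(), GCC/Clang atomic builtins, and <cpuid.h>; the function and variable names here are illustrative only:

#include <string.h>
#include <sys/mman.h>
#include <cpuid.h>

static unsigned char  *code_buf;        /* buffer shared by modifying and executing processors */
static volatile int    code_ready = 0;  /* publication flag */

/* One-time setup: map a buffer both processors can execute (sketch only; real
 * systems typically avoid pages that are simultaneously writable and executable). */
int init_code_buf(size_t len)
{
    code_buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (code_buf == MAP_FAILED) ? -1 : 0;
}

/* Modifying processor: store the new code (as data), then publish it. */
void publish_code(const unsigned char *new_code, size_t len)
{
    memcpy(code_buf, new_code, len);
    __atomic_store_n(&code_ready, 1, __ATOMIC_RELEASE);
}

/* Executing processor: wait for publication, execute a serializing instruction
 * (CPUID), then begin executing the newly stored code. */
void run_published_code(void)
{
    unsigned a, b, c, d;

    while (!__atomic_load_n(&code_ready, __ATOMIC_ACQUIRE))
        ;                               /* wait for the modifying processor */
    __get_cpuid(0, &a, &b, &c, &d);     /* serializing instruction */
    ((void (*)(void))code_buf)();       /* jump to the new code */
}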
Section 10.2.1 and Section 10.2.2 describe the memory-ordering implemented by Intel486, Pentium, Intel Core 2
Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel Xeon, and P6 family processors. Section 10.2.3 gives examples
illustrating the behavior of the memory-ordering model on IA-32 and Intel-64 processors. Section 10.2.4 considers
the special treatment of stores for string operations and Section 10.2.5 discusses how memory-ordering behavior
may be modified through the use of specific instructions.
1. Earlier versions of this manual specified that writes to memory may be reordered with executions of the CLFLUSH instruction. No
processors implementing the CLFLUSH instruction allow such reordering.
• MFENCE instructions cannot pass earlier reads, writes, or executions of CLFLUSH and CLFLUSHOPT.
In a multiple-processor system, the following ordering principles apply:
• Individual processors use the same ordering principles as in a single-processor system.
• Writes by a single processor are observed in the same order by all processors.
• Writes from an individual processor are NOT ordered with respect to the writes from other processors.
• Memory ordering obeys causality (memory ordering respects transitive visibility).
• Any two stores are seen in a consistent order by processors other than those performing the stores.
• Locked instructions have a total order.
See the example in Figure 10-1. Consider three processors in a system and each processor performs three writes,
one to each of three defined locations (A, B, and C). Individually, the processors perform the writes in the same
program order, but because of bus arbitration and other memory access mechanisms, the order that the three
processors write the individual memory locations can differ each time the respective code sequences are executed
on the processors. The final values in locations A, B, and C may therefore vary on each execution of the write
sequence.
The processor-ordering model described in this section is virtually identical to that used by the Pentium and
Intel486 processors. The only enhancements in the Pentium 4, Intel Xeon, and P6 family processors are:
• Added support for speculative reads, while still adhering to the ordering principles above.
• Store-buffer forwarding, when a read passes a write to the same memory location.
• Out of order store from long string store and string move operations (see Section 10.2.4, “Fast-String
Operation and Out-of-Order Stores,” below).
NOTE
In the P6 processor family, store-buffer forwarding to reads of WC memory from streaming stores to
the same address does not occur due to errata.
10.2.3.2 Neither Loads Nor Stores Are Reordered with Like Operations
The Intel-64 memory-ordering model allows neither loads nor stores to be reordered with the same kind of opera-
tion. That is, it ensures that loads are seen in program order and that stores are seen in program order. This is illus-
trated by the following example:
Example 10-1. Stores Are Not Reordered with Other Stores
Processor 0 Processor 1
mov [ _x], 1 mov r1, [ _y]
mov [ _y], 1 mov r2, [ _x]
Initially x = y = 0
r1 = 1 and r2 = 0 is not allowed
The disallowed return values could be exhibited only if processor 0’s two stores are reordered (with the two loads
occurring between them) or if processor 1’s two loads are reordered (with the two stores occurring between them).
If r1 = 1, the store to y occurs before the load from y. Because the Intel-64 memory-ordering model does not allow
stores to be reordered, the earlier store to x occurs before the load from y. Because the Intel-64 memory-ordering
model does not allow loads to be reordered, the store to x also occurs before the later load from x. Thus, r2 = 1.
Assume r1 = 1.
• Because r1 = 1, processor 1’s store to x occurs before processor 0’s load from x.
• Because the Intel-64 memory-ordering model prevents each store from being reordered with the earlier load
by the same processor, processor 1’s load from y occurs before its store to x.
• Similarly, processor 0’s load from x occurs before its store to y.
• Thus, processor 1’s load from y occurs before processor 0’s store to y, implying r2 = 0.
At each processor, the load and the store are to different locations and hence may be reordered. Any interleaving
of the operations is thus allowed. One such interleaving has the two loads occurring before the two stores. This
would result in each load returning value 0.
The fact that a load may not be reordered with an earlier store to the same location is illustrated by the following
example:
Example 10-4. Loads Are not Reordered with Older Stores to the Same Location
Processor 0
mov [ _x], 1
mov r1, [ _x]
Initially x = 0
r1 = 0 is not allowed
The Intel-64 memory-ordering model does not allow the load to be reordered with the earlier store because the
accesses are to the same location. Therefore, r1 = 1 must hold.
The memory-ordering model imposes no constraints on the order in which the two stores appear to execute by the
two processors. This fact allows processor 0 to see its store before seeing processor 1's, while processor 1 sees its
store before seeing processor 0's. (Each processor is self consistent.) This allows r2 = 0 and r4 = 0.
In practice, the reordering in this example can arise as a result of store-buffer forwarding. While a store is tempo-
rarily held in a processor's store buffer, it can satisfy the processor's own loads but is not visible to (and cannot
satisfy) loads by other processors.
Initially x = y =0
r1 = 1, r2 = 0, r3 = 1, r4 = 0 is not allowed
Processor 2 and processor 3 must agree on the order of the two executions of XCHG. Without loss of generality,
suppose that processor 0’s XCHG occurs first.
• If r5 = 1, processor 1’s XCHG into y occurs before processor 3’s load from y.
• Because the Intel-64 memory-ordering model prevents loads from being reordered (see Section 10.2.3.2),
processor 3’s loads occur in order and, therefore, processor 1’s XCHG occurs before processor 3’s load from x.
• Since processor 0’s XCHG into x occurs before processor 1’s XCHG (by assumption), it occurs before
processor 3’s load from x. Thus, r6 = 1.
A similar argument (referring instead to processor 2’s loads) applies if processor 1’s XCHG occurs before
processor 0’s XCHG.
10.2.3.9 Loads and Stores Are Not Reordered with Locked Instructions
The memory-ordering model prevents loads and stores from being reordered with locked instructions that execute
earlier or later. The examples in this section illustrate only cases in which a locked instruction is executed before a
load or a store. The reader should note that reordering is prevented also if the locked instruction is executed after
a load or a store.
The first example illustrates that loads may not be reordered with earlier locked instructions:
Example 10-9. Loads Are not Reordered with Locks
Processor 0 Processor 1
xchg [ _x], r1 xchg [ _y], r3
mov r2, [ _y] mov r4, [ _x]
Initially x = y = 0, r1 = r3 = 1
r2 = 0 and r4 = 0 is not allowed
As explained in Section 10.2.3.8, there is a total order of the executions of locked instructions. Without loss of
generality, suppose that processor 0’s XCHG occurs first.
Because the Intel-64 memory-ordering model prevents processor 1’s load from being reordered with its earlier
XCHG, processor 0’s XCHG occurs before processor 1’s load. This implies r4 = 1.
A similar argument (referring instead to processor 0’s accesses) applies if processor 1’s XCHG occurs before
processor 0’s XCHG.
The second example illustrates that a store may not be reordered with an earlier locked instruction:
Example 10-10. Stores Are not Reordered with Locks
Processor 0 Processor 1
xchg [ _x], r1 mov r2, [ _y]
mov [ _y], 1 mov r3, [ _x]
Initially x = y = 0, r1 = 1
r2 = 1 and r3 = 0 is not allowed
Assume r2 = 1.
• Because r2 = 1, processor 0’s store to y occurs before processor 1’s load from y.
• Because the memory-ordering model prevents a store from being reordered with an earlier locked instruction,
processor 0’s XCHG into x occurs before its store to y. Thus, processor 0’s XCHG into x occurs before
processor 1’s load from y.
• Because the memory-ordering model prevents loads from being reordered (see Section 10.2.3.2),
processor 1’s loads occur in order and, therefore, processor 0’s XCHG into x occurs before processor 1’s load
from x. Thus, r3 = 1.
It is possible for processor 1 to perceive that the repeated string stores in processor 0 are happening out of order.
Assume that fast string operations are enabled on processor 0.
In Example 10-12, processor 0 does two separate rounds of rep stosd operation of 128 doubleword stores, writing
the value 1 (value in EAX) into the first block of 512 bytes from location _x (kept in ES:EDI) in ascending order. It
then writes 1 into a second block of memory from (_x+512) to (_x+1023). All of the memory locations initially
contain 0. Processor 1 performs two load operations from the two blocks
of memory.
It is not possible in the above example for processor 1 to perceive any of the stores from the later string operation
(to the second 512 block) in processor 0 before seeing the stores from the earlier string operation to the first 512
block.
The above example assumes that writes to the second block (_x+512 to _x+1023) are not executed while
processor 0’s string operation to the first block is interrupted. If the string operation to the first block by
processor 0 is interrupted, and a write to the second memory block is executed by the interrupt handler, then that
change in the second memory block will be visible before the string operation to the first memory block resumes.
In Example 10-13, processor 0 does one round of (128 iterations) doubleword string store operation via rep:stosd,
writing the value 1 (value in EAX) into a block of 512 bytes from location _x (kept in ES:EDI) in ascending order. It
then writes to a second memory location outside the memory block of the previous string operation. Processor 1
performs two read operations, the first read is from an address outside the 512-byte block but to be updated by
processor 0; the second read is from inside the memory block of the string operation.
Example 10-13. String Operations Are not Reordered with later Stores
Processor 0 Processor 1
rep:stosd [ _x] mov r1, [ _z]
mov [_z], $1 mov r2, [ _y]
Initially on processor 0: EAX = 1, ECX=128, ES:EDI =_x
Initially [_y] = [_z] = 0, [_x] to [_x+511] = 0, _x <= _y < _x+512, _z is a separate memory location
r1 = 1 and r2 = 0 is not allowed
Processor 1 cannot perceive the later store by processor 0 until it sees all the stores from the string operation.
Example 10-13 assumes that processor 0’s store to [_z] is not executed while the string operation has been inter-
rupted. If the string operation is interrupted and the store to [_z] by processor 0 is executed by the interrupt
handler, then changes to [_z] will become visible before the string operation resumes.
Example 10-14 illustrates the visibility principle when a string operation is interrupted.
In Example 10-14, processor 0 started a string operation to write to a memory block of 512 bytes starting at
address _x. Processor 0 got interrupted after k iterations of store operations. The address _y has not yet been
updated by processor 0 when processor 0 got interrupted. The interrupt handler that took control on processor 0
writes to the address _z. Processor 1 may see the store to _z from the interrupt handler, before seeing the
remaining stores to the 512-byte memory block that are executed when the string operation resumes.
Example 10-15 illustrates the ordering of string operations with earlier stores. No store from a string operation can
be visible before all prior stores are visible.
Example 10-15. String Operations Are not Reordered with Earlier Stores
Processor 0 Processor 1
mov [_z], $1 mov r1, [ _y]
rep:stosd [ _x] mov r2, [ _z]
Initially on processor 0: EAX = 1, ECX=128, ES:EDI =_x
Initially [_y] = [_z] = 0, [_x] to [_x+511] = 0, _x <= _y < _x+512, _z is a separate memory location
r1 = 1 and r2 = 0 is not allowed
The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store
memory ordering between routines that produce weakly-ordered results and routines that consume that data. The
functions of these instructions are as follows:
• SFENCE — Serializes all store (write) operations that occurred prior to the SFENCE instruction in the program
instruction stream, but does not affect load operations.
• LFENCE — Serializes all load (read) operations that occurred prior to the LFENCE instruction in the program
instruction stream, but does not affect store operations.1
• MFENCE — Serializes all store and load operations that occurred prior to the MFENCE instruction in the
program instruction stream.
Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory
ordering than the CPUID instruction.
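As an illustration of the producer/consumer case, a hedged C sketch using GCC/Clang x86 intrinsics from <immintrin.h>; the payload and flag names are illustrative:

#include <immintrin.h>

static volatile int payload;            /* data produced with a weakly-ordered store */
static volatile int flag = 0;           /* ordinary WB flag */

/* Producer: write the payload with a non-temporal (WC-like) store, then use
 * SFENCE so the payload becomes globally visible before the flag is set. */
void producer(int value)
{
    _mm_stream_si32((int *)&payload, value);
    _mm_sfence();
    flag = 1;
}

/* Consumer: once the flag is observed set, the payload is guaranteed visible
 * (loads are not reordered with other loads in the Intel-64 ordering model). */
int consumer(void)
{
    while (flag == 0)
        ;
    return payload;
}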
The MTRRs were introduced in the P6 family processors to define the cache characteristics for specified areas of
physical memory. The following are two examples of how memory types set up with MTRRs can be used to strengthen
or weaken memory ordering for the Pentium 4, Intel Xeon, and P6 family processors:
• The strong uncached (UC) memory type forces a strong-ordering model on memory accesses. Here, all reads
and writes to the UC memory region appear on the bus and out-of-order or speculative accesses are not
performed. This memory type can be applied to an address range dedicated to memory mapped I/O devices to
force strong memory ordering.
• For areas of memory where weak ordering is acceptable, the write back (WB) memory type can be chosen.
Here, reads can be performed speculatively and writes can be buffered and combined. For this type of memory,
cache locking is performed on atomic (locked) operations that do not split across cache lines, which helps to
reduce the performance penalty associated with the use of the typical synchronization instructions, such as
XCHG, that lock the bus during the entire read-modify-write operation. With the WB memory type, the XCHG
instruction locks the cache instead of the bus if the memory access is contained within a cache line.
The PAT was introduced in the Pentium III processor to enhance the caching characteristics that can be assigned to
pages or groups of pages. The PAT mechanism is typically used to strengthen caching characteristics at the page level
with respect to the caching characteristics established by the MTRRs. Table 13-7 shows the interaction of the PAT
with the MTRRs.
Intel recommends that software written to run on Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel
Xeon, and P6 family processors assume the processor-ordering model or a weaker memory-ordering model. The
Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel Xeon, and P6 family processors do not implement a
strong memory-ordering model, except when using the UC memory type. Despite the fact that Pentium 4, Intel
Xeon, and P6 family processors support processor ordering, Intel does not guarantee that future processors will
support this model. To make software portable to future processors, it is recommended that operating systems
provide critical-region and resource-control constructs and APIs (application program interfaces), based on I/O,
locking, and/or serializing instructions, to synchronize access to shared areas of memory in multiple-
processor systems. Also, software should not depend on processor ordering in situations where the system hard-
ware does not support this memory-ordering model.
1. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution
until LFENCE completes. As a result, an instruction that loads from memory and that precedes an LFENCE receives data from mem-
ory prior to completion of the LFENCE. An LFENCE that follows an instruction that stores to memory might complete before the data
being stored have become globally visible. Instructions following an LFENCE may be fetched from memory before the LFENCE, but
they will not execute until the LFENCE completes.
operations that were started while the processor was in real-address mode are completed before the switch to
protected mode is made.
The concept of serializing instructions was introduced into the IA-32 architecture with the Pentium processor to
support parallel instruction execution. Serializing instructions have no meaning for the Intel486 and earlier proces-
sors that do not implement parallel instruction execution.
It is important to note that execution of serializing instructions on P6 and more recent processor families constrains
speculative execution because the results of speculatively executed instructions are discarded. The following
instructions are serializing instructions:
• Privileged serializing instructions — INVD, INVEPT, INVLPG, INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to
control register, with the exception of MOV CR8), MOV (to debug register), WBINVD, and WRMSR.
• Non-privileged serializing instructions — CPUID, IRET, RSM, and SERIALIZE.
When the processor serializes instruction execution, it ensures that all pending memory transactions are
completed (including writes stored in its store buffer) before it executes the next instruction. Nothing can pass a
serializing instruction and a serializing instruction cannot pass any other instruction (read, write, instruction fetch,
or I/O). For example, CPUID can be executed at any privilege level to serialize instruction execution with no effect
on program flow, except that the EAX, EBX, ECX, and EDX registers are modified.
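For example, a hedged C sketch that uses CPUID as a serialization point before reading the time-stamp counter, assuming GCC/Clang's <cpuid.h> and <x86intrin.h>:

#include <stdint.h>
#include <cpuid.h>
#include <x86intrin.h>

/* Read the TSC only after all prior instructions and buffered writes have completed. */
uint64_t serialized_rdtsc(void)
{
    unsigned a, b, c, d;

    __get_cpuid(0, &a, &b, &c, &d);   /* CPUID serializes instruction execution */
    return __rdtsc();
}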
The following instructions are memory-ordering instructions, not serializing instructions. These drain the data
memory subsystem. They do not serialize the instruction execution stream:
• Non-privileged memory-ordering instructions — SFENCE, LFENCE, and MFENCE.
The SFENCE, LFENCE, and MFENCE instructions provide more granularity in controlling the serialization of memory
loads and stores (see Section 10.2.5, “Strengthening or Weakening the Memory-Ordering Model”).
The following additional information is worth noting regarding serializing instructions:
• The processor does not write back the contents of modified data in its data cache to external memory when it
serializes instruction execution. Software can force modified data to be written back by executing the WBINVD
instruction, which is a serializing instruction. The amount of time or cycles for WBINVD to complete will vary
due to the size of different cache hierarchies and other factors. As a consequence, the use of the WBINVD
instruction can have an impact on interrupt/event response time.
• When an instruction is executed that enables or disables paging (that is, changes the PG flag in control register
CR0), the instruction should be followed by a jump instruction. The target instruction of the jump instruction is
fetched with the new setting of the PG flag (that is, paging is enabled or disabled), but the jump instruction
itself is fetched with the previous setting. The Pentium 4, Intel Xeon, and P6 family processors do not require
the jump operation following the move to register CR0 (because any use of the MOV instruction in a Pentium 4,
Intel Xeon, or P6 family processor to write to CR0 is completely serializing). However, to maintain backwards
and forward compatibility with code written to run on other IA-32 processors, it is recommended that the jump
operation be performed.
• Whenever an instruction is executed to change the contents of CR3 while paging is enabled, the next
instruction is fetched using the translation tables that correspond to the new value of CR3. Therefore the next
instruction and the sequentially following instructions should have a mapping based upon the new value of
CR3. (Global entries in the TLBs are not invalidated, see Section 5.10.4, “Invalidation of TLBs and Paging-
Structure Caches.”)
• The Pentium processor and more recent processor families use branch-prediction techniques to improve
performance by prefetching the destination of a branch instruction before the branch instruction is executed.
Consequently, instruction execution is not deterministically serialized when a branch instruction is executed.
APIC ID). During boot-up, one of the logical processors is selected as the BSP and the remainder of the logical
processors are designated as APs.
message. It responds by fetching and executing the BIOS boot-strap code, beginning at the reset vector
(physical address FFFF FFF0H).
5. As part of the boot-strap code, the BSP creates an ACPI table and/or an MP table and adds its initial APIC ID to
these tables as appropriate.
6. At the end of the boot-strap procedure, the BSP sets a processor counter to 1, then broadcasts a SIPI message
to all the APs in the system. Here, the SIPI message contains a vector to the BIOS AP initialization code (at
000VV000H, where VV is the vector contained in the SIPI message).
7. The first action of the AP initialization code is to set up a race (among the APs) to a BIOS initialization
semaphore. The first AP to the semaphore begins executing the initialization code. (See Section 10.4.4, “MP
Initialization Example,” for semaphore implementation details.) As part of the AP initialization procedure, the
AP adds its APIC ID number to the ACPI and/or MP tables as appropriate and increments the processor counter
by 1. At the completion of the initialization procedure, the AP executes a CLI instruction and halts itself.
8. When each of the APs has gained access to the semaphore and executed the AP initialization code, the BSP
establishes a count for the number of processors connected to the system bus, completes executing the BIOS
boot-strap code, and then begins executing operating-system boot-strap and start-up code.
9. While the BSP is executing operating-system boot-strap and start-up code, the APs remain in the halted state.
In this state they will respond only to INITs, NMIs, and SMIs. They will also respond to snoops and to assertions
of the STPCLK# pin.
The following section gives an example (with code) of the MP initialization protocol for multiple processors oper-
ating in an MP configuration.
Chapter 2, “Model-Specific Registers (MSRs)‚” in the Intel® 64 and IA-32 Architectures Software Developer’s
Manual, Volume 4, describes how to program the LINT[0:1] pins of the processor’s local APICs after an MP config-
uration has been completed.
5. Executes the CPUID instruction with a value of 0H in the EAX register, then reads the EBX, ECX, and EDX
registers to determine if the BSP is “GenuineIntel.”
6. Executes the CPUID instruction with a value of 1H in the EAX register, then saves the values in the EAX, ECX,
and EDX registers in a system configuration space in RAM for use later.
7. Loads start-up code for the AP to execute into a 4-KByte page in the lower 1 MByte of memory.
8. Switches to protected mode and ensures that the APIC address space is mapped to the strong uncacheable
(UC) memory type.
9. Determines the BSP’s APIC ID from the local APIC ID register (default is 0). The code snippet below is an
example that applies to logical processors in a system whose local APIC units operate in xAPIC mode, where the
APIC registers are accessed using the memory-mapped interface:
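A hedged C sketch of such a read, assuming the local APIC page is mapped at the default xAPIC base (FEE00000H) with the UC memory type; this is a bare-metal fragment, not portable user code:

#include <stdint.h>

#define LOCAL_APIC_ID_REG  ((volatile uint32_t *)0xFEE00020u)  /* default APIC base + 20H */

/* Returns the 8-bit xAPIC ID (bits 31:24) of the processor executing this code. */
static uint32_t read_local_apic_id(void)
{
    return (*LOCAL_APIC_ID_REG) >> 24;
}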
16. Reads and evaluates the COUNT variable and establishes a processor count.
17. If necessary, reconfigures the APIC and continues with the remaining system diagnostics as appropriate.
Figure 10-2. APIC ID fields (Reserved, Cluster, Processor ID)
For P6 family processors, the APIC ID that is assigned to a processor during power-up and initialization is 4 bits
(see Figure 10-2). Here, bits 0 and 1 form a 2-bit processor (or socket) identifier and bits 2 and 3 form a 2-bit
cluster ID.
• Addressable IDs for processor cores in the same Package¹ (CPUID.(EAX=4, ECX=0²):EAX[31:26] +
1 = Y) — Indicates the maximum number of addressable IDs attributable to processor cores (Y) in the physical
package.
• Extended Processor Topology Enumeration parameters for 32-bit APIC ID: Intel 64 processors
supporting CPUID leaf 0BH will assign unique APIC IDs to each logical processor in the system. CPUID leaf 0BH
reports the 32-bit APIC ID and provides topology enumeration parameters. See the CPUID instruction reference
pages in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.
The CPUID feature flag may indicate support for hardware multi-threading when only one logical processor is avail-
able in the package. In this case, the decimal value represented by bits 16 through 23 in the EBX register will have
a value of 1.
Software should note that the number of logical processors enabled by system software may be less than the value
of “Addressable IDs for Logical processors”. Similarly, the number of cores enabled by system software may be less
than the value of “Addressable IDs for processor cores”.
Software can detect the availability of the CPUID extended topology enumeration leaf (0BH) by performing two
steps:
• Check maximum input value for basic CPUID information by executing CPUID with EAX= 0. If CPUID.0H:EAX is
greater than or equal to 11 (0BH), then proceed to the next step.
• Check that CPUID.(EAX=0BH, ECX=0H):EBX is non-zero.
If both of the above conditions are true, the extended topology enumeration leaf is available. Note that the presence
of CPUID leaf 0BH in a processor does not guarantee that the local APIC supports x2APIC. If
CPUID.(EAX=0BH, ECX=0H):EBX returns zero and the maximum input value for basic CPUID information is greater
than 0BH, then the CPUID 0BH leaf is not supported on that processor.
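A minimal C sketch of this two-step check, assuming GCC/Clang's <cpuid.h> helpers:

#include <stdbool.h>
#include <cpuid.h>

static bool has_extended_topology_leaf(void)
{
    unsigned eax, ebx, ecx, edx;

    /* Step 1: CPUID.0H:EAX reports the maximum basic leaf; it must be at least 0BH. */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx) || eax < 0x0B)
        return false;

    /* Step 2: CPUID.(EAX=0BH, ECX=0):EBX must be non-zero. */
    if (!__get_cpuid_count(0x0B, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ebx != 0;
}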
1. Operating system and BIOS may implement features that reduce the number of logical processors available in a platform to applica-
tions at runtime to less than the number of physical packages times the number of hardware-capable logical processors per pack-
age.
1. Software must check CPUID for its support of leaf 4 when implementing support for multi-core. If CPUID leaf 4 is not available at
runtime, software should handle the situation as if there is only one core per package.
2. Maximum number of cores in the physical package must be queried by executing CPUID with EAX=4 and a valid ECX input value.
Valid ECX input values start from 0.
Figure 10-3. Local APICs and I/O APIC in MP System Supporting Intel HT Technology
consists of two logical processors (each represented by a separate architectural state) which share the processor’s
execution engine and the bus interface. Each logical processor also has its own advanced programmable interrupt
controller (APIC).
Figure 10-4. IA-32 Processor with Two Logical Processors Supporting Intel HT Technology
On Intel Atom family processors that support Intel Hyper-Threading Technology, the MCA facilities are shared
between all logical processors on the same processor core.
NOTE
Some processors (prior to the introduction of Intel 64 Architecture and based on Intel NetBurst
microarchitecture) do not support simultaneous loading of microcode update to the sibling logical
processors in the same core. All other processors support logical processors initiating an update
simultaneously. Intel recommends a common approach that the microcode loader use the
sequential technique described in Section 11.11.6.3.
When a logical processor performs a TLB invalidation operation, only the TLB entries that are tagged for that logical
processor are guaranteed to be flushed. This protocol applies to all TLB invalidation operations, including writes to
control registers CR3 and CR4 and uses of the INVLPG instruction.
In MP systems, the STPCLK# pins on all physical processors are generally tied together. As a result this signal
affects all the logical processors within the system simultaneously.
• LINT0 and LINT1 pins — A processor supporting Intel Hyper-Threading Technology has only one set of LINT0
and LINT1 pins, which are shared between the logical processors. When one of these pins is asserted, both
logical processors respond unless the pin has been masked in the APIC local vector tables for one or both of the
logical processors.
Typically in MP systems, the LINT0 and LINT1 pins are not used to deliver interrupts to the logical processors.
Instead all interrupts are delivered to the local processors through the I/O APIC.
• A20M# pin — On an IA-32 processor, the A20M# pin is typically provided for compatibility with the Intel 286
processor. Asserting this pin causes bit 20 of the physical address to be masked (forced to zero) for all external
bus memory accesses. Processors supporting Intel Hyper-Threading Technology provide one A20M# pin, which
affects the operation of both logical processors within the physical processor.
The functionality of A20M# is used primarily by older operating systems and not used by modern operating
systems. On newer Intel 64 processors, A20M# may be absent.
In general, each processor core has dedicated microarchitectural resources identical to a single-processor imple-
mentation of the underlying microarchitecture without hardware multi-threading capability. Each logical processor
in a dual-core processor (whether supporting Intel Hyper-Threading Technology or not) has its own APIC function-
ality, PAT, machine check architecture, debug registers and extensions. Each logical processor handles serialization
instructions or self-modifying code on its own. Memory order is handled the same way as in Intel Hyper-Threading
Technology.
The topology of the cache hierarchy (with respect to whether a given cache level is shared by one or more
processor cores or by all logical processors in the physical package) depends on the processor implementation.
Software must use the deterministic cache parameter leaf of CPUID instruction to discover the cache-sharing
topology between the logical processors in a multi-threading environment.
cores or different physical packages. Either logical processor that has access to the microcode update facility can
initiate an update.
Each logical processor has its own BIOS signature MSR (IA32_BIOS_SIGN_ID at MSR address 8BH). When a logical
processor performs an update for the physical processor, the IA32_BIOS_SIGN_ID MSRs for resident logical
processors are updated with identical information.
All microcode update steps during processor initialization should use the same update data on all cores in all phys-
ical packages of the same stepping. Any subsequent microcode update must apply consistent update data to all
cores in all physical packages of the same stepping. If the processor detects an attempt to load an older microcode
update when a newer microcode update had previously been loaded, it may reject the older update to stay with the
newer update.
NOTE
Some processors (prior to the introduction of Intel 64 Architecture and based on Intel NetBurst
microarchitecture) do not support simultaneous loading of microcode update to the sibling logical
processors in the same core. All other processors support logical processors initiating an update
simultaneously. Intel recommends a common approach that the microcode loader use the
sequential technique described in Section 11.11.6.3.
• Module — A set of cores that share certain resources. The MODULE_ID sub-field distinguishes different
modules. If there are no software visible modules, the width of this bit field is 0.
• Core — Processor cores may be contained within modules, within tiles, on software-visible die, or appear
directly at the package domain. The CORE_ID sub-field distinguishes processor cores. For a single-core
processor, the width of this bit field is 0.
• Logical Processor — A processor core provides one or more logical processors sharing execution resources.
The LOGICAL_PROCESSOR_ID sub-field distinguishes logical processors in a core. The width of this bit field is
non-zero if a processor core provides more than one logical processor.
The LOGICAL_PROCESSOR_ID and CORE_ID sub-fields are bit-wise contiguous in the APIC_ID field (see
Figure 10-5).
Figure 10-5. Generalized Seven-Domain Interpretation of the APIC ID (sub-fields, from most to least significant: Reserved/CLUSTER_ID, PACKAGE_ID, DIE_ID, TILE_ID, MODULE_ID, CORE_ID, LOGICAL_PROCESSOR_ID)
If the processor supports CPUID leaf 0BH and leaf 1FH, the 32-bit APIC ID can represent cluster plus several
domains of topology within the physical processor package. The exact number of hierarchical domains within a
physical processor package must be enumerated through CPUID leaf 0BH and leaf 1FH. Common processor fami-
lies may employ a topology similar to that represented by the 8-bit Initial APIC ID. In general, CPUID leaf 0BH and
leaf 1FH can support a topology enumeration algorithm that decomposes a 32-bit APIC ID into more than four sub-
fields (see Figure 10-6).
NOTE
CPUID leaf 0BH and leaf 1FH can have differences in the number of domain types reported (CPUID
leaf 1FH defines additional domain types). If the processor supports CPUID leaf 1FH, usage of this
leaf is preferred over leaf 0BH. CPUID leaf 0BH is available for legacy compatibility going forward.
The width of each sub-field depends on hardware and software configurations. Field widths can be determined at
runtime using the algorithm discussed below (Example 10-16 through Example 10-21).
Figure 10-7 depicts the relationships of three of the hierarchical sub-fields in a hypothetical MP system. The values of
valid APIC_IDs need not be contiguous across package or core boundaries.
Figure 10-6. Decomposition of a 32-bit x2APIC ID into hierarchical sub-fields (Reserved, CLUSTER_ID, PACKAGE_ID, hypothetical Q_ID and R_ID domains, CORE_ID, LOGICAL_PROCESSOR_ID; bits 31:0)
Word NumberOfDomainsBelowPackage = 0;
DWord Subleaf = 0;
EAX = 0BH or 1FH; // query each sub leaf of CPUID leaf 0BH or 1FH; CPUID leaf 1FH is preferred over leaf 0BH if available.
ECX = Subleaf;
CPUID;
while(EBX != 0) // Enumerate until EBX reports 0
{
if(EAX[4:0] != 0) // A Shift Value of 0 indicates this domain does not exist.
// (Such as no SMT_ID, which is required entry at sub-leaf 0.)
{
NumberOfDomainsBelowPackage++;
}
Subleaf++;
EAX = 0BH or 1FH;
ECX = Subleaf;
CPUID;
}
// NumberOfDomainsBelowPackage contains the absolute number of domains that exist below package.
N = Subleaf; // Sub-leaf supplies the number of entries CPUID will return.
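A hedged C rendering of the loop above, assuming GCC/Clang's <cpuid.h>; the leaf argument would be 1FH when that leaf is available, otherwise 0BH:

#include <cpuid.h>

static unsigned domains_below_package(unsigned leaf)   /* leaf = 0x1F or 0x0B */
{
    unsigned subleaf = 0, count = 0;
    unsigned eax, ebx, ecx, edx;

    /* Enumerate sub-leaves until EBX reports 0. */
    while (__get_cpuid_count(leaf, subleaf, &eax, &ebx, &ecx, &edx) && ebx != 0) {
        if ((eax & 0x1F) != 0)          /* a shift value of 0 means this domain does not exist */
            count++;
        subleaf++;
    }
    return count;                       /* number of domains that exist below the package */
}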
• Sub-leaf index 0 (ECX= 0 as input) provides enumeration parameters to extract the LOGICAL_PROCESSOR_ID
sub-field of x2APIC ID. If EAX = 0BH or 1FH, and ECX =0 is specified as input when executing CPUID,
CPUID.(EAX=0BH or 1FH, ECX=0):EAX[4:0] reports a value (a right-shift count) that allows software to extract
part of the x2APIC ID to distinguish the next higher topological entities above the LOGICAL_PROCESSOR_ID
domain. This value also corresponds to the bit-width of the sub-field of the x2APIC ID corresponding to the hierar-
chical domain with sub-leaf index 0.
• For each subsequent higher sub-leaf index m, CPUID.(EAX=0BH or 1FH, ECX=m):EAX[4:0] reports the right-
shift count that will allow software to extract part of x2APIC ID to distinguish higher-domain topological
entities. This means the right-shift value of sub-leaf m corresponds to the least significant (m+1) sub-fields
of the 32-bit x2APIC ID.
NOTE
CPUID leaf 1FH is a preferred superset to leaf 0BH. Leaf 1FH defines additional domain types, and
it must be parsed by an algorithm that can handle the addition of future domain types.
Previously, only the following encodings of hierarchical domain types were defined: 0 (invalid), 1 (logical processor),
and 2 (core). With the additional hierarchical domain types available (see Section 10.9.1, “Hierarchical Mapping of
Shared Resources,” and Figure 10-5, “Generalized Seven-Domain Interpretation of the APIC ID” ) software must
not assume any “domain type” encoding value to be related to any sub-leaf index, except sub-leaf 0.
Example 10-18. Support Routines for Identifying Package, Die, Core, and Logical Processors from 32-bit x2APIC ID
a. Derive the extraction bitmask for logical processors in a processor core and associated mask offset for different
cores.
//
// This example shows how to enumerate CPU topology domain types (domain types may or may not be known/supported by
// the software).
//
// Below is the list of sample domain types used in the example.
// Refer to the CPUID Leaf 1FH definition for the actual domain type numbers: “V2 Extended Topology Enumeration Leaf
// (Initial EAX Value = 1FH, ECX ≥ 0)”.
//
// LOGICAL PROCESSOR
// CORE
// MODULE
// TILE
// DIE
// PACKAGE
//
// The example shows how to identify and derive the extraction bitmasks for the domains with domain types
// LOGICAL_PROCESSOR_ID/CORE_ID/DIE_ID/PACKAGE_ID.
//
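A reduced, hedged C sketch of the derivation outlined by the comments above, assuming GCC/Clang's <cpuid.h> and the leaf 1FH domain-type encodings 1 (logical processor) and 2 (core); any domains above the core are folded into the package mask in this sketch:

#include <stdint.h>
#include <cpuid.h>

/* Derives x2APIC-ID select masks: smt_mask isolates LOGICAL_PROCESSOR_ID,
 * core_mask isolates CORE_ID, and pkg_mask covers PACKAGE_ID plus any higher,
 * unrecognized domains. */
static void derive_topology_masks(uint32_t *smt_mask, uint32_t *core_mask, uint32_t *pkg_mask)
{
    unsigned subleaf = 0, eax, ebx, ecx, edx;
    unsigned smt_shift = 0, core_shift = 0;

    while (__get_cpuid_count(0x1F, subleaf, &eax, &ebx, &ecx, &edx) && ebx != 0) {
        unsigned shift = eax & 0x1F;          /* right-shift count reported by this sub-leaf */
        unsigned type  = (ecx >> 8) & 0xFF;   /* domain type of this sub-leaf */

        if (type == 1)                        /* logical processor (SMT) domain */
            smt_shift = shift;
        else if (type == 2)                   /* core domain */
            core_shift = shift;
        subleaf++;
    }
    *smt_mask  = (1u << smt_shift) - 1;
    *core_mask = ((1u << core_shift) - 1) ^ *smt_mask;
    *pkg_mask  = ~((1u << core_shift) - 1);
}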
b. Derive the extraction bitmask for processor cores in a physical processor package and associated mask offset for
different packages.
ECX++;
execute cpuid with EAX = 0BH or 1FH;
}
//
// Treat domains between DIE and physical package as an extension of DIE for software choosing not to implement or recognize
// these unknown domains.
//
return -1;
Figure 10-7. Topological Relationships between Hierarchical IDs in a Hypothetical MP Platform (two packages, each with two cores and two logical processors, T0 and T1, per core; sub-fields shown: PACKAGE_ID, CORE_ID, LOGICAL_PROCESSOR_ID)
Table 10-2. Initial APIC IDs for the Logical Processors in a System that has Four Intel Xeon MP Processors
Supporting Intel Hyper-Threading Technology1
Initial APIC ID PACKAGE_ID CORE_ID LOGICAL_PROCESSOR_ID
0H 0H 0H 0H
1H 0H 0H 1H
2H 1H 0H 0H
3H 1H 0H 1H
4H 2H 0H 0H
5H 2H 0H 1H
6H 3H 0H 0H
7H 3H 0H 1H
NOTE:
1. Because information on the number of processor cores in a physical package was not available in early single-core processors sup-
porting Intel Hyper-Threading Technology, the CORE_ID can be treated as 0.
Table 10-3 shows the initial APIC IDs for a hypothetical situation with a dual processor system. Each physical
package provides two processor cores, and each processor core also supports Intel Hyper-Threading Tech-
nology.
Table 10-3. Initial APIC IDs for the Logical Processors in a System that has Two Physical Processors Supporting
Dual-Core and Intel Hyper-Threading Technology
Initial APIC ID PACKAGE_ID CORE_ID LOGICAL_PROCESSOR_ID
0H 0H 0H 0H
1H 0H 0H 1H
2H 0H 1H 0H
3H 0H 1H 1H
4H 1H 0H 0H
5H 1H 0H 1H
6H 1H 1H 0H
7H 1H 1H 1H
Table 10-4. Example of Possible x2APIC ID Assignment in a System that has Two Physical Processors Supporting
x2APIC and Intel Hyper-Threading Technology
x2APIC ID PACKAGE_ID CORE_ID LOGICAL_PROCESSOR_ID
0H 0H 0H 0H
1H 0H 0H 1H
2H 0H 1H 0H
3H 0H 1H 1H
4H 0H 2H 0H
5H 0H 2H 1H
6H 0H 3H 0H
7H 0H 3H 1H
10H 1H 0H 0H
11H 1H 0H 1H
12H 1H 1H 0H
13H 1H 1H 1H
14H 1H 2H 0H
15H 1H 2H 1H
16H 1H 3H 0H
17H 1H 3H 1H
1. As noted in Section 10.6 and Section 10.9.3, the number of logical processors supported by the OS at runtime may be less than the
total number of logical processors available in the platform hardware.
2. The maximum number of addressable IDs for processor cores in a physical processor is obtained by executing CPUID with EAX=4 and
a valid ECX index. The ECX index starts at 0.
3. The maximum number of addressable IDs for processor cores sharing the target cache level is obtained by executing CPUID with
EAX = 4 and the ECX index corresponding to the target cache level.
4. If the processor does not support CPUID leaf 0BH, each initial APIC ID contains an 8-bit value, and the topology
enumeration parameters needed to derive extraction bit masks are:
a. Query the size of address space for sub IDs that can accommodate logical processors in a physical
processor package. This size parameter (CPUID.1:EBX[23:16]) can be used to derive the width of an
extraction bitmask to enumerate the sub IDs of different logical processors in the same physical package.
b. Query the size of address space for sub IDs that can accommodate processor cores in a physical processor
package. This size parameter can be used to derive the width of an extraction bitmask to enumerate the
sub IDs of processor cores in the same physical package.
c. Query the 8-bit initial APIC ID for the logical processor where the current thread is executing.
d. Derive the extraction bit masks using respective address sizes corresponding to LOGICAL_PROCESSOR_ID,
CORE_ID, and PACKAGE_ID, starting from LOGICAL_PROCESSOR_ID.
e. Apply each extraction bit mask to the 8-bit initial APIC ID to extract sub-field IDs.
Example 10-19. Support Routines for Detecting Hardware Multi-Threading and Identifying the Relationships Between
Package, Core, and Logical Processors
1. Detect support for Hardware Multi-Threading Support in a processor.
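A hedged C sketch of this detection step, assuming GCC/Clang's <cpuid.h>; it checks the HTT feature flag, CPUID.1:EDX[28]:

#include <stdbool.h>
#include <cpuid.h>

static bool HWMTSupported(void)
{
    unsigned eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (edx & (1u << 28)) != 0;   /* HTT: CPUID.1:EBX[23:16] is valid */
}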
Example 10-20. Support Routines for Identifying Package, Core, and Logical Processors from 32-bit x2APIC ID
a. Derive the extraction bitmask for logical processors in a processor core and associated mask offset for different
cores.
return 0;
}
b. Derive the extraction bitmask for processor cores in a physical processor package and associated mask offset for
different packages.
Example 10-21. Support Routines for Identifying Package, Core, and Logical Processors from 8-bit Initial APIC ID
a. Find the size of address space for logical processors in a physical processor package.
//Returns the size of address space of logical processors in a physical processor package;
// Software should not assume the value to be a power of 2.
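A hedged C sketch matching the comment above, assuming GCC/Clang's <cpuid.h> and the HWMTSupported() routine sketched under Example 10-19; it returns CPUID.1:EBX[23:16]:

#include <stdbool.h>
#include <cpuid.h>

bool HWMTSupported(void);             /* see the Example 10-19 sketch above */

unsigned MaxLPIDsPerPackage(void)
{
    unsigned eax, ebx, ecx, edx;

    if (!HWMTSupported())
        return 1;
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    return (ebx >> 16) & 0xFF;        /* maximum addressable IDs for logical processors */
}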
b. Find the size of address space for processor cores in a physical processor package.
// Returns the max number of addressable IDs for processor cores in a physical processor package;
// Software should not assume cpuid reports this value to be a power of 2.
unsigned MaxCoreIDsPerPackage(void)
{
if (!HWMTSupported()) return (unsigned char) 1;
if cpuid supports leaf number 4
{ // we can retrieve multi-core topology info using leaf 4
execute cpuid with eax = 4, ecx = 0
store returned value of eax
return (unsigned) ((reg_eax >> 26) +1);
}
else // must be a single-core processor
return 1;
}
c. Query the initial APIC ID of a logical processor.
// Returns the 8-bit unique initial APIC ID for the processor running the code.
// Software can use OS services to affinitize the current thread to each logical processor
// available under the OS to gather the initial APIC_IDs for each logical processor.
// Returns the mask bit width of a bit field from the maximum count that bit field can represent.
// This algorithm does not assume ‘address size’ to have a value equal to power of 2.
// Address size for LOGICAL_PROCESSOR_ID can be calculated from MaxLPIDsPerPackage()/MaxCoreIDsPerPackage()
// Then use the routine below to derive the corresponding width of logical processor extraction bitmask
// Address size for CORE_ID is MaxCoreIDsPerPackage(),
// Derive the bitwidth for CORE extraction mask similarly
}
return mask_width;
}
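A self-contained C sketch of the mask-width computation whose tail appears above; the address size is not assumed to be a power of 2:

/* Returns the number of mask bits needed to represent MaxCount distinct sub IDs. */
unsigned FindMaskWidth(unsigned MaxCount)
{
    unsigned mask_width = 0;

    while ((1u << mask_width) < MaxCount)   /* smallest width with 2^width >= MaxCount */
        mask_width++;
    return mask_width;
}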
e. Extract a sub ID from an 8-bit full ID, using address size of the sub ID and shift count.
// The routine below can extract LOGICAL_PROCESSOR_ID, CORE_ID, and PACKAGE_ID respectively from the init APIC_ID
// To extract LOGICAL_PROCESSOR_ID, MaxSubIDvalue is set to the address size of LOGICAL_PROCESSOR_ID, Shift_Count = 0
// To extract CORE_ID, MaxSubIDvalue is the address size of CORE_ID, Shift_Count is width of logical processor extraction bitmask.
// Returns the value of the sub ID, this is not a zero-based value
Unsigned char GetSubID(unsigned char Full_ID, unsigned char MaxSubIDvalue, unsigned char Shift_Count)
{
MaskWidth = FindMaskWidth(MaxSubIDvalue);
MaskBits = ((uchar) (FFH << Shift_Count)) ^ ((uchar) (FFH << (Shift_Count + MaskWidth)));
SubID = Full_ID & MaskBits;
Return SubID;
}
Software must not assume local APIC_ID values in an MP system are consecutive. Non-consecutive local APIC_IDs
may be the result of hardware configurations or debug features implemented in the BIOS or OS.
An identifier for each hierarchical domain can be extracted from an 8-bit APIC_ID using the support routines illus-
trated in Example 10-21. The appropriate bit mask and shift value for each
domain must be determined dynamically at runtime.
• To detect the number of processor cores: use CORE_ID to identify those logical processors that reside in the
same core. This is shown in Example 10-23. This example also depicts a technique to construct a mask to
represent the logical processors that reside in the same core.
In Example 10-22, the numerical ID value can be obtained from the value extracted with the mask by shifting it
right by the shift count. The algorithms below do not shift the value. The assumption is that the SubID values can be
compared for equivalence without the need to shift.
// Extract CORE_ID:
// Core Mask is determined in Example 10-20 or Example 10-21
CORE_ID = (APIC_ID & Core Mask);
// Extract PACKAGE_ID:
// Assume single cluster.
// Shift out the mask width for maximum logical processors per package
// Package Mask is determined in Example 10-20 or Example 10-21
PACKAGE_ID = (APIC_ID & Package Mask) ;
}
Example 10-23. Compute the Number of Packages, Cores, and Processor Relationships in a MP System
a) Assemble lists of PACKAGE_ID, CORE_ID, and LOGICAL_PROCESSOR_ID of each enabled logical processors
// The BIOS and/or OS may limit the number of logical processors available to applications after system boot.
// The below algorithm will compute topology for the processors visible to the thread that is computing it.
ThreadAffinityMask = 1;
ProcessorNum = 0;
while (ThreadAffinityMask ≠ 0 && ThreadAffinityMask <= SystemAffinity) {
// Check to make sure we can utilize this processor first.
if (ThreadAffinityMask & SystemAffinity){
Set thread to run on the processor specified in ThreadAffinityMask
Wait if necessary and ensure thread is running on specified processor
LOGICAL_PROCESSOR_ID[ProcessorNum] = LOGICAL_PROCESSOR_ID;
ProcessorNum++;
}
ThreadAffinityMask <<= 1;
}
NumStartedLPs = ProcessorNum;
b) Using the list of PACKAGE_ID to count the number of physical packages in a MP system and construct, for each package, a multi-bit
mask corresponding to those logical processors residing in the same package.
// Compute the number of packages by counting the number of processors with unique PACKAGE_IDs in the PackageID array.
// Compute the mask of processors in each package.
// PackageIDBucket is an array of unique PACKAGE_ID values. Allocate an array of NumStartedLPs count of entries in this array.
// PackageProcessorMask is a corresponding array of the bit mask of processors belonging to the same package, these are
// processors with the same PACKAGE_ID.
// The algorithm below assumes there is symmetry across package boundary if more than one socket is populated in an MP
//system.
// Bucket Package IDs and compute processor mask for every package.
PackageNum = 1;
PackageIDBucket[0] = PackageID[0];
ProcessorMask = 1;
PackageProcessorMask[0] = ProcessorMask;
For (ProcessorNum = 1; ProcessorNum < NumStartedLPs; ProcessorNum++) {
ProcessorMask << = 1;
For (i=0; i < PackageNum; i++) {
// we may be comparing bit-fields of logical processors residing in different
// packages, the code below assume package symmetry
If (PackageID[ProcessorNum] == PackageIDBucket[i]) {
PackageProcessorMask[i] |= ProcessorMask;
Break; // found in existing bucket, skip to next iteration
}
}
if (i == PackageNum) {
//PACKAGE_ID did not match any bucket, start new bucket
PackageIDBucket[i] = PackageID[ProcessorNum];
PackageProcessorMask[i] = ProcessorMask;
PackageNum++;
}
}
// PackageNum has the number of Packages started in OS
// PackageProcessorMask[] array has the processor set of each package
c) Using the list of CORE_ID to count the number of cores in a MP system and construct, for each core, a multi-bit mask corresponding
to those logical processors residing in the same core.
Processors in the same core can be determined by bucketing the processors with the same PACKAGE_ID and CORE_ID. Note that the code
below can bitwise OR the values of the PACKAGE and CORE IDs because they have not been shifted right.
The algorithm below assumes there is symmetry across package boundary if more than one socket is populated in an MP system.
//Bucketing PACKAGE and CORE IDs and computing processor mask for every core
CoreNum = 1;
CoreIDBucket[0] = PackageID[0] | CoreID[0];
ProcessorMask = 1;
CoreProcessorMask[0] = ProcessorMask;
For (ProcessorNum = 1; ProcessorNum < NumStartedLPs; ProcessorNum++) {
ProcessorMask << = 1;
For (i=0; i < CoreNum; i++) {
// we may be comparing bit-fields of logical processors residing in different
// packages, the code below assume package symmetry
If ((PackageID[ProcessorNum] | CoreID[ProcessorNum]) == CoreIDBucket[i]) {
CoreProcessorMask[i] |= ProcessorMask;
Break; // found in existing bucket, skip to next iteration
}
}
if (i == CoreNum) {
//Did not match any bucket, start new bucket
CoreIDBucket[i] = PackageID[ProcessorNum] | CoreID[ProcessorNum];
CoreProcessorMask[i] = ProcessorMask;
CoreNum++;
}
}
// CoreNum has the number of cores started in the OS
// CoreProcessorMask[] array has the processor set of each core
Other processor relationships such as processor mask of sibling cores can be computed from set operations of the
PackageProcessorMask[] and CoreProcessorMask[].
The algorithm shown above can be adapted to work with earlier generations of single-core IA-32 processors that
support Intel Hyper-Threading Technology and in situations where the deterministic cache parameter leaf is not
supported (provided CPUID supports initial APIC ID). A reference code example is available (see Intel® 64 Archi-
tecture Processor Topology Enumeration Technical Paper).
this hint to avoid the memory order violation and prevent the pipeline flush. In addition, the PAUSE instruction de-
pipelines the spin-wait loop to prevent it from consuming execution resources excessively and consuming power
needlessly. (See Section 10.10.6.1, “Use the PAUSE Instruction in Spin-Wait Loops,” for more information about
using the PAUSE instruction with IA-32 processors supporting Intel Hyper-Threading Technology.)
Both instructions rely on the state of the processor’s monitor hardware. The monitor hardware can be either armed
(by executing the MONITOR instruction) or triggered (due to a variety of events, including a store to the monitored
memory region). If, upon execution of MWAIT, the monitor hardware is in a triggered state, MWAIT behaves as a NOP
and execution continues at the next instruction in the execution stream. The state of the monitor hardware is not archi-
tecturally visible except through the behavior of MWAIT.
Multiple events other than a write to the triggering address range can cause a processor that executed MWAIT to
wake up. These include events that would lead to voluntary or involuntary context switches, such as:
• External interrupts, including NMI, SMI, INIT, BINIT, MCERR, A20M#
• Faults, Aborts (including Machine Check)
• Architectural TLB invalidations including writes to CR0, CR3, CR4, and certain MSR writes; execution of LMSW
(occurring prior to issuing MWAIT but after setting the monitor)
• Voluntary transitions due to fast system call and far calls (occurring prior to issuing MWAIT but after setting the
monitor)
Power management related events (such as Thermal Monitor 2 or chipset driven STPCLK# assertion) will not cause
the monitor event pending flag to be cleared. Faults will not cause the monitor event pending flag to be cleared.
Software should not allow for voluntary context switches in between MONITOR/MWAIT in the instruction flow. Note
that execution of MWAIT does not re-arm the monitor hardware. This means that MONITOR/MWAIT need to be
executed in a loop. Also note that exits from the MWAIT state could be due to a condition other than a write to the
triggering address; software should explicitly check the triggering data location to determine if the write occurred.
Software should also check the value of the triggering address following the execution of the monitor instruction
(and prior to the execution of the MWAIT instruction). This check is to identify any writes to the triggering address
that occurred during the course of MONITOR execution.
The address range provided to the MONITOR instruction must be of write-back caching type. Only write-back
memory type stores to the monitored address range will trigger the monitor hardware. If the address range is not
in memory of write-back type, the address monitor hardware may not be set up properly or the monitor hardware
may not be armed. Software is also responsible for ensuring that:
• Writes that are not intended to cause the exit of a busy loop do not write to a location within the address region
being monitored by the monitor hardware,
• Writes intended to cause the exit of a busy loop are written to locations within the monitored address region.
Not doing so will lead to more false wakeups (an exit from the MWAIT state not due to a write to the intended data
location). These have negative performance implications. It might be necessary for software to use padding to
prevent false wakeups. CPUID provides a mechanism for determining the size data locations for monitoring as well
as a mechanism for determining the size of a the pad.
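For example, the following hedged C sketch (using the GCC/Clang <cpuid.h> helper, not SDM reference code) reads the smallest and largest monitor line sizes reported by CPUID leaf 05H, which software can use when sizing the pad:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x05, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 05H not supported");
        return 1;
    }
    /* CPUID.05H: EAX[15:0] = smallest monitor line size (bytes),
     *            EBX[15:0] = largest monitor line size (bytes). */
    printf("Smallest monitor line size: %u bytes\n", eax & 0xFFFFu);
    printf("Largest monitor line size:  %u bytes\n", ebx & 0xFFFFu);
    return 0;
}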
dynamically allocated data buffer for thread synchronization. When the latter technique is not possible, consider
not using MONITOR/MWAIT when using static data structures.
To set up the data structure correctly for MONITOR/MWAIT on multi-clustered systems, interaction between
processors, chipsets, and the BIOS is required (the system coherence line size may depend on the chipset used in the
system; this size could differ from the processor’s monitor triggering area). The BIOS is responsible for setting the
correct value for the system coherence line size using the IA32_MONITOR_FILTER_LINE_SIZE MSR. Depending on the
relative magnitude of the size of the monitor triggering area versus the value written into the IA32_MONITOR_FIL-
TER_LINE_SIZE MSR, the smaller of the two parameters is reported as the Smallest Monitor Line Size and the larger
is reported as the Largest Monitor Line Size.
Spin_Lock:
    CMP lockvar, 0        ;Check if lock is free
    JE Get_Lock
    PAUSE                 ;Short delay
    JMP Spin_Lock
Get_Lock:
    MOV EAX, 1
    XCHG EAX, lockvar     ;Try to get lock
    CMP EAX, 0            ;Test if successful
    JNE Spin_Lock
Critical_Section:
    <critical section code>
    MOV lockvar, 0
    ...
Continue:
The spin-wait loop above uses a “test, test-and-set” technique for determining the availability of the synchroniza-
tion variable. This technique is recommended when writing spin-wait loops.
In IA-32 processor generations earlier than the Pentium 4 processor, the PAUSE instruction is treated as a NOP
instruction.
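For reference, the same “test, test-and-set” pattern can be expressed in C. The sketch below is illustrative only, assuming C11 atomics and the _mm_pause intrinsic, with lock_var as a hypothetical lock variable (0 = free, 1 = held):

#include <stdatomic.h>
#include <immintrin.h>   /* _mm_pause */

static atomic_int lock_var;  /* 0 = free, 1 = held */

static void spin_lock(void)
{
    for (;;) {
        /* Test: spin read-only (with PAUSE) while the lock looks held. */
        while (atomic_load_explicit(&lock_var, memory_order_relaxed) != 0)
            _mm_pause();
        /* Test-and-set: attempt the atomic exchange only when it looked free. */
        if (atomic_exchange_explicit(&lock_var, 1, memory_order_acquire) == 0)
            return;  /* lock acquired */
    }
}

static void spin_unlock(void)
{
    atomic_store_explicit(&lock_var, 0, memory_order_release);
}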
WHILE (1) {
    IF (WorkQueue) THEN {
        // Schedule work at WorkQueue.
    }
    ELSE {
        // No work to do - wait in appropriate C-state handler depending
        // on Idle time accumulated.
        IF (IdleTime >= IdleTimeThreshold) THEN {
            // Call appropriate C1, C2, C3 state handler; C1 handler
            // shown below.
        }
    }
}
// C1 handler uses a HLT instruction.
VOID C1Handler()
{
    STI
    HLT
}
The MONITOR and MWAIT instructions may be considered for use in the C0 idle state loops, if MONITOR and MWAIT are supported.
WHILE (1) {
    IF (WorkQueue) THEN {
        // Schedule work at WorkQueue.
    }
    ELSE {
        // No work to do - wait in appropriate C-state handler depending
        // on Idle time accumulated.
        IF (IdleTime >= IdleTimeThreshold) THEN {
            // Call appropriate C1, C2, C3 state handler; C1 handler
            // shown below.
            MONITOR WorkQueue    // Set up EAX with the WorkQueue
                                 // linear address; ECX, EDX = 0.
            IF (WorkQueue = 0) THEN {
                MWAIT
            }
        }
    }
}
// C1 handler uses a HLT instruction.
VOID C1Handler()
{
    STI
    HLT
}
WHILE (1) {
    IF (WorkQueue) THEN {
        // Schedule work at WorkQueue.
    }
    ELSE {
        // No work to do - wait in appropriate C-state handler depending
        // on Idle time accumulated.
        IF (IdleTime >= IdleTimeThreshold) THEN {
            // Call appropriate C1, C2, C3 state handler; C1 handler
            // shown below.
        }
    }
}
VOID C1Handler()
1. Excessive transitions into and out of the HALT state could also incur performance penalties. Operating systems should evaluate the
performance trade-offs for their operating system.
10.10.6.5 Guidelines for Scheduling Threads on Logical Processors Sharing Execution Resources
Because logical processors in the same processor core share execution resources, the order in which threads are
dispatched to logical processors for execution can affect the overall efficiency of a system. The following guidelines
are recommended for scheduling threads for execution.
• Dispatch threads to one logical processor per processor core before dispatching threads to the other logical
processor sharing execution resources in the same processor core.
• In an MP system with two or more physical packages, distribute threads across all the physical processors
rather than concentrating them in one or two physical processors.
• Use processor affinity to assign a thread to a specific processor core or package, depending on the cache-
sharing topology. This practice increases the chance that the processor’s caches will contain some of the
thread’s code and data when it is dispatched for execution after being suspended; a sketch of setting affinity
with an OS API follows this list.
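The affinity guideline above might be applied with an OS-specific API. The following Linux-only C sketch is illustrative and not part of the SDM; cpu_index is a hypothetical parameter chosen by the scheduler from the cache-sharing topology.

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single logical processor. Returns 0 on
   success, -1 on failure (errno is set by sched_setaffinity). */
static int pin_to_logical_processor(int cpu_index)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu_index, &mask);
    /* pid 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(mask), &mask);
}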
For BIPI messages, the lower 4 bits of the vector field contain the APIC ID of the processor issuing the message and
the upper 4 bits contain the “generation ID” of the message. All P6 family processors have a generation ID of 4H.
BIPIs will therefore use vector values ranging from 40H to 4EH (4FH cannot be used because FH is not a valid APIC
ID).
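A minimal C sketch of that encoding follows; the helper name is illustrative only, not SDM code.

#include <stdint.h>

/* Compose a P6-family BIPI vector: generation ID 4H in the upper four
   bits, the issuing processor's APIC ID (0H-EH) in the lower four bits. */
static inline uint8_t bipi_vector(uint8_t apic_id)
{
    return (uint8_t)(0x40u | (apic_id & 0x0Fu));  /* 40H..4EH; FH is invalid */
}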
1. Each processor on the system bus is assigned a unique APIC ID, based on system topology (see Section 10.4.5,
“Identifying Logical Processors in an MP System”). This ID is written into the local APIC ID register for each
processor.
2. Each processor executes its internal BIST simultaneously with the other processors on the system bus. Upon
completion of the BIST (at T0), each processor broadcasts a BIPI to “all including self” (see Figure 10-8).
3. APIC arbitration hardware causes all the APICs to respond to the BIPIs one at a time (at T1, T2, T3, and T4).
4. When the first BIPI is received (at time T1), each APIC compares the four least significant bits of the BIPI’s
vector field with its APIC ID. If the vector and APIC ID match, the processor selects itself as the BSP by setting
the BSP flag in its IA32_APIC_BASE MSR. If the vector and APIC ID do not match, the processor selects itself
as an AP by entering the “wait for SIPI” state. (Note that in Figure 10-8, the BIPI from processor 1 is the first
BIPI to be handled, so processor 1 becomes the BSP.)
5. The newly established BSP broadcasts an FIPI message to “all including self.” The FIPI is guaranteed to be
handled only after the completion of the BIPIs that were issued by the non-BSP processors.
[Figure 10-8: APIC bus activity at times T0 through T5; processor 1 becomes the BSP.]
6. After the BSP has been established, the outstanding BIPIs are received one at a time (at T2, T3, and T4) and
ignored by all processors.
7. When the FIPI is finally received (at T5), only the BSP responds to it. It responds by fetching and executing
BIOS boot-strap code, beginning at the reset vector (physical address FFFF FFF0H).
8. As part of the boot-strap code, the BSP creates an ACPI table and an MP table and adds its initial APIC ID to
these tables as appropriate.
9. At the end of the boot-strap procedure, the BSP broadcasts a SIPI message to all the APs in the system. Here,
the SIPI message contains a vector to the BIOS AP initialization code (at 000VV000H, where VV is the vector
contained in the SIPI message).
10. All APs respond to the SIPI message by racing to a BIOS initialization semaphore. The first one to the
semaphore begins executing the initialization code. (See MP init code for semaphore implementation details.)
As part of the AP initialization procedure, the AP adds its APIC ID number to the ACPI and MP tables as appro-
priate. At the completion of the initialization procedure, the AP executes a CLI instruction (to clear the IF flag in
the EFLAGS register) and halts itself.
11. When each of the APs has gained access to the semaphore and executed the AP initialization code, and all have
written their APIC IDs into the appropriate places in the ACPI and MP tables, the BSP establishes a count of the
number of processors connected to the system bus, completes executing the BIOS boot-strap code, and then
begins executing operating-system boot-strap and start-up code.
12. While the BSP is executing operating-system boot-strap and start-up code, the APs remain in the halted state.
In this state they will respond only to INITs, NMIs, and SMIs. They will also respond to snoops and to assertions
of the STPCLK# pin.
See Section 10.4.4, “MP Initialization Example,” for an annotated example of the use of the MP protocol to boot IA-32
processors in an MP system. This code should run on any IA-32 processor that uses the MP protocol.
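System software can tell whether it is running on the BSP or an AP by examining the BSP flag (bit 8) of the IA32_APIC_BASE MSR (address 1BH), which the protocol above sets on the BSP. The following ring-0-only C sketch is illustrative; the helper names and inline-assembly form are assumptions, not SDM reference code.

#include <stdint.h>

/* Read a model-specific register with RDMSR (privilege level 0 only). */
static inline uint64_t read_msr(uint32_t msr)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

/* Nonzero if the executing processor is the bootstrap processor. */
static inline int is_bootstrap_processor(void)
{
    return (int)((read_msr(0x1B) >> 8) & 1);   /* IA32_APIC_BASE.BSP */
}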
18. Updates to Chapter 18, Volume 3B
Change bars and violet text show changes to Chapter 18 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3B: System Programming Guide, Part 2.
------------------------------------------------------------------------------------------
Changes to this chapter:
• Updated Section 18.10, “Incremental Decoding Information: Processor Family with CPUID
DisplayFamily_DisplayModel Signature 06_5FH, Machine Error Codes For Machine Check,” to correct the
register banks that error codes are reported in from IA32_MC6 and IA32_MC7 to IA32_MC7 and IA32_MC8.
Similar updates were made to Section 18.10.1, “Integrated Memory Controller Machine Check Errors.”
Encoding of the model-specific and other information fields is different across processor families. The differences
are documented in the following sections.
These errors are reported in the IA32_MCi_STATUS MSRs. They are reported architecturally as compound errors
with a general form of 0000 1PPT RRRR IILL in the MCA error code field. See Chapter 17 for information on the
interpretation of compound error codes. Incremental decoding information is listed in Table 18-2.
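As a worked illustration of the field layout in Table 18-2, the following hedged C sketch extracts the model-specific fields from a raw IA32_MCi_STATUS value; the structure and field names are illustrative shorthand, not from the SDM.

#include <stdint.h>

typedef struct {
    uint16_t mca_error_code;   /* bits 15:0, architecturally defined   */
    uint8_t  bq_request_type;  /* bits 24:19, Bus Queue Request Type   */
    uint8_t  bq_error_type;    /* bits 27:25, Bus Queue Error Type     */
    uint8_t  ecc_syndrome;     /* bits 54:47, ECC Syndrome             */
} mc_status_fields_t;

static mc_status_fields_t decode_mc_status(uint64_t status)
{
    mc_status_fields_t f;
    f.mca_error_code  = (uint16_t)(status & 0xFFFFu);
    f.bq_request_type = (uint8_t)((status >> 19) & 0x3Fu);
    f.bq_error_type   = (uint8_t)((status >> 25) & 0x7u);
    f.ecc_syndrome    = (uint8_t)((status >> 47) & 0xFFu);
    return f;
}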
Table 18-2. Incremental Decoding Information: Processor Family 06H Machine Error Codes for Machine Check
Type Bit No. Bit Function Bit Description
MCA Error Codes1 15:0
Model Specific Errors 18:16 Reserved Reserved
24:19 Bus Queue Request Type
000000: BQ_DCU_READ_TYPE error.
000010: BQ_IFU_DEMAND_TYPE error.
000011: BQ_IFU_DEMAND_NC_TYPE error.
000100: BQ_DCU_RFO_TYPE error.
000101: BQ_DCU_RFO_LOCK_TYPE error.
000110: BQ_DCU_ITOM_TYPE error.
001000: BQ_DCU_WB_TYPE error.
001010: BQ_DCU_WCEVICT_TYPE error.
001011: BQ_DCU_WCLINE_TYPE error.
001100: BQ_DCU_BTM_TYPE error.
001101: BQ_DCU_INTACK_TYPE error.
001110: BQ_DCU_INVALL2_TYPE error.
001111: BQ_DCU_FLUSHL2_TYPE error.
010000: BQ_DCU_PART_RD_TYPE error.
010010: BQ_DCU_PART_WR_TYPE error.
010100: BQ_DCU_SPEC_CYC_TYPE error.
011000: BQ_DCU_IO_RD_TYPE error.
011001: BQ_DCU_IO_WR_TYPE error.
011100: BQ_DCU_LOCK_RD_TYPE error.
011110: BQ_DCU_SPLOCK_RD_TYPE error.
011101: BQ_DCU_LOCK_WR_TYPE error.
27:25 Bus Queue Error Type 000: BQ_ERR_HARD_TYPE error.
001: BQ_ERR_DOUBLE_TYPE error.
010: BQ_ERR_AERR2_TYPE error.
100: BQ_ERR_SINGLE_TYPE error.
101: BQ_ERR_AERR1_TYPE error.
28 FRC Error 1 if FRC error active.
29 BERR 1 if BERR is driven.
30 Internal BINIT 1 if BINIT driven for this processor.
31 Reserved Reserved
Other Information 34:32 Reserved Reserved
35 External BINIT 1 if BINIT is received from external bus.
36 Response Parity Error This bit is asserted in IA32_MCi_STATUS if this component has received a parity
error on the RS[2:0]# pins for a response transaction. The RS signals are checked
by the RSP# external pin.
37 Bus BINIT This bit is asserted in IA32_MCi_STATUS if this component has received a hard
error response on a split transaction (one access that needed to be split across
the 64-bit external bus interface into two accesses).
38 Timeout BINIT This bit is asserted in IA32_MCi_STATUS if this component has experienced a ROB
time-out, which indicates that no micro-instruction has been retired for a
predetermined period of time.
A ROB time-out occurs when the 15-bit ROB time-out counter carries a 1 out of its
high-order bit.2 The timer is cleared when a micro-instruction retires, an exception
is detected by the core processor, RESET is asserted, or when a ROB BINIT occurs.
The ROB time-out counter is prescaled by the 8-bit PIC timer, which divides the
bus clock by 128 (the bus clock is 1:2, 1:3, or 1:4 of the core clock3). When a carry
out of the 8-bit PIC timer occurs, the ROB counter counts up by one. While this bit is
asserted, it cannot be overwritten by another error.
41:39 Reserved Reserved
42 Hard Error This bit is asserted in IA32_MCi_STATUS if this component has initiated a bus
transaction that has received a hard error response. While this bit is asserted, it
cannot be overwritten.
43 IERR This bit is asserted in IA32_MCi_STATUS if this component has experienced a
failure that causes the IERR pin to be asserted. While this bit is asserted, it cannot
be overwritten.
44 AERR This bit is asserted in IA32_MCi_STATUS if this component has initiated 2 failing
bus transactions which have failed due to Address Parity Errors (AERR asserted).
While this bit is asserted, it cannot be overwritten.
45 UECC The Uncorrectable ECC error bit is asserted in IA32_MCi_STATUS for uncorrected
ECC errors. While this bit is asserted, the ECC syndrome field will not be
overwritten.
46 CECC The correctable ECC error bit is asserted in IA32_MCi_STATUS for corrected ECC
errors.
54:47 ECC Syndrome The ECC syndrome field in IA32_MCi_STATUS contains the 8-bit ECC syndrome only
if the error was a correctable/uncorrectable ECC error and there wasn't a previous
valid ECC error syndrome logged in IA32_MCi_STATUS.
A previous valid ECC error in IA32_MCi_STATUS is indicated by
IA32_MCi_STATUS.bit45 (uncorrectable error occurred) being asserted. After
processing an ECC error, machine check handling software should clear
IA32_MCi_STATUS.bit45 so that future ECC error syndromes can be logged.
56:55 Reserved Reserved
Status Register Validity Indicators1 63:57
NOTES:
1. These fields are architecturally defined. Refer to Chapter 17, “Machine-Check Architecture,” for more information.
2. For processors with a CPUID signature of 06_0EH, a ROB time-out occurs when the 23-bit ROB time-out counter carries a 1 out of its
high order bit.
3. For processors with a CPUID signature of 6_06_60H and later, the PIC timer will count crystal clock cycles.
Table 18-3. CPUID DisplayFamily_DisplayModel Signatures for Processors Based on Intel® Core™ Microarchitecture
DisplayFamily_DisplayModel Processor Families/Processor Number Series
06_1DH Intel® Xeon® Processor 7400 series
06_17H Intel® Xeon® Processor 5200, 5400 series, Intel® Core™ 2 Quad processor Q9650
06_0FH Intel® Xeon® Processor 3000, 3200, 5100, 5300, 7300 series, Intel® Core™ 2 Quad, Intel® Core™ 2
Extreme, Intel® Core™ 2 Duo processors, Intel Pentium dual-core processors
Table 18-4. Incremental Bus Error Codes of Machine Check for Processors
Based on Intel® Core™ Microarchitecture
Type Bit No. Bit Function Bit Description
MCA Error Codes1 15:0
Model Specific Errors 18:16 Reserved Reserved
24:19 Bus Queue Request Type
000001: BQ_PREF_READ_TYPE error.
000000: BQ_DCU_READ_TYPE error.
000010: BQ_IFU_DEMAND_TYPE error.
000011: BQ_IFU_DEMAND_NC_TYPE error.
000100: BQ_DCU_RFO_TYPE error.
000101: BQ_DCU_RFO_LOCK_TYPE error.
000110: BQ_DCU_ITOM_TYPE error.
001000: BQ_DCU_WB_TYPE error.
001010: BQ_DCU_WCEVICT_TYPE error.
001011: BQ_DCU_WCLINE_TYPE error.
001100: BQ_DCU_BTM_TYPE error.
001101: BQ_DCU_INTACK_TYPE error.
001110: BQ_DCU_INVALL2_TYPE error.
001111: BQ_DCU_FLUSHL2_TYPE error.
010000: BQ_DCU_PART_RD_TYPE error.
010010: BQ_DCU_PART_WR_TYPE error.
010100: BQ_DCU_SPEC_CYC_TYPE error.
011000: BQ_DCU_IO_RD_TYPE error.
011001: BQ_DCU_IO_WR_TYPE error.
011100: BQ_DCU_LOCK_RD_TYPE error.
011110: BQ_DCU_SPLOCK_RD_TYPE error.
011101: BQ_DCU_LOCK_WR_TYPE error.
100100: BQ_L2_WI_RFO_TYPE error.
100110: BQ_L2_WI_ITOM_TYPE error.
27:25 Bus Queue Error Type
001: Address Parity Error.
010: Response Hard Error.
011: Response Parity Error.
28 MCE Driven 1 if MCE is driven.
29 MCE Observed 1 if MCE is observed.
30 Internal BINIT 1 if BINIT driven for this processor.
31 BINIT Observed 1 if BINIT is observed for this processor.
Other Information 33:32 Reserved Reserved
34 PIC and FSB Data Parity Data Parity detected on either PIC or FSB access.
35 Reserved Reserved
36 Response Parity Error This bit is asserted in IA32_MCi_STATUS if this component has received a parity
error on the RS[2:0]# pins for a response transaction. The RS signals are checked
by the RSP# external pin.
37 FSB Address Parity Address parity error detected:
1: Address parity error detected.
0: No address parity error.
38 Timeout BINIT This bit is asserted in IA32_MCi_STATUS if this component has experienced a ROB
time-out, which indicates that no micro-instruction has been retired for a
predetermined period of time.
A ROB time-out occurs when the 23-bit ROB time-out counter carries a 1 out